TRAFFIC SCENE ANOMALY DETECTION
LUI CAI AN
A project report submitted in partial fulfilment of the
requirements for the award of Bachelor of Engineering
(Hons.) Software Engineering
Lee Kong Chian Faculty of Engineering and Science
Universiti Tunku Abdul Rahman
April 2017
DECLARATION
I hereby declare that this project report is based on my original work except for
citations and quotations which have been duly acknowledged. I also declare that it has
not been previously or concurrently submitted for any other degree or award at
UTAR or other institutions.
Signature :
Name : Lui Cai An
ID No. : 1305946
Date :
APPROVAL FOR SUBMISSION
I certify that this project report entitled “TRAFFIC SCENE ANOMALY
DETECTION” was prepared by LUI CAI AN and has met the required standard for
submission in partial fulfilment of the requirements for the award of Bachelor of
Science (Hons.) Software Engineering at Universiti Tunku Abdul Rahman.
Approved by,
Signature :
Supervisor : Dr. Tay Yong Haur
Date :
Signature :
Co-Supervisor :
Date :
The copyright of this report belongs to the author under the terms of the
Copyright Act 1987 as qualified by the Intellectual Property Policy of Universiti Tunku
Abdul Rahman. Due acknowledgement shall always be made of the use of any material
contained in, or derived from, this report.
© 2017, Lui Cai An. All rights reserved.
ACKNOWLEDGEMENTS
I would like to thank everyone who contributed to the successful completion of
this project. I would like to express my gratitude to my research supervisor, Dr. Tay
Yong Haur, for his invaluable advice, guidance and enormous patience throughout
the development of the research.
Special thanks to Chong Yong Shean, who shared a great deal of experience and
insight related to this project. Several discussions and sharing sessions allowed me to
understand the project much better; without her, each stage of this project would have
been considerably more difficult.
Lastly, I would like to thank my family members and friends, who provided
support and encouragement throughout the entire project. Without their positive
support and encouragement, the project would not have been completed on schedule.
ABSTRACT
Abstracting and inspecting meaningful activities from long-hour video is very
challenging. The traditional approach to video analytics and anomaly detection is
rule-based: a set of rules is predefined as the pool of meaningful events to be detected.
The rule-based approach limits detection performance, often triggers false alarms,
requires extensive configuration for each particular scenario and requires heavy
maintenance after setup. To overcome the limitations of the traditional approach, we
propose a data-driven deep learning approach using unlabelled data, training a
spatiotemporal autoencoder model on a dataset of normal traffic scenes. The
autoencoder model combines a Convolutional Neural Network (CNN) and Long
Short-Term Memory (LSTM) to learn the spatial and temporal characteristics of the
video dataset. This unsupervised approach minimizes the dependency on humans, as
only limited human supervision is required. On the Plus Highway dataset, the results
show that the model is able to perform anomaly detection.
TABLE OF CONTENTS
DECLARATION
APPROVAL FOR SUBMISSION
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS / ABBREVIATIONS
CHAPTER
1 INTRODUCTION
1.1 Background
1.2 Problem Statement
1.2.1 Tons of Data Generated from Time to Time
1.2.2 Existing Video Analytic Tools Often Trigger False Alarms and Are Not Adaptive (Dalli n.d.)
1.2.3 Existing Video Analytic Tools Involve Complicated Configuration during Setup and Are Difficult to Maintain (Narciso 2014)
1.3 Project Objective
1.4 Scope
1.4.1 Deliverable
1.4.2 Modules Covered
1.4.3 Modules Not Covered
2 RELATED WORK
2.1 Existing Approaches
2.1.1 Rule-Based Approach Detection
2.2 Deep Learning Technique
2.2.1 Convolutional Neural Network (CNN)
2.2.2 Long Short-Term Memory (LSTM)
2.2.3 Building Autoencoder
2.2.4 Application of Autoencoder in Anomaly Detection
2.3 Summary
3 PROPOSED SOLUTION
3.1 Deep Learning Approach via Data-Driven Approach
3.2 System Overview
3.3 Technology and Techniques Involved
3.3.1 Deep Learning Framework
3.3.2 Autoencoder Model Architecture
3.3.3 Datasets Pre-processing
3.3.4 Data Formatting
3.4 Evaluation Method
3.4.1 Training Phase Evaluation
3.4.2 Classification Phase Evaluation
4 EXPERIMENT, RESULT AND ANALYSIS
4.1 Dataset
4.1.1 Dataset Cropping
4.1.2 Data Augmentation
4.2 Batch Processing
4.3 Evaluation Phase
4.3.1 Training Phase Visualization
4.3.2 Reconstruction Error
4.3.3 Normalization
4.3.4 Regularity Score
4.4 Result
4.4.1 Video Version 1
4.4.2 Video Version 2
4.4.3 Video Version 3
4.5 Ground Truth Labelling Method
4.6 Video Average Handling Method
4.7 Summary
4.7.1 Result Summary
4.7.2 Model Summary
5 CONCLUSION AND RECOMMENDATIONS
5.1 Conclusion
5.2 Recommendations for Future Work
5.2.1 Deeper Neural Network for Autoencoder Model
5.2.2 Train Model with Larger Dataset
5.2.3 Enhance Data Augmentation Technique
5.2.4 Dataset Expansion Covering Different Natural Situations
5.2.5 Training Approach for Handling Different Situations
REFERENCES
LIST OF TABLES
Table 3.1: Confusion Matrix
Table 3.2: Confusion Matrix with Description
Table 4.1: Result Summary
LIST OF FIGURES
Figure 1.1: Control Room with Many CCTV Monitors
Figure 1.2: Video Analytic Tool using 2D Virtual Fencing Approach (Facial Recognition & Video Analytic Software 2016)
Figure 1.3: Surveillance Detected by Virtual Fencing Approach (Facial Recognition & Video Analytic Software 2016)
Figure 2.1: Repeating Module in a Standard RNN
Figure 2.2: Repeating Module in an LSTM Containing 4 Layers
Figure 2.3: Autoencoder Sample (Chollet 2016)
Figure 3.1: Sample Traffic Scene with Normal and Abnormal Driving Direction
Figure 3.2: System Overview
Figure 3.3: Model Architecture
Figure 3.4: Grayscale Image
Figure 3.5: Cropping Image
Figure 3.6: Image Resize
Figure 3.7: Image Frames Stored in Volumes
Figure 3.8: ROC Curve
Figure 4.1: Sample Normal Traffic Scene
Figure 4.2: Original Traffic Scene
Figure 4.3: Static Area Highlighted in Blue
Figure 4.4: Cropped Traffic Scene
Figure 4.5: Job Killed Due to Memory Limit
Figure 4.6: Training Epochs Loss Value
Figure 4.7: Regularity Score Calculated for Video Version 1
Figure 4.8: Large Vehicle Detected in Video Version 1
Figure 4.9: Bus Stopping Event Detected in Video Version 1
Figure 4.10: Other Bus Stopping Events Detected in Video Version 1
Figure 4.11: ROC Curve for Video Version 1
Figure 4.12: Regularity Score Calculated for Video Version 2
Figure 4.13: Taxi Stopping Event Detected in Video Version 2
Figure 4.14: Bus Stopping Event and Vehicle Stopping Behind Signboard in Video Version 2
Figure 4.15: Large Vehicle Event Detected in Video Version 2
Figure 4.16: Bus Stopping Event Detected in Video Version 2
Figure 4.17: Traffic Condition with Fewer Cars
Figure 4.18: ROC Curve for Video Version 2
Figure 4.19: Regularity Score Calculated for Video Version 3
Figure 4.20: Large Vehicle Detected in Blur Condition in Video Version 3
Figure 4.21: Large Vehicle Detected in Video Version 3
Figure 4.22: Two Large Vehicles Detected in Video Version 3
Figure 4.23: Scene with Light Reflection and Shadow in Video Version 3
Figure 4.24: ROC Curve for Video Version 3
Figure 4.25: Scene with Unobvious Anomaly
Figure 4.26: ROC Curve Improvement with Obvious Scenes Only
Figure 4.27: Overlapping Data Handling
Figure 4.28: Variance of Regularity Score Enhancement for Video Version 1
Figure 4.29: Variance of Regularity Score Enhancement for Video Version 2
Figure 4.30: Variance of Regularity Score Enhancement for Video Version 3
Figure 4.31: ROC Curve Enhancement for Video Version 1
Figure 4.32: ROC Curve Enhancement for Video Version 2
Figure 4.33: ROC Curve Enhancement for Video Version 3
Figure 5.1: Night Traffic Scene
LIST OF SYMBOLS / ABBREVIATIONS
2D 2 Dimensions
3D 3 Dimensions
CCTV Closed-Circuit Television
CNN Convolutional Neural Network
EER Equal Error Rate
HDF5 Hierarchical Data Format 5
ROC Receiver Operating Characteristic
RGB Red, Green, Blue
RNN Recurrent Neural Network
CHAPTER 1
1 INTRODUCTION
1.1 Background
In the current era, CCTV cameras are everywhere, recording every scene every
second.
Monitoring and abstracting meaningful events and activities from long-hour
videos is hectic and challenging: human supervision is required to watch the scenes,
which consumes both time and energy. Humans become tired and dizzy after watching
hours of footage in which the scene stays almost the same unless something unusual
happens. In short, even though meaningful events and activities make up only a small
fraction of a long-hour video, a human still needs to monitor all the video sequences,
most of which are meaningless.
Moreover, in the traditional approach to video analysis, human supervision is
needed to watch, learn and determine the motion patterns in the video scene that
correspond to normal behaviour. A human may nevertheless miss some of the
significant or meaningful events displayed on the monitor due to fatigue and lapses
in attention.
Over the years, more and more video analytic tools have been developed to
minimize this dependency on humans. However, the existing tools are not yet up to
the task: they require complicated configuration and high maintenance, and they often
trigger false alarms, which keeps operators busy handling alarms and leaves the tools
neither effective nor efficient at video analysis.
As technology advances, deep learning has emerged: it can learn patterns from
data and, on some tasks, perform as well as or even better than a human. Using a deep
learning approach, we built a model to detect anomalies in a video scene.
Before the analysis and detection process starts, the model is first trained on
normal video scenes. After training, the experienced model is capable of detecting
abnormal scenes, that is, scenes that never appeared in the normal training videos. The
model learns and calculates a regularity score for each image frame and video scene
using spatial-temporal (space and time) characteristics, so human supervision is not
required to watch all the footage. If a scene is calculated to have a low regularity score,
that scene is something the model has never seen before, and an anomaly alarm is
triggered (Hasan et al., 2016).
By applying the deep learning approach to video analytics, the machine
performs the video analysis instead of a human watching all the long-hour videos one
by one, minimizing the workload and dependency on humans. In return, better and
higher-end hardware is required to perform the analysis quickly.
1.2 Problem Statement
1.2.1 Tons of Data Generated from Time to Time
In the current era, surveillance cameras are everywhere, recording video every
second and generating more and more data. Most of this footage is recorded just for
the sake of recording; only a very small fraction of it is ever processed and analysed.
At the same time, across a long-hour video, only a small portion of the data is
meaningful, and it is difficult, if not impossible, for a human to watch the full video
and pick out the meaningful scenes.
Figure 1.1: Control Room with Many CCTV Monitors
Figure 1.1 shows a control room with many CCTV monitors, each displaying
the current situation of a scene. As shown in the figure, there are many monitors that
a human needs to supervise, and it is impossible for the human eye to pay attention to
every monitor at the same moment. Humans also become tired, dizzy and fatigued
after watching long hours of video. Research suggests that the human attention span
drops to 20 minutes or less, degrading the ability to distinguish normal from abnormal
video scenes (Kohn 2014).
With the traditional approach, sustained human supervision is required to
identify when any abnormal activity occurs. With too many monitors to watch at the
same time, control room staff will miss some abnormal scenes and activities, as it is
difficult to focus on every scene at once.
In another situation, when a crime has already happened and the police want to
replay the recorded video to examine the crime scene, they must scrub through hours
of footage to find the moment the crime occurred, which is very time consuming and
tiring.
1.2.2 Existing Video Analytic Tools Often Trigger False Alarms and Are Not
Adaptive (Dalli n.d.)
There are existing video analytic tools on the market, but they often trigger false
alarms and do not adapt to the real environment.
Figure 1.2: Video Analytic Tool using 2D Virtual Fencing Approach (Facial
Recognition & Video Analytic Software 2016)
Figure 1.3: Surveillance Detected by Virtual Fencing Approach (Facial Recognition
& Video Analytic Software 2016)
The video analytic tool in Figure 1.2 uses the virtual fencing approach, one of
the rule-based approaches. It creates a virtual fence and triggers an alert when an object
such as a vehicle, a person or an abnormal object moves across the fence. Anything
that breaches the virtual fence (Figure 1.3) triggers an abnormality alarm, which may
or may not be correct.
Moreover, the virtual fence created in Figure 1.2 is applicable to that particular
scene only. To apply the system to another situation, a new trip wire needs to be
designed and created based on the behaviour and use case of that situation. As a result,
the virtual fencing approach is not adaptive, and configuration is required for every
new camera scene. In addition, the virtual fencing approach only detects objects that
cross the trip wire; any abnormal event that happens without breaching the trip wire
will not be detected.
1.2.3 Existing Video Analytic Tools Involve Complicated Configuration during
Setup and Are Difficult to Maintain (Narciso 2014)
Setting up a video analytic tool for even one scene is not simple or direct; the process
requires extensive configuration (Honovich, 2008). As shown in Figure 1.2, a virtual
fence created for anomaly detection is applicable to that particular scene only, and a
new setup process is required whenever the scene is different.
At the same time, existing video analytic tools carry a high maintenance cost
for handling false alarms. Nature produces normal events such as wind blowing,
weather changes, light intensity changes and other environmental changes (Honovich,
2008). These events are normal to a human being but abnormal to the machine: any
such change may be flagged as abnormal by the tool as long as it crosses the trip wire.
As a result, the tool keeps triggering false alarms.
Maintenance is required to keep up with these environmental conditions and
events, but it is hectic, expensive and time consuming.
1.3 Project Objective
To develop a traffic scene anomaly detector using a deep learning approach
To demonstrate the concept and approach of deep learning and autoencoders in
detecting anomalies in traffic scenes
To calculate the accuracy and evaluate the effectiveness of the detector in
performing anomaly detection
1.4 Scope
1.4.1 Deliverable
In this project, a traffic scene anomaly detector will be developed using a deep
learning approach. Before detection starts, the detector will be trained on a series of
actual traffic videos consisting of normal traffic scenes only.
After the detector is trained, an unseen traffic scene will be used as the testing
video dataset. During the detection phase, the system performs back-end calculations
and displays the abnormal scenes, that is, video scenes with a low regularity score
that the model has not seen before.
1.4.2 Modules Covered
Autoencoder model implementation
Pre-processing of input video scene
Traffic scene anomaly detection
Back-end calculation and evaluation method
1.4.3 Modules Not Covered
Night scene anomaly detection
Weather change scene anomaly detection
Classification of the type of anomaly detected
Object detection
Movable/rotating camera scenes
CHAPTER 2
2 RELATED WORK
2.1 Existing Approaches
In video analytics and surveillance detection market, there are some existing tools
using rule based approach.
2.1.1 Rule-Based Approach Detection
With rule-based detection, the developer needs to define a pool of rules describing the
situations to be caught. Different scenes have different natural events, so each scene
needs its own pool of rules. Conversely, if an abnormal event occurs that does not
belong to the defined pool of rules, the system will not trigger and no anomaly is
detected, even though an abnormal event has happened. Performing anomaly detection
via the rule-based approach indirectly leads to three major problems (Honovich, 2008).
Expensive setup cost and extensive configuration
Every scene has its own nature and its own events, so every scene has different
abnormal events to be detected. With the rule-based approach, a new pool of rules
needs to be defined for each new scene. In addition, a lot of back-end configuration
and algorithms need to be set up for the particular camera before the detection
process can start. Setting up a rule-based video analytic tool can take hours or days
(Narciso 2014).
High maintenance cost and periodic maintenance
As mentioned before, the configuration is designed and defined for the current
scene view only, so the camera needs to remain static. If the camera moves or
shakes, the analytical performance of the configured rules is affected; even a very
small camera movement can make the pool of rules inaccurate, so maintenance is
required. Likewise, a massive change in the scene background causes the same
problem, and the defined rules need to be updated and maintained (Honovich,
2008).
Frequent false alarms
A system that often triggers false alarms is practically ineffective and inefficient,
as it keeps humans busy handling alarms for what are actually normal scenes.
One condition that triggers false alarms is a change in the weather (Honovich,
2008). The sun rises in the east and sets in the west, so from morning to evening
the sun's position changes continuously and the shadows of objects move
proportionally. When a shadow moves and happens to cut the trip wire, a false
alarm is triggered. To a human, a moving shadow is perfectly normal, but the
machine does not know what a shadow is; it only sees something crossing the trip
wire and triggers a false alarm.
2.1.1.1 Virtual Fencing Approach
The virtual fencing technique is one of the rule-based approaches: a virtual fence or
trip wire is designed and created for a particular camera scene. As shown in Figure 1.2,
trip wires are implemented for surveillance and anomaly detection (Facial Recognition
& Video Analytic Software 2016). The machine flags an event as an anomaly when
something breaches the trip wire. The trip wire is meant to handle the real 3D scene,
but the camera can only see in 2D. In nature there are situations such as a bird flying
past; the 2D camera view sees this as an object passing through the trip wire and
triggers a false alarm. Many exclusions need to be made to maximize performance.
In short, virtual fencing inherits the unavoidable problems of the rule-based approach:
false alarms and tedious configuration.
Virtual fencing is suitable for situations with fewer events, for example animal
control within a barn (Butler et al., 2006). In an animal barn far fewer events occur
than in a traffic scene with its extraneous, complicated and unexpected changes. The
trip wire is set up at the fence, and the situation is simple: if an animal crosses the
fence, the alarm is triggered. In short, virtual fencing is not capable of handling the
vast changes of the traffic problem and environment, and it is not suitable for general
video analytics.
2.2 Deep Learning Technique
The deep learning approach is able to overcome the problems posed by the rule-based
approach.
2.2.1 Convolutional Neural Network (CNN)
A convolutional neural network is a feedforward network built from convolution
kernels: the input layer receives the input data, which is convolved to learn features,
producing a tensor of outputs (Keras, 2016). When an input image is fed into the
convolutional network, the network extracts the features of the image and stores
information about each pixel in a feature map.
When creating a convolutional neural network, hyperparameters such as the
number of filters, the filter width and height, and the stride need to be set before
training starts. The hyperparameters of each layer determine the output size of that
layer and hence the input size of the next layer.
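As a small illustration, the Keras sketch below (a minimal sketch assuming the
Keras 2 API; the layer sizes are illustrative, not this project's final configuration)
shows how the filter count, kernel size and stride of one layer fix the input shape
of the next:

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
# 128 filters of 11 x 11 sliding with stride 4 over a 227 x 227 grayscale frame
model.add(Conv2D(128, (11, 11), strides=(4, 4), padding='valid',
                 activation='relu', input_shape=(227, 227, 1)))
model.summary()  # output shape (None, 55, 55, 128) becomes the next layer's input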
2.2.2 Long Short-Term Memory (LSTM)
Long Short-Term Memory is an improved version of the RNN that is able to learn
long-term dependencies. Like an RNN, an LSTM is a chain of recursive neural
network modules; in a standard RNN, each repeating module contains a single tanh
activation layer.
Figure 2.1: Repeating Module in a Standard RNN
Figure 2.2: Repeating Module in an LSTM Containing 4 Layers
The repeating module in an LSTM, by contrast, contains four neural network
layers, a different structure from the standard RNN with its single layer. This
architecture allows the LSTM to learn, understand and remember information and
trends, accumulated over long or short durations, without them being “forgotten” and
discarded.
2.2.3 Building Autoencoder
The autoencoder is a model architecture well suited to anomaly detection.
Autoencoding consists of a compression and a decompression process used for
learning; the compression and decompression stages are known as the encoder and
the decoder respectively.
Figure 2.3: Autoencoder Sample (Chollet 2016)
As shown in Figure 2.3, an image is fed into the encoder. The encoder
compresses the image into smaller and smaller feature maps that retain the important
information. After learning in this compact representation, the decoder expands the
feature maps back to the size of the input image, reconstructing an output image of
the same size based on what was learnt during compression.
To evaluate the learning process, the output is compared with the input. The
loss value measures how much information was lost: a low loss value means the
autoencoder has learnt well. In the example in Figure 2.3, the input image is a
handwritten digit “2”, and the output image has roughly 95% similarity. Before we
can start detecting, the model must first learn well.
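The following minimal convolutional autoencoder sketch, in the spirit of the
Chollet (2016) tutorial cited above, illustrates the encoder/decoder idea; the
28 x 28 input size and the layer sizes are illustrative assumptions:

from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D

inp = Input(shape=(28, 28, 1))
# Encoder: compress into fewer, smaller feature maps
x = Conv2D(16, (3, 3), activation='relu', padding='same')(inp)
x = MaxPooling2D((2, 2), padding='same')(x)
encoded = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
# Decoder: expand back to the input size
x = UpSampling2D((2, 2))(encoded)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(inp, decoded)
# The loss compares reconstruction with input, so no labels are needed
autoencoder.compile(optimizer='adadelta', loss='mse')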
On the other hand, there are several types of autoencoder, and the right one
must be chosen for the use case. For example, to learn a video dataset with spatial-
temporal characteristics, a spatiotemporal autoencoder is needed. At the same time,
it is not trivial to implement the spatiotemporal autoencoder layers so that they fit
and accept volumes of video data. Even when the model fits the data, it will not
necessarily learn the input and reconstruct the output with minimal loss; many
hyperparameters and constraints must be taken care of to design a high-performance
architecture.
2.2.4 Application of Autoencoder in Anomaly Detection
The regularity score is one method used to detect anomalies (Hasan et al., 2016).
Each scene undergoes a regularity score calculation. To visualize the differences, a
graph of regularity score against frame number is used, with scores ranging from 0
to 1. A regularity benchmark can then be set according to the detection sensitivity
required: if a scene has a regularity score lower than the benchmark, the scene is
abnormal. The benchmark must not be set too high, or the sensitivity becomes
excessive and the system constantly triggers false alarms.
By calculating a regularity score for each scene, the system fulfils the purpose
of this project, raising an anomaly alert whenever scenes with a low regularity score
are found.
2.3 Summary
The rule-based approach is not suitable for video surveillance purposes, though it
may be suitable for situations with fewer events. Applying the rule-based approach
to a video surveillance system causes three major problems: high setup cost, high
maintenance cost and false alarms.
The inefficiency of the rule-based approach can be overcome by deep learning.
An autoencoder, one of the deep learning techniques, is able to learn the situation
and rate the abnormality of a scene using back-end calculation. To create an effective,
learnable model, the hyperparameters need to be fine-tuned, and the model is
evaluated via its loss value.
CHAPTER 3
3 PROPOSED SOLUTION
3.1 Deep Learning Approach via Data-Driven Approach
In this project, a deep learning approach will be used to create an autoencoder model
with a convolutional spatiotemporal architecture, following a data-driven approach.
With the existing techniques used in video analytics, extensive configuration
and setup must be done for each camera view (Honovich 2008), including the rules
and cases to be detected or identified. In contrast, there are countless possible events
in any particular scene, both normal and abnormal; the programmer's logic and
algorithms may therefore fail to capture some anomalies. At the same time, the logic
and rules apply only to that particular scene, and a large change in the background,
for example a weather change or a shadow moving with the position of the sun, may
trigger a false alarm even though it is normal in nature (Narciso 2014).
With the deep learning approach, an autoencoder model is developed and
trained on real traffic scene data. Suppose, for example, the model is trained on five
days of real footage, and the traffic in those five days contains only normal scenes.
After five days, the machine is used for classification, taking everything learnt over
those five days as its benchmark. If the machine sees something with a large variance
and difference from the benchmark, the alarm is triggered, as an anomaly has been
detected.
Figure 3.1: Sample Traffic Scene with Normal and Abnormal Driving Direction
Using the example in Figure 3.1, the machine is trained for three days, and all
the traffic scenes in those three days show the normal driving direction only (green
path). After the training phase, the machine is used in the classification phase to
determine abnormal driving behaviour. Any car that does not drive in the normal
direction is considered an anomaly; the red path is a sample abnormal driving
behaviour, and it will be detected as an abnormal event.
This data-driven, environment-driven approach overcomes the problems of the
rule-based approach.
3.2 System Overview
Figure 3.2: System Overview
Figure 3.2 shows the overall system to be implemented. The system starts with the
training phase. Using a video dataset with normal traffic scenes only, each video is
transformed into image frames, which undergo pre-processing. After pre-processing,
the images are converted into Numpy arrays and stored in HDF5 format. At the same
time, the image arrays are stacked into volumes to impose spatial-temporal
characteristics, for example eight frames per volume. After this formatting, the data
is fitted to the autoencoder to train the model. The weights are updated and saved
every epoch, so the model should theoretically become more and more “clever”. To
visualize the model's training performance, a loss graph is used to inspect the loss at
every epoch.
For the classification phase, the video dataset is real, unseen traffic scene data
consisting of normal and abnormal scenes. The data undergoes the same processing
as in the training phase and is formatted as HDF5. The model then uses the trained
weights with the lowest loss value to classify and detect anomalies. Every scene
undergoes the back-end calculation of the regularity score; if a scene has a regularity
score lower than the benchmark, it is considered an anomalous event. Finally, the
predicted output is evaluated using the ROC curve, which records the true positive
rate and false positive rate of the predictions made.
3.3 Technology and Techniques Involved
3.3.1 Deep Learning Framework
To implement a spatiotemporal autoencoder, the Keras deep learning framework
(Keras 2016) provides complete documentation, functionality and support for
implementing each layer of the autoencoder. Keras also provides clear documentation
with parameter explanations as well as examples for implementing the different layers
and functionality.
For further detail, the Keras GitHub repository can be consulted for the concrete
implementations and default values of certain parameters.
3.3.2 Autoencoder Model Architecture
Figure 3.3: Model Architecture
The model is built according to the architecture of a simple autoencoder, consisting
of an encoder and a decoder. The current design, developed in Keras, consists of a
spatiotemporal encoder and a spatiotemporal decoder that learn and handle the width,
height and time dimensions of the input dataset.
The encoder consists of Conv2D layers with the TimeDistributed wrapper. The
Conv2D layers handle the width and height of the data, while the TimeDistributed
wrapper applies them individually to every temporal slice of the input. The filter
count and the width and height decrease layer by layer through the encoder. After the
Conv2D layers, the features pass through the temporal encoder, a ConvLSTM2D
layer that learns the temporal characteristics. The recurrent network in ConvLSTM2D
operates only over the temporal dimension of the data, while using local convolutions
to handle the spatial values (Patraucean, Handa, & Cipolla, 2016). The ConvLSTM2D
layer is wrapped in the Bidirectional wrapper provided in Keras so that the RNN can
capture more temporal context.
In the decoder, the deconvolutional layers grow back up, mirroring the size
pattern of the encoder, until the learnt features are reconstructed into the same shape
as the input. In addition, after each processing and learning layer, the model passes
through a batch normalization layer, which normalizes the feature values, prevents
extreme differences in value, enhances model performance and accelerates learning.
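A minimal sketch of such an architecture is shown below, assuming channels-last
input of shape (8, 227, 227, 1); the filter counts and kernel sizes are illustrative
rather than the project's exact configuration, and the Bidirectional wrapper is
noted in a comment rather than applied:

from keras.models import Sequential
from keras.layers import (TimeDistributed, Conv2D, Conv2DTranspose,
                          ConvLSTM2D, BatchNormalization)

model = Sequential()
# Spatial encoder: the same Conv2D is applied to each of the 8 frames
model.add(TimeDistributed(Conv2D(128, (11, 11), strides=(4, 4), activation='relu'),
                          input_shape=(8, 227, 227, 1)))
model.add(BatchNormalization())
model.add(TimeDistributed(Conv2D(64, (5, 5), strides=(2, 2), activation='relu')))
model.add(BatchNormalization())
# Temporal encoder: a convolutional LSTM runs over the 8-frame axis
# (the project additionally wraps this layer in Keras's Bidirectional wrapper)
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True))
model.add(BatchNormalization())
# Spatial decoder: mirror the encoder back up to the input size
model.add(TimeDistributed(Conv2DTranspose(64, (5, 5), strides=(2, 2),
                                          activation='relu')))
model.add(BatchNormalization())
model.add(TimeDistributed(Conv2DTranspose(1, (11, 11), strides=(4, 4),
                                          activation='sigmoid')))
model.compile(optimizer='adam', loss='mse')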
3.3.3 Datasets Pre-processing
3.3.3.1 Nature and Content of Training Dataset
An unexpected traffic scene, such as an accident, is an example of an abnormal scene.
The current dataset does not contain such unlucky scenes. To fulfil the use case of
anomaly detection, only normal driving behaviour in one direction is used as the
training dataset, since on the highway drivers normally drive normal-sized (sedan)
vehicles in the same direction. Any traffic scene with a different driving behaviour,
such as a vehicle stopping on the shoulder of the highway, will be detected as an
anomaly; an anomaly is something that deviates from the normal behaviour and trend.
In addition, scenes with large vehicles such as trucks are also treated as anomalies,
to increase the proportion of abnormal scenes.
3.3.3.2 Convert Video to Image Frames
It is impossible to feed the whole video straight into the autoencoder, so the video
dataset needs to be transformed into image frames to ease the pre-processing steps.
AVCONV is one of the tools from libav, an open-source audio and video
processing suite (Libav 2016). AVCONV allows us to convert video to image frames,
for example converting each second of video into 20 image frames at a framerate of
20 frames/sec. The original number of frames per second of the video must of course
be known before converting; converting to a lower framerate than the original may
lose the semantic meaning of a given second.
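As a sketch, the conversion can be scripted from Python; the file names below are
hypothetical, and -r sets the output framerate of 20 frames/sec described above:

import subprocess

subprocess.check_call([
    'avconv', '-i', 'traffic.avi',    # input video (hypothetical name)
    '-r', '20',                       # extract 20 image frames per second
    'frames/frame-%05d.jpg',          # numbered output frames
])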
3.3.3.3 Grayscale
After the image frames are extracted from the video dataset, the original images are
colour images with three colour channels (red, green and blue). They are converted
into grayscale images. Colour images would be harder for the detector to learn from,
and training on them would take approximately three times as long as training on
grayscale images. At the same time, grayscale reduces the colour variation,
preventing it from confusing the detector and reducing accuracy.
Figure 3.4: Grayscale Image
3.3.3.4 Image Cropping
The current dataset records two-way traffic, where the left lane drives away from the
camera and the right lane drives towards it. Since the camera is placed over the right
lane, vehicles on the left lane are difficult to see even for a human, being further away
and smaller. The camera's field of view also includes the sky, a construction site and
tall buildings.
In the current approach, the left lane of the traffic and the recorded sky are
cropped out for denoising purposes. 150px is cropped from both the width and the
height of every original frame, resulting in a smaller, more focused scene of 570px
width and 426px height.
Figure 3.5: Cropping Image
3.3.3.5 Image Resizing
Resizing shrinks the original image. With the Plus Highway dataset, the cropped
frame is 570px wide and 426px high (570 x 426). This is still too large, and the
detector would take more time to understand the image, so in the current approach
all image frames are resized to 227px wide and 227px high (227 x 227).
Figure 3.6: Image Resize
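A sketch of the per-frame pre-processing with OpenCV is shown below; the 150px
crop and the 227 x 227 target follow the text, while which edges are cropped and
the scaling to [0, 1] are assumptions:

import cv2

def preprocess(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # 720 x 576 frame, 1 channel
    img = img[150:, :570]                         # drop 150px of sky and far lane
    img = cv2.resize(img, (227, 227))             # shrink to the model input size
    return img.astype('float32') / 255.0          # scale pixel values to [0, 1]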
3.3.4 Data Formatting
3.3.4.1 Storing images into volume
Plain 2D images have only width and height constraints; they carry no
spatiotemporal characteristics and no motion information. To introduce
spatiotemporal characteristics, the images are stacked into volumes, for example
eight image frames per volume, producing 3D (time, width, height) data. With
spatiotemporal characteristics, the detector can understand and learn the motion in
the video.
Figure 3.7: Image Frames Stored in Volumes
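A sketch of the stacking step with Numpy; frames is assumed to be an array of
pre-processed frames of shape (num_frames, 227, 227):

import numpy as np

def to_volumes(frames, length=8):
    n = (len(frames) // length) * length          # drop any leftover frames
    vols = frames[:n].reshape(-1, length, 227, 227)
    return vols[:, np.newaxis]                    # channel axis -> (X, 1, 8, 227, 227)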
3.3.4.2 Data Storage Formatting (HDF5)
Storing the data in three dimensions causes another problem, the curse of
dimensionality. Such high-dimensional data consumes a lot of memory when training
the detector; if the whole dataset were fed to the machine at once, even a high-
specification device could run out of memory. To overcome this, the data is stored
in Hierarchical Data Format 5 (HDF5). With HDF5, the input volumes can be read
matrix by matrix instead of loading the whole dataset into memory, so the memory
usage is much lower.
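A sketch with the h5py library; the file and dataset names are hypothetical, and
vols is the volume array from the previous sketch:

import h5py

# Write the volumes once, chunked so slices can be read without loading the rest
with h5py.File('train_volumes.h5', 'w') as f:
    f.create_dataset('volumes', data=vols, chunks=(100, 1, 8, 227, 227))

# Later, read only a slice; just this slice is pulled into memory
with h5py.File('train_volumes.h5', 'r') as f:
    batch = f['volumes'][0:1000]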
3.4 Evaluation Method
3.4.1 Training Phase Evaluation
3.4.1.1 Mean Square Error (MSE)
The model is configured to calculate a loss value during training. The loss value
displayed at each training epoch is the mean square error (MSE), the average of the
squared differences between the original and the estimated result. In this project, an
output is reconstructed from each input, and the loss is computed between input and
output; the smaller the loss, the better the model has learnt. An ideal, effective
learning process should produce a smaller and smaller loss value every epoch.
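In its standard form, for a reconstruction of N pixel values, the loss reported each
epoch is

MSE = \frac{1}{N} \sum_{i=1}^{N} (I_i - \hat{I}_i)^2

where I_i is an input pixel value and \hat{I}_i the corresponding reconstructed value.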
3.4.2 Classification Phase Evaluation
3.4.2.1 Regularity Score
The output reconstructed by the autoencoder is compared with the original input to
calculate the regularity score. If the regularity score is lower than the benchmark, an
abnormality has occurred in the scene.
3.4.2.2 Area Under Curve (AUC), Equal Error Rate (EER), Receiver Operating
Characteristic (ROC) Curve
The accuracy of the system is very important. Using the ROC curve, we can rate and
calculate the positive predictive level of the system (Tran et al., 2015). The ROC
curve is plotted from the True Positive Rate (TPR) and the False Positive Rate (FPR).
The Equal Error Rate is the point where the False Positive Rate meets the False
Negative Rate; the smaller the EER, the better the performance. The tables below
show the confusion matrix of the results, where a true positive is a detected abnormal
event, a false positive is a false alarm, a true negative is a detected normal event and
a false negative is a missed abnormal event.
Table 3.1: Confusion Matrix

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

Table 3.2: Confusion Matrix with Description

                    Predicted Positive      Predicted Negative
Actual Positive     Anomalous Event         Missed Event
Actual Negative     False Alarm             Normal Event
Figure 3.8: ROC Curve
Using the TPR and FPR, we are able to plot the ROC curve and rate the
performance and positive predictive level of the model. The area under the curve
(AUC) is calculated from the TPR and FPR using the Scikit-Learn library. A model
with a high positive predictive level has many true positive events and few or no
false positive events; indirectly, such a model will seldom or never trigger a false
alarm.
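A sketch of this calculation with Scikit-Learn is shown below; labels is the
per-frame ground truth and scores the per-frame anomaly score (for example
1 minus the regularity score), both hypothetical names:

import numpy as np
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(labels, scores)
roc_auc = auc(fpr, tpr)
# EER: the threshold where the false positive rate meets the false negative rate
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fpr - fnr))]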
CHAPTER 4
4 EXPERIMENT, RESULT AND ANALYSIS
4.1 Dataset
The traffic scenes used for this experiment are Plus Highway traffic scenes. The video
used for training was recorded in the afternoon, under normal light intensity. The
training dataset contains normal traffic scenes only, in which vehicles travel in one
direction and no vehicle stops on the highway; any other kind of traffic scene was
filtered out before training.
Figure 4.1: Sample Normal Traffic Scene
Figure 4.1 shows a sample video scene from the training dataset, with some
cars driving through the scene.
4.1.1 Dataset Cropping
In the original video dataset, the frame width and height are 720 x 576.
Figure 4.2: Original Traffic Scene
Figure 4.2 shows the original scene of the video dataset: a two-way traffic
condition on the Federal Highway. Because the camera is mounted quite high, the
field of view includes the sky, a construction site, tall buildings and trees.
Figure 4.3: Static Area Highlighted in Blue
Figure 4.3 highlights in blue the elements described above. After watching
most of the video dataset, the highlighted blue region, which covers nearly 35 percent
of the entire camera view, shows no change in activity and remains constant for
almost 95 percent of the footage. Even when activity does occur in the blue region,
it is difficult to see and identify, even for human sight, because the distance is too far
and objects appear very small.
As most of the blue region remains constant, it should be cropped out and
removed for the training and testing phases. If the whole scene were fed into the
training phase, the machine would learn the constant blue region as normal scene.
During the testing phase, every pixel of the image frame is taken into account in the
regularity score calculation. Since the blue region covers almost 35 percent of the
frame, while an abnormal event region covers only 5 to 10 percent, the constant blue
region could neutralize the effect of an abnormal event on the regularity score. This
would hurt the machine's performance and cause events to be missed.
Figure 4.4: Cropped Traffic Scene
To overcome this, the training and testing datasets are cropped, which also
removes objects that are far away and small.
4.1.2 Data Augmentation
For training purposes, a series of videos with normal scenes only is required. Using
one of the videos as input data, abnormal events were filtered out during pre-
processing to generate a clean training video with normal scenes only. As the training
video is only around 30 minutes long, more data is needed to train the autoencoder
model: the number of parameters in the model is large, and the given training video
alone is not sufficient (Hasan et al., 2016).
To increase the training dataset, augmentation is performed in the temporal
dimension of the data. Each training sample keeps the same shape of eight frames,
but with various stride (frame-skipping) patterns. In the current augmentation
approach, the input dataset is concatenated from frames sampled with stride-1,
stride-2, stride-3 and stride-4. The stride-1 pattern samples the data volume with
consecutive frames, while stride-2, stride-3 and stride-4 skip one, two and three
frames respectively, forming the patterns {1,2,3,4,5,6,7,8}, {1,3,5,7,9,11,13,15},
{1,4,7,10,13,16,19,22} and {1,5,9,13,17,21,25,29}.
With this approach, the training dataset was expanded from 4092 volumes to
9795 volumes for better training.
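A Numpy sketch of the stride sampling is shown below; how far successive volumes
are advanced within each stride is an assumption, as the text only fixes the skipping
patterns:

import numpy as np

def stride_volumes(frames, length=8, strides=(1, 2, 3, 4)):
    vols = []
    for s in strides:
        span = (length - 1) * s + 1                # frames covered by one volume
        for start in range(0, len(frames) - span + 1, span):
            idx = start + s * np.arange(length)    # e.g. {0,2,4,...} for stride 2
            vols.append(frames[idx])
    return np.stack(vols)[:, np.newaxis]           # shape (X, 1, 8, 227, 227)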
4.2 Batch Processing
For testing purposes, an unseen one-hour video scene is processed into the required
format of (X, 1, 8, 227, 227), where X is the total number of volumes. A one-hour
video is transformed into an HDF5 file as large as 32GB. During the testing phase,
that 32GB would be loaded into memory, and the predicted output adds around
another 15GB.
After the testing phase, the evaluation phase requires data from both the original
test data (32GB) and the predicted output (15GB), around 50GB in total, for the
regularity score and ROC calculations. As the LXC container is limited to 40GB of
memory, the PC cannot hold that much data: the process is killed once the memory
is fully utilized.
Figure 4.5: Job Killed Due to Memory Limit
As the volumized data requires a large amount of space and memory, batch
loading is required. Batch loading avoids loading all the data into memory at once
and exhausting the memory space. In the current approach, each batch loads only
1000 volumes and performs the calculations for those 1000 volumes, keeping the
memory usage within the utilization range.
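A sketch of the batched evaluation loop over the HDF5 test file; the file and dataset
names are hypothetical, and model is the trained autoencoder from Chapter 3:

import h5py

with h5py.File('test_volumes.h5', 'r') as f:
    data = f['volumes']                        # stays on disk until sliced
    for start in range(0, data.shape[0], 1000):
        batch = data[start:start + 1000]       # only 1000 volumes in memory
        recon = model.predict(batch)
        # ... accumulate per-frame reconstruction errors for this batch ...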
4.3 Evaluation Phase
4.3.1 Training Phase Visualization
Figure 4.6: Training Epochs Loss Value
Using the augmented training video dataset, the autoencoder was trained for 50
epochs. The loss value recorded is the mean squared error (MSE), which measures
the average squared error/deviation and represents how well the model is learning
the features of the input dataset. The lower the loss value, the better the model has
learnt the features; the decreasing trend also shows that the model learns better after
each epoch. However, a practically low loss value during the training phase does not
guarantee accurate results during testing; the model still needs to be tested on an
unseen dataset.
During the training phase, the model weights are saved at every epoch, as
different weights have different loss values. During the testing phase, the trained
weights with the lowest loss value are used.
4.3.2 Reconstruction Error
Using the trained model, the model predicts and analyses the unseen data. The
output of the prediction and analysis has the same format as the input, which is
9000 x 1 x 8 x 227 x 227.
The reconstruction error is calculated from the Euclidean distance between
each frame of the original dataset and the predicted output. To get the reconstruction
error for one frame, the Euclidean distance is calculated for each pixel, and the sum
over every pixel of the frame is the reconstruction error of that frame.
e(t, x, y) = \lVert I(t, x, y) - f_W(I(t, x, y)) \rVert_2

e(t) = \sum_{(x, y)} e(t, x, y)

where
e = reconstruction error
t, x, y = frame number, width/x position and height/y position
I = original test data
f_W = predicted output from the test data
e(t) = sum of the reconstruction error over every pixel in frame t
4.3.3 Normalization
Each reconstruction error is normalized using the formula below, scaling the
reconstruction errors to the range 0 to 1. Normalization allows easier visualization
of the regularity score.
e_i = \frac{e_i - E_{min}}{E_{max} - E_{min}}

where
E_{min} = minimum value of the errors E
E_{max} = maximum value of the errors E
4.3.4 Regularity Score
To get the regularity score of a frame, the normalized reconstruction error is
subtracted from 1. A frame with a high regularity score is normal; if a frame has a
practically low regularity score, the frame is abnormal.

s(t) = 1 - e(t)

where
s(t) = regularity score for frame t
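The three steps above (reconstruction error, normalization and score) can be sketched
in a few lines of Numpy; test and recon are hypothetical arrays of original and
reconstructed frames of shape (num_frames, 227, 227):

import numpy as np

def regularity_scores(test, recon):
    e = np.abs(test - recon).sum(axis=(1, 2))   # e(t): per-frame sum of pixel errors
    e = (e - e.min()) / (e.max() - e.min())     # normalize errors to [0, 1]
    return 1.0 - e                              # s(t) = 1 - e(t)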
4.4 Result
Each raw unseen dataset is transformed into the volumized input shape required by
the model for testing. Each raw unseen dataset is one hour long, generating 72000
frames at a rate of 20 frames/sec. After prediction, the regularity score is calculated
for each image frame of the video. The benchmark for anomaly detection is set at
0.4: any image frame with a regularity score lower than 0.4 is considered an abnormal
event/traffic scene.
4.4.1 Video Version 1
Version 1 is the first dataset used for anomaly detection. In video version 1, the light
intensity of the scene is the same as in the training dataset. Figure 4.7 shows the
regularity score calculated for each frame after prediction.
Figure 4.7: Regularity Score Calculated for Video Version 1
As mentioned, this dataset is used because buses stop by the highway to load
and unload passengers; this is the main anomaly event we planned to catch. Based
on the current prediction, scenes with large vehicles are detected as anomalies as
well.
Figure 4.8: Large Vehicle Detected in Video Version 1
The training dataset does contain scenes with large vehicles such as lorries and
trucks, but their shapes differ from the ones predicted here. Unexpectedly producing
a positive result, the large vehicles detected, as shown, are large vehicles that the
model has never seen before.
The other detected scene is the bus stopping by the highway. The event involves
a consecutive process: the bus slows down by the highway, stops, loads/unloads
passengers, then departs and leaves the camera view. All of these steps are detected
as anomalies by the model.
Figure 4.9: Bus Stopping Event Detected in Video Version 1
Figure 4.10 shows other bus stopping events detected in the current dataset version.
Figure 4.10: Other Bus Stopping Events Detected in Video Version 1
Figure 4.11 shows the ROC curve, AUC and EER calculated for video
version 1.
Figure 4.11: ROC Curve for Video Version 1
4.4.2 Video Version 2
Version 2 is the second dataset used for anomaly detection. In video version 2, the
light intensity of the scene is the same as in the training dataset. Figure 4.12 shows
the regularity score calculated for each frame after prediction.
Figure 4.12: Regularity Score Calculated for Video Version 2
Below are example scenes with regularity scores lower than 0.4. In the first
situation, the model detected a new abnormal event: instead of a bus stopping event,
the model was able to detect a smaller vehicle (a taxi) stopping. The detected event
includes the consecutive steps of the taxi slowing down, stopping, a passenger
approaching the taxi, and the passenger negotiating the price with the taxi driver.
The entire event is detected as an anomaly.
Figure 4.13: Taxi Stopping Event Detected in Video Version 2
The second situation detected in version 2 includes a bus stopping event and a
sedan car stopping behind the road signboard. As the two events occurred together,
it was initially assumed that the model was detecting the bus stopping event and
unable to detect the sedan car, which is less obvious and partially blocked by the
signboard. In the original dataset, the bus arrives, stops and departs, while the sedan
car remains stopped behind the signboard after the bus has departed. The predicted
output, however, shows that the model is able to detect the sedan car stopping behind
the signboard. This shows that the model is capable of detecting an abnormal event
even when it is partially blocked by a static object such as a signboard.
Figure 4.14: Bus Stopping Event and Vehicle Stopping Behind Signboard in Video
Version 2
Other abnormal events, such as bus stopping events and large vehicle events,
are also detected by the same model in this dataset.
Figure 4.15: Large Vehicle Event Detected in Video Version 2
Figure 4.16: Bus Stopping Event Detected in Video Version 2
In the version 2 dataset, however, most of the false alarms occurred in traffic
conditions with fewer vehicles driving in the scene.
Figure 4.17: Traffic Condition with Fewer Cars
Figure 4.18 shows the ROC curve, AUC and EER calculated for video
version 2.
Figure 4.18: ROC Curve for Video Version 2
4.4.3 Video Version 3
Version 3 is the third dataset used for anomaly detection. In video version 3, the light
intensity of the scene differs from the training dataset. The estimated time frame of
the dataset is around 7.00 a.m. to 8.00 a.m. In the first 30 minutes, the light intensity
is dimmer, as the sun has not yet risen; after the first 30 minutes, the sun rises
gradually.
Figure 4.19 shows the regularity score calculated for each frame after
prediction. Based on the result, the detection behaviour becomes extensively
abnormal in the second 30 minutes of the dataset.
Figure 4.19: Regularity Score Calculated for Video Version 3
At the beginning of the video, the sun has not yet risen and the light intensity
is slightly lower than in the training dataset. Because of this, some of the recorded
scenes are a little blurry compared with the clearer training dataset. Even so, the
model is still able to detect abnormal events: Figure 4.20 shows large vehicle events
detected even though the scene recorded is blurrier than the usual scene.
Figure 4.20: Large Vehicle Detected in Blur Condition in Video Version 3
Figure 4.21 shows a large vehicle event detected by the model in video version 3.
Figure 4.21: Large Vehicle Detected in Video Version 3
Figure 4.22 shows an event with two large vehicles detected at the same time.
Figure 4.22: Two Large Vehicles Detected in Video Version 3
In the second 30 minutes of the video, the sun starts to rise, at an estimated
time of around 7.30 a.m. The result shows all the frames after 7.30 a.m. having low
regularity scores and being detected as anomalies. A few factors cause this.
As the sun rises, more and stronger light rays are emitted into the scene.
Unfortunately, the light happens to reflect into the convex lens of the camera, and
the reflection inside the lens produces a bright focal point in the camera view. The
model has never seen this situation before, so it predicts it as an abnormal event with
a practically low regularity score.
Secondly, the camera faces east, the direction of the sunrise. As the sun rises
gradually in the east while the vehicles head west, every object casts a more and
more visible shadow, for instance the shadows of the cars and of the billboard. The
training dataset contains no scenes with object shadows, as it was recorded around
noon, when objects cast little or no shadow.
As shown in Figure 4.23, as the sun rises, the reflection focal point in the
camera lens becomes more obvious, while the shadow of each object becomes
smaller as the sun climbs higher.
Figure 4.23: Scene with Light Reflection and Shadow in Video Version 3
Figure 4.24 shows the ROC curve, AUC and EER calculated for video
version 3.
Figure 4.24: ROC Curve for Video Version 3
4.5 Ground Truth Labelling Method
In the first ground truth attempt, abnormal events were labelled as abnormal
(abnormal = 1) even when the abnormal event was unobvious and very small.
As shown in Figure 4.25, on the left there is one passenger waiting and
standing by the highway, while on the right a human is crossing the highway from
left to right. These are examples of unobvious abnormal events that were labelled
as abnormal.
Figure 4.25: Scene with Unobvious Anomaly
In the version 1 dataset, even though the model is able to detect abnormal
events, the AUC is extensively low, at only 0.548. A low AUC indirectly means the
model is inefficient; an AUC of 0.548 is close to random guessing.
Analysis of the detected scenes shows that all of them differ obviously from
the normal traffic scene. Unobvious events like the human walking shown in
Figure 4.25 cannot be detected at all.
By adjusting the ground truth values, setting only obvious abnormal scenes as
abnormal and unobvious abnormal scenes as normal, the AUC greatly increased
from 0.548 to 0.769 and the EER dropped from 0.463 to 0.312.
Figure 4.26: ROC Curve Improvement with Obvious Scenes Only
In short, this adjustment means that the model is only able to detect obvious
abnormal scenes with larger objects of interest.
4.6 Video Average Handling Method
As seen in all the previous regularity score results, consecutive video frames are
actually almost identical, yet the model calculates regularity scores with a big gap
between frames, a variance of around 0.4. The variance between consecutive frames
should not be this large, since consecutive frames are nearly the same.
To overcome this issue, the data handling and reconstruction error calculation
were redone with an overlapping and averaging approach. In the original processing
of the testing dataset, the volumes were stored consecutively with no overlap, in the
pattern {1,2,3,4,5,6,7,8}, {9,10,11,12,13,14,15,16}, ...,
{71993,71994,71995,71996,71997,71998,71999,72000}, so the reconstruction error
depended only on each frame of each volume.
In the current approach, data is processed and stored so that the last 4 frames of each volume overlap with the first 4 frames of the next volume. Only the first 4 frames (1,2,3,4) and the last 4 frames (71997,71998,71999,72000) have no overlap with another volume, since the first volume has no predecessor and the last volume has no successor.
This approach roughly doubles the size of the testing dataset to about 18000 volumes. During the evaluation phase, the reconstruction error of each frame is obtained by averaging the reconstruction errors predicted for that frame across the different volumes that contain it. For example, frame 9 appears as the 5th frame of the second volume and as the 1st frame of the third volume; its final reconstruction error is the sum of those two errors divided by 2.
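A minimal NumPy sketch of this overlapping-and-averaging scheme is shown below; the array layout and the per-frame error callback are illustrative assumptions rather than the report's actual implementation.

import numpy as np

VOLUME_LEN = 8  # frames per volume
STRIDE = 4      # consecutive volumes share 4 frames

def build_volumes(frames):
    # Slice a (num_frames, height, width) array into overlapping volumes.
    starts = list(range(0, len(frames) - VOLUME_LEN + 1, STRIDE))
    return np.stack([frames[s:s + VOLUME_LEN] for s in starts]), starts

def averaged_frame_errors(frames, per_frame_error):
    # Average the reconstruction error of each frame over every volume that
    # contains it (e.g. frame 9 is averaged over two volumes).
    volumes, starts = build_volumes(frames)
    err_sum = np.zeros(len(frames))
    err_cnt = np.zeros(len(frames))
    for vol, s in zip(volumes, starts):
        errors = per_frame_error(vol)  # one error value per frame in the volume
        err_sum[s:s + VOLUME_LEN] += errors
        err_cnt[s:s + VOLUME_LEN] += 1
    return err_sum / err_cnt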
The results show that overlapping-and-average handling reduces the variance in regularity score between consecutive frames by around 50 percent compared with the direct approach to storing the testing dataset and calculating the regularity score: the variance between continuous frames drops from around 0.4 to around 0.2.
The regularity scores generated via the averaging method produce the same curve shape as those of the normal storing approach. However, the reconstruction error of each frame is reduced, raising its regularity score, so frames that scored below 0.4 under the normal data handling approach score above 0.4 under the overlapping data handling approach. This reduces the number of anomalies detected, making the model more precise for each event.
Figure 4.27: Overlapping Data Handling
The same applies to the ROC curve, which is calculated from the reconstruction errors: each reconstruction error becomes more accurate and precise, matching the ground truth more closely and yielding a better AUC and EER.
Figure 4.28: Variance of Regularity Score Enhancement for Video Version 1
Figure 4.29: Variance of Regularity Score Enhancement for Video Version 2
Figure 4.30: Variance of Regularity Score Enhancement for Video Version 3
Figure 4.31: ROC Curve Enhancement for Video Version 1
Figure 4.32: ROC Curve Enhancement for Video Version 2
Figure 4.33: ROC Curve Enhancement for Video Version 3
4.7 Summary
4.7.1 Result Summary
Table 4.1: Result Summary

Dataset            | Expected Anomalous Events | Predicted Anomalous Events | False Alarms | AUC   | EER
1                  | 31                        | 12                         | 1            | 0.769 | 0.312
1 (Average Method) | 31                        | 7                          | 0            | 0.812 | 0.252
2                  | 32                        | 24                         | 16           | 0.697 | 0.363
2 (Average Method) | 32                        | 6                          | 4            | 0.751 | 0.316
3                  | 37                        | 14                         | ∞            | 0.547 | 0.443
3 (Average Method) | 37                        | 2                          | ∞            | 0.558 | 0.452
The result summary shows that even though the model misses many events (false negatives), the AUC remains fairly high. Inspection of the missed events across the 3 datasets shows that they all fall into the same category: large vehicle events. The model's behaviour on these is inconsistent; some large vehicle events are detected as anomalies while others are classified as normal.
A large vehicle remains in view for only around 3 seconds, so missing these events lowers the AUC and raises the EER only slightly, because the duration is very short. In contrast, the model successfully detects the vehicle-stopping events by the highway shoulder in dataset versions 1 and 2. The whole sequence is detected as an anomaly, and its duration is long, approximately 30 to 90 seconds; detecting such long, consecutive events improves the AUC and EER considerably because a large number of frames is involved.
The model also triggers false alarms, detecting normal traffic scenes as anomalies. Inspection shows that these false alarms occur in normal traffic scenes that happen to contain fewer vehicles than usual. For dataset version 3, the sun rises during the second half of the video, causing the model to detect every frame after roughly the 30-minute mark as an anomaly and generating an uncountable number of false alarms.
With the overlapping-and-average data method, the number of false alarms decreases, but the number of anomalous events detected decreases as well. Averaging makes the reconstruction error smaller than under the normal approach, indirectly making the per-frame prediction more precise and accurate; this is also why the AUC increases and the EER decreases. With the overlapping approach the output is more precise: the major abnormal events, the vehicle-stopping events, are still detected in dataset versions 1 and 2, while the number of minor abnormal events detected, the large vehicle events, drops drastically in dataset versions 1, 2 and 3.
4.7.2 Model Summary
The predictions across the 3 datasets show that the model is capable of detecting anomalous events in a traffic scene. However, it is only able to detect abnormal events with an obvious object of interest, such as bus-stopping, taxi-stopping, sedan-stopping and large vehicle events. Abnormal events with a smaller object of interest, such as a human crossing the highway or standing and waiting by the highway shoulder, go undetected: because a human is much smaller than a vehicle, the model is unable to "see" that the human exists.
Testing on dataset version 3 also exposed the sun-ray reflection in the camera lens at sunrise. Reflection is a perfectly normal natural phenomenon; in everyday life humans see reflections everywhere and ignore them as harmless. The model, however, has no such heuristic knowledge and cannot understand that it is merely a reflection of light, so instead of treating it as a normal natural event, it detects it as abnormal. From this result, it can be inferred that the current model is unable to handle natural events such as sun-ray reflection, and likewise weather changes, rain, thunderstorms and so on. Training with scenes containing such natural events is required to handle them effectively.
CHAPTER 5
5 CONCLUSION AND RECOMMENDATIONS
5.1 Conclusion
In this report, we have presented and demonstrated that a data-driven, unsupervised learning approach can produce a traffic anomaly detector. The model is able to recognize, learn and detect abnormal events it has never seen before.
CCTV cameras are everywhere, recording large amounts of valuable video, but analysing that video and detecting events manually demands a great deal of human time and effort. With this approach, the model can analyse the footage and identify the meaningful events among all the events in the video, minimizing the human workload so that minimal or no human supervision is required.
The spatiotemporal autoencoder receives 3D input, allowing the model to learn the appearance of each pixel as well as the motion across the consecutive frames within one volume. Learning motion characteristics allows the model to understand more about the behaviour of an event.
During anomaly identification, a regularity score is calculated from each frame's predicted output to decide whether the frame is anomalous; frames whose regularity score falls below the benchmark are considered anomalies. However, the current training dataset only covers traffic scenes free of practical natural conditions such as night, sun reflection and weather change, so handling anomaly detection across varied conditions with the same model will be very challenging.
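As a concrete illustration, the per-frame regularity score can be computed by min-max normalising the reconstruction errors, following Hasan et al. (2016); the sketch below assumes that formulation, with the 0.4 benchmark matching the one applied in Chapter 4.

import numpy as np

def regularity_scores(errors, benchmark=0.4):
    # Min-max normalised regularity score per frame: a high reconstruction
    # error maps to a low regularity score.
    e = np.asarray(errors, dtype=float)
    s = 1.0 - (e - e.min()) / (e.max() - e.min())
    return s, s < benchmark  # frames scoring below the benchmark are anomalous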
Using the Plus Highway video dataset as the experimental use case, the data was pre-processed and used to train the autoencoder. The autoencoder learns the situational behaviour of the traffic; a traffic scene whose regularity score falls below the benchmark is considered behaviour never seen before, i.e. an anomalous event. The detection results show that the autoencoder is able to detect anomalous scenes, achieving an average AUC of around 0.7 across the different datasets.
5.2 Recommendations for future work
5.2.1 Deeper Neural Network for Autoencoder Model
The current model is only able to detect anomalous events with an obvious object of interest, not events with a smaller one. To overcome this, more layers should be added to both the spatial and the temporal parts of the autoencoder, allowing the model to reach deeper into the image and learn the features in every pixel more precisely and accurately. A sketch of such a deeper model follows.
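The following Keras sketch shows one way such a deeper spatiotemporal autoencoder could be arranged; the frame resolution, volume length, filter counts and kernel sizes are illustrative assumptions, not the architecture actually trained in this project.

from keras.models import Sequential
from keras.layers import TimeDistributed, Conv2D, Conv2DTranspose, ConvLSTM2D

model = Sequential([
    # spatial encoder, one layer deeper than a shallow two-layer baseline
    TimeDistributed(Conv2D(128, (11, 11), strides=4, padding='same',
                           activation='relu'),
                    input_shape=(8, 224, 224, 1)),  # 8-frame greyscale volumes
    TimeDistributed(Conv2D(64, (5, 5), strides=2, padding='same', activation='relu')),
    TimeDistributed(Conv2D(32, (3, 3), strides=1, padding='same', activation='relu')),
    # temporal encoder-decoder: extra ConvLSTM2D layers model motion more deeply
    ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True),
    ConvLSTM2D(16, (3, 3), padding='same', return_sequences=True),
    ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True),
    # spatial decoder mirrors the encoder to reconstruct the input volume
    TimeDistributed(Conv2DTranspose(64, (3, 3), strides=1, padding='same', activation='relu')),
    TimeDistributed(Conv2DTranspose(128, (5, 5), strides=2, padding='same', activation='relu')),
    TimeDistributed(Conv2DTranspose(1, (11, 11), strides=4, padding='same', activation='sigmoid')),
])
model.compile(optimizer='adam', loss='mse')  # reconstruction error drives the regularity score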
5.2.2 Train Model with Larger Dataset
The current training dataset consists of only around 30 minutes of normal-scene footage. To enhance the performance of the model, it should be trained on a larger and longer dataset containing more normal traffic scenes. Training with more data allows the model to learn a wider variety of normal traffic scenes and reduces the number of false alarms triggered.
5.2.3 Enhance Data Augmentation Technique
The current data augmentation technique applies temporal strides of 1, 2, 3 and 4 to increase the size of the dataset, but this does not increase the number of distinct image frames. To augment the frames themselves, the images can be blurred, and rotated and translated by around 1 to 2 degrees in every direction, as sketched below. Rotation and translation also cover situations such as strong wind, where a shaky camera shifts the view by approximately 1 to 2 degrees as well. Augmenting the image frames indirectly increases the number of volumes in the training dataset.
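A minimal per-frame sketch of these augmentations using SciPy is shown below; the blur strength and the interpretation of the 1-to-2-degree translation as a shift of a few pixels are assumptions for illustration.

import numpy as np
from scipy.ndimage import gaussian_filter, rotate, shift

def augment_frame(frame, rng):
    # Apply a mild blur, a small rotation and a small translation to one frame.
    out = gaussian_filter(frame, sigma=rng.uniform(0.0, 1.0))
    out = rotate(out, angle=rng.uniform(-2.0, 2.0), reshape=False, mode='nearest')
    out = shift(out, shift=(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)),
                mode='nearest')
    return out

rng = np.random.default_rng(0)
frame = np.zeros((224, 224))  # placeholder; real frames come from the video volumes
augmented = augment_frame(frame, rng)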
5.2.4 Dataset Expansion Covering Different Natural Situation
The current model can only handle normal traffic situations with minimal or no natural events. In nature, the sun rises and sets, and rain, thunderstorms and the like occur, producing morning, afternoon, night and weather-change scenarios. At night, drivers switch on their headlights, which strike the road surface, and the street lights come on as well, generating far more noise than the normal daytime situation.
Figure 5.1: Night Traffic Scene
In addition, on rainy days the camera view is filled with falling water and cannot "see" the situation clearly, generating even more noise. In the more unfortunate case where a water droplet lands on the camera lens, it leaves a watermark that blocks the camera view as well.
In short, the dataset should be expanded to include morning, afternoon, night and weather-change scenes in order to improve the model's capability to handle the various types of practical natural issues and events.
5.2.5 Training Approach for Handling Different Situation
To improve the model by training it on new and different situations, however, the training hyperparameters need to be reconfigured. In Keras, callback functions such as LearningRateScheduler and ReduceLROnPlateau can be used. With LearningRateScheduler, the initial learning rate decreases by a defined factor after each training epoch, whereas ReduceLROnPlateau only decreases the learning rate by a defined factor once the loss has plateaued and the model has stopped improving for n epochs. The learning rate should also be initialized smaller than in the original training run, as sketched below. Using a learning rate callback helps prevent the important knowledge learned previously from being overwritten by the new information.
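A minimal Keras sketch of both callbacks is shown below; the decay factor, patience and learning rate values are illustrative assumptions rather than tuned settings.

from keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

# Option 1: decay the learning rate by a fixed factor after every epoch,
# starting from a rate smaller than the one used for the original training.
step_decay = LearningRateScheduler(lambda epoch: 1e-5 * (0.9 ** epoch))

# Option 2: only reduce the learning rate once the loss has stopped
# improving for `patience` consecutive epochs.
plateau = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=5, min_lr=1e-7)

# Hypothetical fine-tuning call on volumes from the new scenario:
# model.fit(new_volumes, new_volumes, epochs=20, callbacks=[plateau])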
REFERENCES
Anon., 2016. CCTV Facial Recognition & Video Analytics Software Systems - Essex & UK. [Online] Available at: http://www.clearview-communications.com/cctv/facial-recognition-video-analytics [Accessed 4 July 2016].
Butler, Z., Corke, P., Peterson, R. and Rus, D., 2006. From Robots to Animals: Virtual
Fences for Controlling Cattle. USA, Computer Science and Artificial Intelligence
Laboratory, MIT, pp. 1-24.
Chollet, F., 2016. Building Autoencoder in Keras. [Online]
Available at: https://blog.keras.io/building-autoencoders-in-keras.html
[Accessed 4 August 2016].
Collier, A., 2015. Making Sense of Logarithmic Loss. [Online]
Available at: https://www.r-bloggers.com/making-sense-of-logarithmic-loss/
[Accessed 29 July 2016].
Dalli, A., n.d. Intelligent Traffic Scene Analysis. [Online]
Available at: http://traffiko.com/press/articles/intelligent-traffic-scene-analysis/
[Accessed 7 July 2016].
Hasan, M. et al., 2016. Learning Temporal Regularity in Video Sequences. UC Riverside, arXiv, pp. 1-40.
Honovich, J., 2008. Top 3 Problems Limiting the Use and Growth of Video Analytics.
[Online] Available at: https://ipvm.com/reports/top-3-problems-limiting-the-use-and-
growth-of-video-analytics [Accessed 1 July 2016].
Keras, 2016. [Online] Available at: keras.io [Accessed 11 August 2016].
Kohn, A., 2014. Brain Science: Focus-Can You Pay Attention?. [Online]
Available at: http://www.learningsolutionsmag.com/articles/1440/brain-science-
focuscan-you-pay-attention [Accessed 25 July 2016].
Libav, 2016. [Online] Available at: https://libav.org/avconv.html [Accessed 28 July
2016].
Narciso, G., 2014. The Evolution of Video Analytics: Past Failures to Accurate Crime
Preventing Tool. [Online]
Available at: http://avigilon.com/news/innovation/the-evolution-of-video-analytics-
past-failures-to-accurate-crime-preventing-tool/ [Accessed 2 July 2016].
Patraucean, V., Handa, A. and Cipolla, R., 2016. Spatio-Temporal Video Autoencoder with Differentiable Memory. [Online] Available at: https://arxiv.org/pdf/1511.06309.pdf [Accessed 2 August 2016].
Tran, D. et al., 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. [Online] Available at: https://arxiv.org/pdf/1604.04574v1.pdf [Accessed 2 January 2017].