TRAFFIC SCENE ANOMALY DETECTION
LUI CAI AN
A project report submitted in partial fulfilment of the
requirements for the award of Bachelor of Engineering
(Hons.) Software Engineering
Lee Kong Chian Faculty of Engineering and Science
Universiti Tunku Abdul Rahman
April 2017
DECLARATION
I hereby declare that this project report is based on my original work except for
citations and quotations which have been duly acknowledged. I also declare that it has
not been previously or concurrently submitted for any other degree or award at
UTAR or other institutions.
Signature :
Name : Lui Cai An
ID No. : 1305946
Date :
APPROVAL FOR SUBMISSION
I certify that this project report entitled “TRAFFIC SCENE ANOMALY
DETECTION” was prepared by LUI CAI AN and has met the required standard for
submission in partial fulfilment of the requirements for the award of Bachelor of
Science (Hons.) Software Engineering at Universiti Tunku Abdul Rahman.
Approved by,
Signature :
Supervisor : Dr. Tay Yong Haur
Date :
Signature :
Co-Supervisor :
Date :
The copyright of this report belongs to the author under the terms of the
Copyright Act 1987 as qualified by the Intellectual Property Policy of Universiti Tunku
Abdul Rahman. Due acknowledgement shall always be made of the use of any material
contained in, or derived from, this report.
© 2017, Lui Cai An. All rights reserved.
ACKNOWLEDGEMENTS
I would like to thank everyone who contributed to the successful completion of
this project. I would like to express my gratitude to my research supervisor, Dr. Tay
Yong Haur, for his invaluable advice, guidance and enormous patience throughout
the development of the research.
Special thanks to Chong Yong Shean, who shared a great deal of experience and
insight related to this project. Several discussions and sharing sessions allowed me to
understand the project much better; without her, each stage of this project would have
been considerably more difficult.
Lastly, I would like to thank my family members and friends, who provided
support and encouragement throughout the entire project. Without their positive
support and encouragement, the project would not have been completed on schedule.
ABSTRACT
Abstracting and inspecting meaningful activities from long-hour video is very
challenging. The traditional approach to video analytics and anomaly detection is
rule-based: a set of rules is predefined as the pool of meaningful events to be detected.
The rule-based approach limits detection performance, often triggers false alarms,
requires extensive configuration for each particular scenario and requires heavy
maintenance after setup. To overcome the limitations of the traditional approach, we
propose a data-driven deep learning approach using unlabelled data, training a
spatiotemporal autoencoder model on a dataset of normal traffic scenes. The
autoencoder model combines a Convolutional Neural Network (CNN) and Long
Short-Term Memory (LSTM) to learn the spatial and temporal characteristics of the
video dataset. This unsupervised approach minimizes the dependency on humans, as
only limited human supervision is required. On the Plus Highway dataset, the results
show that the model is able to perform anomaly detection.
TABLE OF CONTENTS
DECLARATION
APPROVAL FOR SUBMISSION
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS / ABBREVIATIONS
CHAPTER
1 INTRODUCTION
1.1 Background
1.2 Problem Statement
1.2.1 Tons of Data Generated from Time to Time
1.2.2 Existing Video Analytic Tools Often Trigger False Alarms and Are Not Adaptive (Dalli n.d.)
1.2.3 Existing Video Analytic Tools Involve Complicated Configuration during Setup and Are Difficult to Maintain (Narciso 2014)
1.3 Project Objective
1.4 Scope
1.4.1 Deliverable
1.4.2 Modules Covered
1.4.3 Modules Not Covered
2 RELATED WORK
2.1 Existing Approaches
2.1.1 Rule-Based Approach Detection
2.2 Deep Learning Technique
2.2.1 Convolutional Neural Network (CNN)
2.2.2 Long Short-Term Memory (LSTM)
2.2.3 Building Autoencoder
2.2.4 Application of Autoencoder in Anomaly Detection
2.3 Summary
3 PROPOSED SOLUTION
3.1 Deep Learning Approach via Data-Driven Approach
3.2 System Overview
3.3 Technology and Techniques Involved
3.3.1 Deep Learning Framework
3.3.2 Autoencoder Model Architecture
3.3.3 Datasets Pre-processing
3.3.4 Data Formatting
3.4 Evaluation Method
3.4.1 Training Phase Evaluation
3.4.2 Classification Phase Evaluation
4 EXPERIMENT, RESULT AND ANALYSIS
4.1 Dataset
4.1.1 Dataset Cropping
4.1.2 Data Augmentation
4.2 Batch Processing
4.3 Evaluation Phase
4.3.1 Training Phase Visualization
4.3.2 Reconstruction Error
4.3.3 Normalization
4.3.4 Regularity Score
4.4 Result
4.4.1 Video Version 1
4.4.2 Video Version 2
4.4.3 Video Version 3
4.5 Ground Truth Labelling Method
4.6 Video Average Handling Method
4.7 Summary
4.7.1 Result Summary
4.7.2 Model Summary
5 CONCLUSION AND RECOMMENDATIONS
5.1 Conclusion
5.2 Recommendations for Future Work
5.2.1 Deeper Neural Network for Autoencoder Model
5.2.2 Train Model with Larger Dataset
5.2.3 Enhance Data Augmentation Technique
5.2.4 Dataset Expansion Covering Different Natural Situations
5.2.5 Training Approach for Handling Different Situations
REFERENCES
LIST OF TABLES
Table 3.1: Confusion Matrix
Table 3.2: Confusion Matrix with Description
Table 4.1: Result Summary
LIST OF FIGURES
Figure 1.1: Control Room with Many CCTV Monitors
Figure 1.2: Video Analytic Tool using 2D Virtual Fencing Approach (Facial Recognition & Video Analytic Software 2016)
Figure 1.3: Surveillance Detected by Virtual Fencing Approach (Facial Recognition & Video Analytic Software 2016)
Figure 2.1: Repeating Module in a Standard RNN
Figure 2.2: Repeating Module in an LSTM Containing 4 Layers
Figure 2.3: Autoencoder Sample (Chollet 2016)
Figure 3.1: Sample Traffic Scene with Normal and Abnormal Driving Direction
Figure 3.2: System Overview
Figure 3.3: Model Architecture
Figure 3.4: Grayscale Image
Figure 3.5: Cropping Image
Figure 3.6: Image Resize
Figure 3.7: Image Frames Stored in Volumes
Figure 3.8: ROC Curve
Figure 4.1: Sample Normal Traffic Scene
Figure 4.2: Original Traffic Scene
Figure 4.3: Static Area Highlighted in Blue
Figure 4.4: Cropped Traffic Scene
Figure 4.5: Job Killed Due to Memory Limit
Figure 4.6: Training Epochs Loss Value
Figure 4.7: Regularity Score Calculated for Video Version 1
Figure 4.8: Large Vehicle Detected in Video Version 1
Figure 4.9: Bus Stopping Event Detected in Video Version 1
Figure 4.10: Other Bus Stopping Events Detected in Video Version 1
Figure 4.11: ROC Curve for Video Version 1
Figure 4.12: Regularity Score Calculated for Video Version 2
Figure 4.13: Taxi Stopping Event Detected in Video Version 2
Figure 4.14: Bus Stopping Event and Vehicle Stopping Behind Signboard in Video Version 2
Figure 4.15: Large Vehicle Event Detected in Video Version 2
Figure 4.16: Bus Stopping Event Detected in Video Version 2
Figure 4.17: Traffic Condition with Fewer Cars
Figure 4.18: ROC Curve for Video Version 2
Figure 4.19: Regularity Score Calculated for Video Version 3
Figure 4.20: Large Vehicle Detected in Blur Condition in Video Version 3
Figure 4.21: Large Vehicle Detected in Video Version 3
Figure 4.22: Two Large Vehicles Detected in Video Version 3
Figure 4.23: Scene with Light Reflection and Shadow in Video Version 3
Figure 4.24: ROC Curve for Video Version 3
Figure 4.25: Scene with Unobvious Anomaly
Figure 4.26: ROC Curve Improvement with Obvious Scenes Only
Figure 4.27: Overlapping Data Handling
Figure 4.28: Variance of Regularity Score Enhancement for Video Version 1
Figure 4.29: Variance of Regularity Score Enhancement for Video Version 2
Figure 4.30: Variance of Regularity Score Enhancement for Video Version 3
Figure 4.31: ROC Curve Enhancement for Video Version 1
Figure 4.32: ROC Curve Enhancement for Video Version 2
Figure 4.33: ROC Curve Enhancement for Video Version 3
Figure 5.1: Night Traffic Scene
LIST OF SYMBOLS / ABBREVIATIONS
2D 2 Dimensions
3D 3 Dimensions
CCTV Closed-Circuit Television
CNN Convolutional Neural Network
EER Equal Error Rate
HDF5 Hierarchical Data Format 5
ROC Receiver Operating Characteristic
RGB Red, Green, Blue
RNN Recurrent Neural Network
CHAPTER 1
1 INTRODUCTION
1.1 Background
In the current era, CCTV cameras are everywhere, recording every scene every
second.
Monitoring and abstracting meaningful events and activities from long-hour
videos is hectic and challenging: human supervision is required to watch the scenes,
which consumes both time and energy. Humans become tired and dizzy after watching
hours of footage in which the scene stays almost the same unless something unusual
happens. In short, even though meaningful events and activities make up only a small
fraction of a long-hour video, a human still needs to monitor all the video sequences,
most of which are meaningless.
Moreover, in the traditional approach to video analysis, human supervision is
needed to watch, learn and determine the motion patterns in the video scene that
correspond to normal behaviour. A human may nevertheless miss some of the
significant or meaningful events displayed on the monitor due to fatigue and lapses
in attention.
Over the years, more and more video analytic tools have been developed to
minimize this dependency on humans. However, the existing tools are not yet up to
the task: they require complicated configuration and high maintenance, and they often
trigger false alarms, which keeps operators busy handling alarms and leaves the tools
neither effective nor efficient at video analysis.
As technology advances, deep learning has emerged: it can learn patterns from
data and, on some tasks, perform as well as or even better than a human. Using a deep
learning approach, we built a model to detect anomalies in a video scene.
Before the analysis and detection process starts, the model is first trained on
normal video scenes. After training, the experienced model is capable of detecting
abnormal scenes, that is, scenes that never appeared in the normal training videos. The
model learns and calculates a regularity score for each image frame and video scene
using spatial-temporal (space and time) characteristics, so human supervision is not
required to watch all the footage. If a scene is calculated to have a low regularity score,
that scene is something the model has never seen before, and an anomaly alarm is
triggered (Hasan et al., 2016).
By applying the deep learning approach to video analytics, the machine
performs the video analysis instead of a human watching all the long-hour videos one
by one, minimizing the workload and dependency on humans. In return, better and
higher-end hardware is required to perform the analysis quickly.
1.2 Problem Statement
1.2.1 Tons of Data Generated from Time to Time
In the current era, surveillance cameras are everywhere, recording video every
second and generating more and more data. Most of this footage is recorded just for
the sake of recording; only a very small fraction of it is ever processed and analysed.
At the same time, across a long-hour video, only a small portion of the data is
meaningful, and it is difficult, if not impossible, for a human to watch the full video
and pick out the meaningful scenes.
Figure 1.1: Control Room with Many CCTV Monitors
Figure 1.1 shows a control room with many CCTV monitors, each displaying
the current situation of a scene. As shown in the figure, there are many monitors that
a human needs to supervise, and it is impossible for the human eye to pay attention to
every monitor at the same moment. Humans also become tired, dizzy and fatigued
after watching long hours of video. Research suggests that the human attention span
drops to 20 minutes or less, degrading the ability to distinguish normal from abnormal
video scenes (Kohn 2014).
With the traditional approach, sustained human supervision is required to
identify when any abnormal activity occurs. With too many monitors to watch at the
same time, control room staff will miss some abnormal scenes and activities, as it is
difficult to focus on every scene at once.
In another situation, when a crime has already happened and the police want to
replay the recorded video to examine the crime scene, they must scrub through hours
of footage to find the moment the crime occurred, which is very time consuming and
tiring.
1.2.2 Existing Video Analytic Tools Often Trigger False Alarms and Are Not
Adaptive (Dalli n.d.)
There are existing video analytic tools on the market, but they often trigger false
alarms and do not adapt to the real environment.
Figure 1.2: Video Analytic Tool using 2D Virtual Fencing Approach (Facial
Recognition & Video Analytic Software 2016)
Figure 1.3: Surveillance Detected by Virtual Fencing Approach (Facial Recognition
& Video Analytic Software 2016)
The video analytic tool in Figure 1.2 uses the virtual fencing approach, one of
the rule-based approaches. It creates a virtual fence and triggers an alert when an object
such as a vehicle, a person or an abnormal object moves across the fence. Anything
that breaches the virtual fence (Figure 1.3) triggers an abnormality alarm, which may
or may not be correct.
Moreover, the virtual fence created in Figure 1.2 is applicable to that particular
scene only. To apply the system to another situation, a new trip wire needs to be
designed and created based on the behaviour and use case of that situation. As a result,
the virtual fencing approach is not adaptive, and configuration is required for every
new camera scene. In addition, the virtual fencing approach only detects objects that
cross the trip wire; any abnormal event that happens without breaching the trip wire
will not be detected.
1.2.3 Existing Video Analytic Tools Involve Complicated Configuration during
Setup and Are Difficult to Maintain (Narciso 2014)
Setting up a video analytic tool for even one scene is not simple or direct; the process
requires extensive configuration (Honovich, 2008). As shown in Figure 1.2, a virtual
fence created for anomaly detection is applicable to that particular scene only, and a
new setup process is required whenever the scene is different.
At the same time, existing video analytic tools carry a high maintenance cost
for handling false alarms. Nature produces normal events such as wind blowing,
weather changes, light intensity changes and other environmental changes (Honovich,
2008). These events are normal to a human being but abnormal to the machine: any
such change may be flagged as abnormal by the tool as long as it crosses the trip wire.
As a result, the tool keeps triggering false alarms.
Maintenance is required to keep up with these environmental conditions and
events, but it is hectic, expensive and time consuming.
1.3 Project Objective
To develop a traffic scene anomaly detector using a deep learning approach
To demonstrate the concept and approach of deep learning and autoencoders in
detecting anomalies in traffic scenes
To calculate the accuracy and evaluate the effectiveness of the detector in
performing anomaly detection
1.4 Scope
1.4.1 Deliverable
In this project, a traffic scene anomaly detector will be developed using a deep
learning approach. Before detection starts, the detector will be trained on a series of
actual traffic videos consisting of normal traffic scenes only.
After the detector is trained, an unseen traffic scene will be used as the testing
video dataset. During the detection phase, the system performs back-end calculations
and displays the abnormal scenes, that is, video scenes with a low regularity score
that the model has not seen before.
1.4.2 Modules Covered
Autoencoder model implementation
Pre-processing of input video scene
Traffic scene anomaly detection
Back-end calculation and evaluation method
1.4.3 Modules Not Covered
Night scene anomaly detection
Weather change scene anomaly detection
Classification of the type of anomaly detected
Object detection
Movable/rotating camera scenes
CHAPTER 2
2 RELATED WORK
2.1 Existing Approaches
In video analytics and surveillance detection market, there are some existing tools
using rule based approach.
2.1.1 Rule-Based Approach Detection
With rule-based detection, the developer needs to define a pool of rules describing the
situations to be caught. Different scenes have different natural events, so each scene
needs its own pool of rules. Conversely, if an abnormal event occurs that does not
belong to the defined pool of rules, the system will not trigger and no anomaly is
detected, even though an abnormal event has happened. Performing anomaly detection
via the rule-based approach indirectly leads to three major problems (Honovich, 2008).
Expensive setup cost and extensive configuration
Every scene has its own nature and its own events, so every scene has different
abnormal events to be detected. With the rule-based approach, a new pool of rules
needs to be defined for each new scene. In addition, a lot of back-end configuration
and algorithms need to be set up for the particular camera before the detection
process can start. Setting up a rule-based video analytic tool can take hours or days
(Narciso 2014).
High maintenance cost and periodic maintenance
As mentioned before, the configuration is designed and defined for the current
scene view only, so the camera needs to remain static. If the camera moves or
shakes, the analytical performance of the configured rules is affected; even a very
small camera movement can make the pool of rules inaccurate, so maintenance is
required. Likewise, a massive change in the scene background causes the same
problem, and the defined rules need to be updated and maintained (Honovich,
2008).
Frequent false alarms
A system that often triggers false alarms is practically ineffective and inefficient,
as it keeps humans busy handling alarms for what are actually normal scenes.
One condition that triggers false alarms is a change in the weather (Honovich,
2008). The sun rises in the east and sets in the west, so from morning to evening
the sun's position changes continuously and the shadows of objects move
proportionally. When a shadow moves and happens to cut the trip wire, a false
alarm is triggered. To a human, a moving shadow is perfectly normal, but the
machine does not know what a shadow is; it only sees something crossing the trip
wire and triggers a false alarm.
2.1.1.1 Virtual Fencing Approach
The virtual fencing technique is one of the rule-based approaches: a virtual fence or
trip wire is designed and created for a particular camera scene. As shown in Figure 1.2,
trip wires are implemented for surveillance and anomaly detection (Facial Recognition
& Video Analytic Software 2016). The machine flags an event as an anomaly when
something breaches the trip wire. The trip wire is meant to handle the real 3D scene,
but the camera can only see in 2D. In nature there are situations such as a bird flying
past; the 2D camera view sees this as an object passing through the trip wire and
triggers a false alarm. Many exclusions need to be made to maximize performance.
In short, virtual fencing inherits the unavoidable problems of the rule-based approach:
false alarms and tedious configuration.
Virtual fencing is suitable for situations with fewer events, for example animal
control within a barn (Butler et al., 2006). In an animal barn far fewer events occur
than in a traffic scene with its extraneous, complicated and unexpected changes. The
trip wire is set up at the fence, and the situation is simple: if an animal crosses the
fence, the alarm is triggered. In short, virtual fencing is not capable of handling the
vast changes of the traffic problem and environment, and it is not suitable for general
video analytics.
2.2 Deep Learning Technique
The deep learning approach is able to overcome the problems posed by the rule-based
approach.
2.2.1 Convolutional Neural Network (CNN)
A convolutional neural network is a feedforward network built from convolution
kernels: the input layer receives the input data, which is convolved to learn features,
producing a tensor of outputs (Keras, 2016). When an input image is fed into the
convolutional network, the network extracts the features of the image and stores
information about each pixel in a feature map.
When creating a convolutional neural network, hyperparameters such as the
number of filters, the filter width and height, and the stride need to be set before
training starts. The hyperparameters of each layer determine the output size of that
layer and hence the input size of the next layer.
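As a small illustration, the Keras sketch below (a minimal sketch assuming the
Keras 2 API; the layer sizes are illustrative, not this project's final configuration)
shows how the filter count, kernel size and stride of one layer fix the input shape
of the next:

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
# 128 filters of 11 x 11 sliding with stride 4 over a 227 x 227 grayscale frame
model.add(Conv2D(128, (11, 11), strides=(4, 4), padding='valid',
                 activation='relu', input_shape=(227, 227, 1)))
model.summary()  # output shape (None, 55, 55, 128) becomes the next layer's input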
2.2.2 Long Short-Term Memory (LSTM)
Long Short-Term Memory is an improved version of the RNN that is able to learn
long-term dependencies. Like an RNN, an LSTM is a chain of recursive neural
network modules; in a standard RNN, each repeating module contains a single tanh
activation layer.
Figure 2.1: Repeating Module in a Standard RNN
Figure 2.2: Repeating Module in an LSTM Containing 4 Layers
The repeating module in an LSTM, by contrast, contains four neural network
layers, a different structure from the standard RNN with its single layer. This
architecture allows the LSTM to learn, understand and remember information and
trends, accumulated over long or short durations, without them being “forgotten” and
discarded.
2.2.3 Building Autoencoder
The autoencoder is a model architecture well suited to anomaly detection.
Autoencoding consists of a compression and a decompression process used for
learning; the compression and decompression stages are known as the encoder and
the decoder respectively.
Figure 2.3: Autoencoder Sample (Chollet 2016)
As shown in Figure 2.3, an image is fed into the encoder. The encoder
compresses the image into smaller and smaller feature maps that retain the important
information. After learning in this compact representation, the decoder expands the
feature maps back to the size of the input image, reconstructing an output image of
the same size based on what was learnt during compression.
To evaluate the learning process, the output is compared with the input. The
loss value measures how much information was lost: a low loss value means the
autoencoder has learnt well. In the example in Figure 2.3, the input image is a
handwritten digit “2”, and the output image has roughly 95% similarity. Before we
can start detecting, the model must first learn well.
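The following minimal convolutional autoencoder sketch, in the spirit of the
Chollet (2016) tutorial cited above, illustrates the encoder/decoder idea; the
28 x 28 input size and the layer sizes are illustrative assumptions:

from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D

inp = Input(shape=(28, 28, 1))
# Encoder: compress into fewer, smaller feature maps
x = Conv2D(16, (3, 3), activation='relu', padding='same')(inp)
x = MaxPooling2D((2, 2), padding='same')(x)
encoded = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
# Decoder: expand back to the input size
x = UpSampling2D((2, 2))(encoded)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(inp, decoded)
# The loss compares reconstruction with input, so no labels are needed
autoencoder.compile(optimizer='adadelta', loss='mse')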
On the other hand, there are several types of autoencoder, and the right one
must be chosen for the use case. For example, to learn a video dataset with spatial-
temporal characteristics, a spatiotemporal autoencoder is needed. At the same time,
it is not trivial to implement the spatiotemporal autoencoder layers so that they fit
and accept volumes of video data. Even when the model fits the data, it will not
necessarily learn the input and reconstruct the output with minimal loss; many
hyperparameters and constraints must be taken care of to design a high-performance
architecture.
2.2.4 Application of Autoencoder in Anomaly Detection
The regularity score is one method used to detect anomalies (Hasan et al., 2016).
Each scene undergoes a regularity score calculation. To visualize the differences, a
graph of regularity score against frame number is used, with scores ranging from 0
to 1. A regularity benchmark can then be set according to the detection sensitivity
required: if a scene has a regularity score lower than the benchmark, the scene is
abnormal. The benchmark must not be set too high, or the sensitivity becomes
excessive and the system constantly triggers false alarms.
By calculating a regularity score for each scene, the system fulfils the purpose
of this project, raising an anomaly alert whenever scenes with a low regularity score
are found.
2.3 Summary
The rule-based approach is not suitable for video surveillance purposes, though it
may be suitable for situations with fewer events. Applying the rule-based approach
to a video surveillance system causes three major problems: high setup cost, high
maintenance cost and false alarms.
The inefficiency of the rule-based approach can be overcome by deep learning.
An autoencoder, one of the deep learning techniques, is able to learn the situation
and rate the abnormality of a scene using back-end calculation. To create an effective,
learnable model, the hyperparameters need to be fine-tuned, and the model is
evaluated via its loss value.
CHAPTER 3
3 PROPOSED SOLUTION
3.1 Deep Learning Approach via Data-Driven Approach
In this project, a deep learning approach will be used to create an autoencoder model
with a convolutional spatiotemporal architecture, following a data-driven approach.
With the existing techniques used in video analytics, extensive configuration
and setup must be done for each camera view (Honovich 2008), including the rules
and cases to be detected or identified. In contrast, there are countless possible events
in any particular scene, both normal and abnormal; the programmer's logic and
algorithms may therefore fail to capture some anomalies. At the same time, the logic
and rules apply only to that particular scene, and a large change in the background,
for example a weather change or a shadow moving with the position of the sun, may
trigger a false alarm even though it is normal in nature (Narciso 2014).
With the deep learning approach, an autoencoder model is developed and
trained on real traffic scene data. Suppose, for example, the model is trained on five
days of real footage, and the traffic in those five days contains only normal scenes.
After five days, the machine is used for classification, taking everything learnt over
those five days as its benchmark. If the machine sees something with a large variance
and difference from the benchmark, the alarm is triggered, as an anomaly has been
detected.
Figure 3.1: Sample Traffic Scene with Normal and Abnormal Driving Direction
Using the example in Figure 3.1, the machine is trained for three days, and all
the traffic scenes in those three days show the normal driving direction only (green
path). After the training phase, the machine is used in the classification phase to
determine abnormal driving behaviour. Any car that does not drive in the normal
direction is considered an anomaly; the red path is a sample abnormal driving
behaviour, and it will be detected as an abnormal event.
This data-driven, environment-driven approach overcomes the problems of the
rule-based approach.
3.2 System Overview
Figure 3.2: System Overview
Figure 3.2 shows the overall system to be implemented. The system starts with the
training phase. Using a video dataset with normal traffic scenes only, each video is
transformed into image frames, which undergo pre-processing. After pre-processing,
the images are converted into Numpy arrays and stored in HDF5 format. At the same
time, the image arrays are stacked into volumes to impose spatial-temporal
characteristics, for example eight frames per volume. After this formatting, the data
is fitted to the autoencoder to train the model. The weights are updated and saved
every epoch, so the model should theoretically become more and more “clever”. To
visualize the model's training performance, a loss graph is used to inspect the loss at
every epoch.
For the classification phase, the video dataset is real, unseen traffic scene data
consisting of normal and abnormal scenes. The data undergoes the same processing
as in the training phase and is formatted as HDF5. The model then uses the trained
weights with the lowest loss value to classify and detect anomalies. Every scene
undergoes the back-end calculation of the regularity score; if a scene has a regularity
score lower than the benchmark, it is considered an anomalous event. Finally, the
predicted output is evaluated using the ROC curve, which records the true positive
rate and false positive rate of the predictions made.
3.3 Technology and Techniques Involved
3.3.1 Deep Learning Framework
To implement a spatiotemporal autoencoder, the Keras deep learning framework
(Keras 2016) provides complete documentation, functionality and support for
implementing each layer of the autoencoder. Keras also provides clear documentation
with parameter explanations as well as examples for implementing the different layers
and functionality.
For further detail, the Keras GitHub repository can be consulted for the concrete
implementations and default values of certain parameters.
3.3.2 Autoencoder Model Architecture
Figure 3.3: Model Architecture
The model is built according to the architecture of a simple autoencoder, consisting
of an encoder and a decoder. The current design, developed in Keras, consists of a
spatiotemporal encoder and a spatiotemporal decoder that learn and handle the width,
height and time dimensions of the input dataset.
The encoder consists of Conv2D layers with the TimeDistributed wrapper. The
Conv2D layers handle the width and height of the data, while the TimeDistributed
wrapper applies them individually to every temporal slice of the input. The filter
count and the width and height decrease layer by layer through the encoder. After the
Conv2D layers, the features pass through the temporal encoder, a ConvLSTM2D
layer that learns the temporal characteristics. The recurrent network in ConvLSTM2D
operates only over the temporal dimension of the data, while using local convolutions
to handle the spatial values (Patraucean, Handa, & Cipolla, 2016). The ConvLSTM2D
layer is wrapped in the Bidirectional wrapper provided in Keras so that the RNN can
capture more temporal context.
In the decoder, the deconvolutional layers grow back up, mirroring the size
pattern of the encoder, until the learnt features are reconstructed into the same shape
as the input. In addition, after each processing and learning layer, the model passes
through a batch normalization layer, which normalizes the feature values, prevents
extreme differences in value, enhances model performance and accelerates learning.
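A minimal sketch of such an architecture is shown below, assuming channels-last
input of shape (8, 227, 227, 1); the filter counts and kernel sizes are illustrative
rather than the project's exact configuration, and the Bidirectional wrapper is
noted in a comment rather than applied:

from keras.models import Sequential
from keras.layers import (TimeDistributed, Conv2D, Conv2DTranspose,
                          ConvLSTM2D, BatchNormalization)

model = Sequential()
# Spatial encoder: the same Conv2D is applied to each of the 8 frames
model.add(TimeDistributed(Conv2D(128, (11, 11), strides=(4, 4), activation='relu'),
                          input_shape=(8, 227, 227, 1)))
model.add(BatchNormalization())
model.add(TimeDistributed(Conv2D(64, (5, 5), strides=(2, 2), activation='relu')))
model.add(BatchNormalization())
# Temporal encoder: a convolutional LSTM runs over the 8-frame axis
# (the project additionally wraps this layer in Keras's Bidirectional wrapper)
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True))
model.add(BatchNormalization())
# Spatial decoder: mirror the encoder back up to the input size
model.add(TimeDistributed(Conv2DTranspose(64, (5, 5), strides=(2, 2),
                                          activation='relu')))
model.add(BatchNormalization())
model.add(TimeDistributed(Conv2DTranspose(1, (11, 11), strides=(4, 4),
                                          activation='sigmoid')))
model.compile(optimizer='adam', loss='mse')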
3.3.3 Datasets Pre-processing
3.3.3.1 Nature and Content of Training Dataset
An unexpected traffic scene, such as an accident, is an example of an abnormal scene.
The current dataset does not contain such unlucky scenes. To fulfil the use case of
anomaly detection, only normal driving behaviour in one direction is used as the
training dataset, since on the highway drivers normally drive normal-sized (sedan)
vehicles in the same direction. Any traffic scene with a different driving behaviour,
such as a vehicle stopping on the shoulder of the highway, will be detected as an
anomaly; an anomaly is something that deviates from the normal behaviour and trend.
In addition, scenes with large vehicles such as trucks are also treated as anomalies,
to increase the proportion of abnormal scenes.
3.3.3.2 Convert Video to Image Frames
It is impossible to feed the whole video straight into the autoencoder, so the video
dataset needs to be transformed into image frames to ease the pre-processing steps.
AVCONV is one of the tools from libav, an open-source audio and video
processing suite (Libav 2016). AVCONV allows us to convert video to image frames,
for example converting each second of video into 20 image frames at a framerate of
20 frames/sec. The original number of frames per second of the video must of course
be known before converting; converting to a lower framerate than the original may
lose the semantic meaning of a given second.
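As a sketch, the conversion can be scripted from Python; the file names below are
hypothetical, and -r sets the output framerate of 20 frames/sec described above:

import subprocess

subprocess.check_call([
    'avconv', '-i', 'traffic.avi',    # input video (hypothetical name)
    '-r', '20',                       # extract 20 image frames per second
    'frames/frame-%05d.jpg',          # numbered output frames
])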
3.3.3.3 Grayscale
After the image frames are extracted from the video dataset, the original images are
colour images with three colour channels (red, green and blue). They are converted
into grayscale images. Colour images would be harder for the detector to learn from,
and training on them would take approximately three times as long as training on
grayscale images. At the same time, grayscale reduces the colour variation,
preventing it from confusing the detector and reducing accuracy.
Figure 3.4: Grayscale Image
3.3.3.4 Image Cropping
The current dataset records two-way traffic, where the left lane drives away from the
camera and the right lane drives towards it. Since the camera is placed over the right
lane, vehicles on the left lane are difficult to see even for a human, being further away
and smaller. The camera's field of view also includes the sky, a construction site and
tall buildings.
In the current approach, the left lane of the traffic and the recorded sky are
cropped out for denoising purposes. 150px is cropped from both the width and the
height of every original frame, resulting in a smaller, more focused scene of 570px
width and 426px height.
Figure 3.5: Cropping Image
3.3.3.5 Image Resizing
Resizing shrinks the original image. With the Plus Highway dataset, the cropped
frame is 570px wide and 426px high (570 x 426). This is still too large, and the
detector would take more time to understand the image, so in the current approach
all image frames are resized to 227px wide and 227px high (227 x 227).
Figure 3.6: Image Resize
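A sketch of the per-frame pre-processing with OpenCV is shown below; the 150px
crop and the 227 x 227 target follow the text, while which edges are cropped and
the scaling to [0, 1] are assumptions:

import cv2

def preprocess(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # 720 x 576 frame, 1 channel
    img = img[150:, :570]                         # drop 150px of sky and far lane
    img = cv2.resize(img, (227, 227))             # shrink to the model input size
    return img.astype('float32') / 255.0          # scale pixel values to [0, 1]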
3.3.4 Data Formatting
3.3.4.1 Storing images into volume
Plain 2D images have only width and height constraints; they carry no
spatiotemporal characteristics and no motion information. To introduce
spatiotemporal characteristics, the images are stacked into volumes, for example
eight image frames per volume, producing 3D (time, width, height) data. With
spatiotemporal characteristics, the detector can understand and learn the motion in
the video.
Figure 3.7: Image Frames Stored in Volumes
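A sketch of the stacking step with Numpy; frames is assumed to be an array of
pre-processed frames of shape (num_frames, 227, 227):

import numpy as np

def to_volumes(frames, length=8):
    n = (len(frames) // length) * length          # drop any leftover frames
    vols = frames[:n].reshape(-1, length, 227, 227)
    return vols[:, np.newaxis]                    # channel axis -> (X, 1, 8, 227, 227)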
3.3.4.2 Data Storage Formatting (HDF5)
Storing the data in three dimensions causes another problem, the curse of
dimensionality. Such high-dimensional data consumes a lot of memory when training
the detector; if the whole dataset were fed to the machine at once, even a high-
specification device could run out of memory. To overcome this, the data is stored
in Hierarchical Data Format 5 (HDF5). With HDF5, the input volumes can be read
matrix by matrix instead of loading the whole dataset into memory, so the memory
usage is much lower.
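A sketch with the h5py library; the file and dataset names are hypothetical, and
vols is the volume array from the previous sketch:

import h5py

# Write the volumes once, chunked so slices can be read without loading the rest
with h5py.File('train_volumes.h5', 'w') as f:
    f.create_dataset('volumes', data=vols, chunks=(100, 1, 8, 227, 227))

# Later, read only a slice; just this slice is pulled into memory
with h5py.File('train_volumes.h5', 'r') as f:
    batch = f['volumes'][0:1000]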
3.4 Evaluation Method
3.4.1 Training Phase Evaluation
3.4.1.1 Mean Square Error (MSE)
The model is configured to calculate a loss value during training. The loss value
displayed at each training epoch is the mean square error (MSE), the average of the
squared differences between the original and the estimated result. In this project, an
output is reconstructed from each input, and the loss is computed between input and
output; the smaller the loss, the better the model has learnt. An ideal, effective
learning process should produce a smaller and smaller loss value every epoch.
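In its standard form, for a reconstruction of N pixel values, the loss reported each
epoch is

MSE = \frac{1}{N} \sum_{i=1}^{N} (I_i - \hat{I}_i)^2

where I_i is an input pixel value and \hat{I}_i the corresponding reconstructed value.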
3.4.2 Classification Phase Evaluation
3.4.2.1 Regularity Score
The output reconstructed by the autoencoder is compared with the original input to
calculate the regularity score. If the regularity score is lower than the benchmark, an
abnormality has occurred in the scene.
3.4.2.2 Area Under Curve (AUC), Equal Error Rate (EER), Receiver Operating
Characteristic (ROC) Curve
The accuracy of the system is very important. Using the ROC curve, we can rate and
calculate the positive predictive level of the system (Tran et al., 2015). The ROC
curve is plotted from the True Positive Rate (TPR) and the False Positive Rate (FPR).
The Equal Error Rate is the point where the False Positive Rate meets the False
Negative Rate; the smaller the EER, the better the performance. The tables below
show the confusion matrix of the results, where a true positive is a detected abnormal
event, a false positive is a false alarm, a true negative is a detected normal event and
a false negative is a missed abnormal event.
Table 3.1: Confusion Matrix

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

Table 3.2: Confusion Matrix with Description

                    Predicted Positive      Predicted Negative
Actual Positive     Anomalous Event         Missed Event
Actual Negative     False Alarm             Normal Event
Figure 3.8: ROC Curve
Using the TPR and FPR, we are able to plot the ROC curve and rate the
performance and positive predictive level of the model. The area under the curve
(AUC) is calculated from the TPR and FPR using the Scikit-Learn library. A model
with a high positive predictive level has many true positive events and few or no
false positive events; indirectly, such a model will seldom or never trigger a false
alarm.
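A sketch of this calculation with Scikit-Learn is shown below; labels is the
per-frame ground truth and scores the per-frame anomaly score (for example
1 minus the regularity score), both hypothetical names:

import numpy as np
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(labels, scores)
roc_auc = auc(fpr, tpr)
# EER: the threshold where the false positive rate meets the false negative rate
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fpr - fnr))]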
CHAPTER 4
4 EXPERIMENT, RESULT AND ANALYSIS
4.1 Dataset
The traffic scenes used for this experiment are Plus Highway traffic scenes. The video
used for training was recorded in the afternoon, under normal light intensity. The
training dataset contains normal traffic scenes only, in which vehicles travel in one
direction and no vehicle stops on the highway; any other kind of traffic scene was
filtered out before training.
Figure 4.1: Sample Normal Traffic Scene
Figure 4.1 shows a sample video scene from the training dataset, with some
cars driving through the scene.
4.1.1 Dataset Cropping
In the original video dataset, the frame width and height are 720 x 576.
Figure 4.2: Original Traffic Scene
Figure 4.2 shows the original scene of the video dataset: a two-way traffic
condition on the Federal Highway. Because the camera is mounted quite high, the
field of view includes the sky, a construction site, tall buildings and trees.
Figure 4.3: Static Area Highlighted in Blue
Figure 4.3 highlights in blue the elements described above. After watching
most of the video dataset, the highlighted blue region, which covers nearly 35 percent
of the entire camera view, shows no change in activity and remains constant for
almost 95 percent of the footage. Even when activity does occur in the blue region,
it is difficult to see and identify, even for human sight, because the distance is too far
and objects appear very small.
As most of the blue region remains constant, it should be cropped out and
removed for the training and testing phases. If the whole scene were fed into the
training phase, the machine would learn the constant blue region as normal scene.
During the testing phase, every pixel of the image frame is taken into account in the
regularity score calculation. Since the blue region covers almost 35 percent of the
frame, while an abnormal event region covers only 5 to 10 percent, the constant blue
region could neutralize the effect of an abnormal event on the regularity score. This
would hurt the machine's performance and cause events to be missed.
Figure 4.4: Cropped Traffic Scene
To overcome this, the training and testing datasets are cropped, which also
removes objects that are far away and small.
4.1.2 Data Augmentation
For training purposes, a series of videos with normal scenes only is required. Using
one of the videos as input data, abnormal events were filtered out during pre-
processing to generate a clean training video with normal scenes only. As the training
video is only around 30 minutes long, more data is needed to train the autoencoder
model: the number of parameters in the model is large, and the given training video
alone is not sufficient (Hasan et al., 2016).
To increase the training dataset, augmentation is performed in the temporal
dimension of the data. Each training sample keeps the same shape of eight frames,
but with various stride (frame-skipping) patterns. In the current augmentation
approach, the input dataset is concatenated from frames sampled with stride-1,
stride-2, stride-3 and stride-4. The stride-1 pattern samples the data volume with
consecutive frames, while stride-2, stride-3 and stride-4 skip one, two and three
frames respectively, forming the patterns {1,2,3,4,5,6,7,8}, {1,3,5,7,9,11,13,15},
{1,4,7,10,13,16,19,22} and {1,5,9,13,17,21,25,29}.
With this approach, the training dataset was expanded from 4092 volumes to
9795 volumes for better training.
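A Numpy sketch of the stride sampling is shown below; how far successive volumes
are advanced within each stride is an assumption, as the text only fixes the skipping
patterns:

import numpy as np

def stride_volumes(frames, length=8, strides=(1, 2, 3, 4)):
    vols = []
    for s in strides:
        span = (length - 1) * s + 1                # frames covered by one volume
        for start in range(0, len(frames) - span + 1, span):
            idx = start + s * np.arange(length)    # e.g. {0,2,4,...} for stride 2
            vols.append(frames[idx])
    return np.stack(vols)[:, np.newaxis]           # shape (X, 1, 8, 227, 227)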
4.2 Batch Processing
For testing purposes, an unseen one-hour video scene is processed into the required
format of (X, 1, 8, 227, 227), where X is the total number of volumes. A one-hour
video is transformed into an HDF5 file as large as 32GB. During the testing phase,
that 32GB would be loaded into memory, and the predicted output adds around
another 15GB.
After the testing phase, the evaluation phase requires data from both the original
test data (32GB) and the predicted output (15GB), around 50GB in total, for the
regularity score and ROC calculations. As the LXC container is limited to 40GB of
memory, the PC cannot hold that much data: the process is killed once the memory
is fully utilized.
Figure 4.5: Job Killed Due to Memory Limit
As the volumized data requires a large amount of space and memory, batch
loading is required. Batch loading avoids loading all the data into memory at once
and exhausting the memory space. In the current approach, each batch loads only
1000 volumes and performs the calculations for those 1000 volumes, keeping the
memory usage within the utilization range.
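A sketch of the batched evaluation loop over the HDF5 test file; the file and dataset
names are hypothetical, and model is the trained autoencoder from Chapter 3:

import h5py

with h5py.File('test_volumes.h5', 'r') as f:
    data = f['volumes']                        # stays on disk until sliced
    for start in range(0, data.shape[0], 1000):
        batch = data[start:start + 1000]       # only 1000 volumes in memory
        recon = model.predict(batch)
        # ... accumulate per-frame reconstruction errors for this batch ...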
4.3 Evaluation Phase
4.3.1 Training Phase Visualization
Figure 4.6: Training Epochs Loss Value
Using the augmented training video dataset, the autoencoder was trained for 50
epochs. The loss value recorded is the mean squared error (MSE), which measures
the average squared error/deviation and represents how well the model is learning
the features of the input dataset. The lower the loss value, the better the model has
learnt the features; the decreasing trend also shows that the model learns better after
each epoch. However, a practically low loss value during the training phase does not
guarantee accurate results during testing; the model still needs to be tested on an
unseen dataset.
During the training phase, the model weights are saved at every epoch, as
different weights have different loss values. During the testing phase, the trained
weights with the lowest loss value are used.
4.3.2 Reconstruction Error
Using the trained model, the model predicts and analyses the unseen data. The
output of the prediction and analysis has the same format as the input, which is
9000 x 1 x 8 x 227 x 227.
The reconstruction error is calculated from the Euclidean distance between
each frame of the original dataset and the predicted output. To get the reconstruction
error for one frame, the Euclidean distance is calculated for each pixel, and the sum
over every pixel of the frame is the reconstruction error of that frame.
e(t, x, y) = \lVert I(t, x, y) - f_W(I(t, x, y)) \rVert_2

e(t) = \sum_{(x, y)} e(t, x, y)

where
e = reconstruction error
t, x, y = frame number, width/x position and height/y position
I = original test data
f_W = predicted output from the test data
e(t) = sum of the reconstruction error over every pixel in frame t
4.3.3 Normalization
Each reconstruction error is normalized using the formula below, scaling the
reconstruction errors to the range 0 to 1. Normalization allows easier visualization
of the regularity score.
e_i = \frac{e_i - E_{min}}{E_{max} - E_{min}}

where
E_{min} = minimum value of the errors E
E_{max} = maximum value of the errors E
4.3.4 Regularity Score
To get the regularity score of a frame, the normalized reconstruction error is
subtracted from 1. A frame with a high regularity score is normal; if a frame has a
practically low regularity score, the frame is abnormal.

s(t) = 1 - e(t)

where
s(t) = regularity score for frame t
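The three steps above (reconstruction error, normalization and score) can be sketched
in a few lines of Numpy; test and recon are hypothetical arrays of original and
reconstructed frames of shape (num_frames, 227, 227):

import numpy as np

def regularity_scores(test, recon):
    e = np.abs(test - recon).sum(axis=(1, 2))   # e(t): per-frame sum of pixel errors
    e = (e - e.min()) / (e.max() - e.min())     # normalize errors to [0, 1]
    return 1.0 - e                              # s(t) = 1 - e(t)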
4.4 Result
Each raw unseen dataset is transformed into the volumized input shape required by
the model for testing. Each raw unseen dataset is one hour long, generating 72000
frames at a rate of 20 frames/sec. After prediction, the regularity score is calculated
for each image frame of the video. The benchmark for anomaly detection is set at
0.4: any image frame with a regularity score lower than 0.4 is considered an abnormal
event/traffic scene.
4.4.1 Video Version 1
Version 1 is the first dataset used for anomaly detection. In video version 1, the light
intensity of the scene is the same as in the training dataset. Figure 4.7 shows the
regularity score calculated for each frame after prediction.
Figure 4.7: Regularity Score Calculated for Video Version 1
As mentioned, this dataset is used because buses stop by the highway to load
and unload passengers; this is the main anomaly event we planned to catch. Based
on the current prediction, scenes with large vehicles are detected as anomalies as
well.
Figure 4.8: Large Vehicle Detected in Video Version 1
The training dataset does contain scenes with large vehicles such as lorries and
trucks, but their shapes differ from the ones predicted here. Unexpectedly producing
a positive result, the large vehicles detected, as shown, are large vehicles that the
model has never seen before.
The other detected scene is the bus stopping by the highway. The event involves
a consecutive process: the bus slows down by the highway, stops, loads/unloads
passengers, then departs and leaves the camera view. All of these steps are detected
as anomalies by the model.
Figure 4.9: Bus Stopping Event Detected in Video Version 1
Figure 4.10 shows other bus stopping events detected in the current dataset version.
Figure 4.10: Other Bus Stopping Events Detected in Video Version 1
Figure 4.11 shows the ROC curve, AUC and EER calculated for video
version 1.
Figure 4.11: ROC Curve for Video Version 1
4.4.2 Video Version 2
Version 2 is the second dataset used for anomaly detection. In video version 2, the
light intensity of the scene is the same as in the training dataset. Figure 4.12 shows
the regularity score calculated for each frame after prediction.
Figure 4.12: Regularity Score Calculated for Video Version 2
Below are example scenes with regularity scores lower than 0.4. In the first
situation, the model detected a new abnormal event: instead of a bus stopping event,
the model was able to detect a smaller vehicle (a taxi) stopping. The detected event
includes the consecutive steps of the taxi slowing down, stopping, a passenger
approaching the taxi, and the passenger negotiating the price with the taxi driver.
The entire event is detected as an anomaly.
Figure 4.13: Taxi Stopping Event Detected in Video Version 2
The second situation detected in version 2 includes a bus stopping event and a
sedan car stopping behind the road signboard. As the two events occurred together,
it was initially assumed that the model was detecting the bus stopping event and
unable to detect the sedan car, which is less obvious and partially blocked by the
signboard. In the original dataset, the bus arrives, stops and departs, while the sedan
car remains stopped behind the signboard after the bus has departed. The predicted
output, however, shows that the model is able to detect the sedan car stopping behind
the signboard. This shows that the model is capable of detecting an abnormal event
even when it is partially blocked by a static object such as a signboard.
Figure 4.14: Bus Stopping Event and Vehicle Stopping Behind Signboard in Video
Version 2
Other abnormal events, such as bus stopping events and large vehicle events,
are also detected by the same model in this dataset.
Figure 4.15: Large Vehicle Event Detected in Video Version 2
Figure 4.16: Bus Stopping Event Detected in Video Version 2
In the version 2 dataset, however, most of the false alarms occurred in traffic
conditions with fewer vehicles driving in the scene.
Figure 4.17: Traffic Condition with Fewer Cars
Figure 4.18 shows the ROC curve, AUC and EER calculated for video
version 2.
Figure 4.18: ROC Curve for Video Version 2
4.4.3 Video Version 3
Version 3 is the third dataset used for anomaly detection. In video version 3, the light
intensity of the scene differs from the training dataset. The estimated time frame of
the dataset is around 7.00 a.m. to 8.00 a.m. In the first 30 minutes, the light intensity
is dimmer, as the sun has not yet risen; after the first 30 minutes, the sun rises
gradually.
Figure 4.19 shows the regularity score calculated for each frame after
prediction. Based on the result, the detection behaviour becomes extensively
abnormal in the second 30 minutes of the dataset.
Figure 4.19: Regularity Score Calculated for Video Version 3
At the beginning of the video, the sun has not yet risen and the light intensity
is slightly lower than in the training dataset. Because of this, some of the recorded
scenes are a little blurry compared with the clearer training dataset. Even so, the
model is still able to detect abnormal events: Figure 4.20 shows large vehicle events
detected even though the scene recorded is blurrier than the usual scene.
Figure 4.20: Large Vehicle Detected in Blur Condition in Video Version 3
Figure 4.21 shows a large vehicle event detected by the model in video version 3.
Figure 4.21: Large Vehicle Detected in Video Version 3
Figure 4.22 shows an event with two large vehicles detected at the same time.
Figure 4.22: Two Large Vehicles Detected in Video Version 3
In the second 30 minutes of the video, the sun starts to rise, at an estimated
time of around 7.30 a.m. The result shows all the frames after 7.30 a.m. having low
regularity scores and being detected as anomalies. A few factors cause this.
As the sun rises, more and stronger light rays are emitted into the scene.
Unfortunately, the light happens to reflect into the convex lens of the camera, and
the reflection inside the lens produces a bright focal point in the camera view. The
model has never seen this situation before, so it predicts it as an abnormal event with
a practically low regularity score.
Secondly, the camera faces east, the direction of the sunrise. As the sun rises
gradually in the east while the vehicles head west, every object casts a more and
more visible shadow, for instance the shadows of the cars and of the billboard. The
training dataset contains no scenes with object shadows, as it was recorded around
noon, when objects cast little or no shadow.
As shown in Figure 4.23, as the sun rises, the reflection focal point in the
camera lens becomes more obvious, while the shadow of each object becomes
smaller as the sun climbs higher.
Figure 4.23: Scene with Light Reflection and Shadow in Video Version 3
Figure 4.24 shows the ROC curve, AUC and EER calculated for video
version 3.
Figure 4.24: ROC Curve for Video Version 3
4.5 Ground Truth Labelling Method
In the first ground truth attempt, abnormal events were labelled as abnormal
(abnormal = 1) even when the abnormal event was unobvious and very small.
As shown in Figure 4.25, on the left there is one passenger waiting and
standing by the highway, while on the right a human is crossing the highway from
left to right. These are examples of unobvious abnormal events that were labelled
as abnormal.
Figure 4.25: Scene with Unobvious Anomaly
In the version 1 dataset, even though the model is able to detect abnormal
events, the AUC is extensively low, at only 0.548. A low AUC indirectly means the
model is inefficient; an AUC of 0.548 is close to random guessing.
Analysis of the detected scenes shows that all of them differ obviously from
the normal traffic scene. Unobvious events like the human walking shown in
Figure 4.25 cannot be detected at all.
By adjusting the ground truth values, setting only obvious abnormal scenes as
abnormal and unobvious abnormal scenes as normal, the AUC greatly increased
from 0.548 to 0.769 and the EER dropped from 0.463 to 0.312.
Figure 4.26: ROC Curve Improvement with Obvious Scenes Only
In short, this adjustment means that the model is only able to detect obvious
abnormal scenes with larger objects of interest.
4.6 Video Average Handling Method
As seen in all the previous regularity score results, consecutive video frames are
actually almost identical, yet the model calculates regularity scores with a big gap
between frames, a variance of around 0.4. The variance between consecutive frames
should not be this large, since consecutive frames are nearly the same.
To overcome this issue, the data handling and reconstruction error calculation
were redone with an overlapping and averaging approach. In the original processing
of the testing dataset, the volumes were stored consecutively with no overlap, in the
pattern {1,2,3,4,5,6,7,8}, {9,10,11,12,13,14,15,16}, ...,
{71993,71994,71995,71996,71997,71998,71999,72000}, so the reconstruction error
depended only on each frame of each volume.
In the current approach, data is processed and stored so that the last 4 frames of each volume overlap with the first 4 frames of the next volume. Only the first 4 frames (1,2,3,4) and the last 4 frames (71997,71998,71999,72000) have no overlap with another volume, since the first volume has no predecessor and the last volume has no successor.
This approach roughly doubles the size of the testing dataset to about 18000 volumes. During the evaluation phase, the reconstruction error of each frame is obtained by averaging the reconstruction errors predicted for that frame across the different volumes that contain it. For example, frame 9 appears as the 5th frame of the second volume and as the 1st frame of the third volume; its final reconstruction error is the sum of those two errors divided by 2.
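A minimal NumPy sketch of this overlapping-and-averaging scheme is shown below; the array layout and the per-frame error callback are illustrative assumptions rather than the report's actual implementation.

import numpy as np

VOLUME_LEN = 8  # frames per volume
STRIDE = 4      # consecutive volumes share 4 frames

def build_volumes(frames):
    # Slice a (num_frames, height, width) array into overlapping volumes.
    starts = list(range(0, len(frames) - VOLUME_LEN + 1, STRIDE))
    return np.stack([frames[s:s + VOLUME_LEN] for s in starts]), starts

def averaged_frame_errors(frames, per_frame_error):
    # Average the reconstruction error of each frame over every volume that
    # contains it (e.g. frame 9 is averaged over two volumes).
    volumes, starts = build_volumes(frames)
    err_sum = np.zeros(len(frames))
    err_cnt = np.zeros(len(frames))
    for vol, s in zip(volumes, starts):
        errors = per_frame_error(vol)  # one error value per frame in the volume
        err_sum[s:s + VOLUME_LEN] += errors
        err_cnt[s:s + VOLUME_LEN] += 1
    return err_sum / err_cnt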
The results show that overlapping-and-average handling reduces the variance in regularity score between consecutive frames by around 50 percent compared with the direct approach to storing the testing dataset and calculating the regularity score: the variance between continuous frames drops from around 0.4 to around 0.2.
The regularity scores generated via the averaging method produce the same curve shape as those of the normal storing approach. However, the reconstruction error of each frame is reduced, raising its regularity score, so frames that scored below 0.4 under the normal data handling approach score above 0.4 under the overlapping data handling approach. This reduces the number of anomalies detected, making the model more precise for each event.
Figure 4.27: Overlapping Data Handling
The same applies to the ROC curve, which is calculated from the reconstruction errors: each reconstruction error becomes more accurate and precise, matching the ground truth more closely and yielding a better AUC and EER.
Figure 4.28: Variance of Regularity Score Enhancement for Video Version 1
Figure 4.29: Variance of Regularity Score Enhancement for Video Version 2
Figure 4.30: Variance of Regularity Score Enhancement for Video Version 3
Figure 4.31: ROC Curve Enhancement for Video Version 1
Figure 4.32: ROC Curve Enhancement for Video Version 2
Figure 4.33: ROC Curve Enhancement for Video Version 3
4.7 Summary
4.7.1 Result Summary
Table 4.1: Result Summary

Dataset            | Expected Anomalous Events | Predicted Anomalous Events | False Alarms | AUC   | EER
1                  | 31                        | 12                         | 1            | 0.769 | 0.312
1 (Average Method) | 31                        | 7                          | 0            | 0.812 | 0.252
2                  | 32                        | 24                         | 16           | 0.697 | 0.363
2 (Average Method) | 32                        | 6                          | 4            | 0.751 | 0.316
3                  | 37                        | 14                         | ∞            | 0.547 | 0.443
3 (Average Method) | 37                        | 2                          | ∞            | 0.558 | 0.452
The result summary shows that even though the model misses many events (false negatives), the AUC remains fairly high. Inspection of the missed events across the 3 datasets shows that they all fall into the same category: large vehicle events. The model's behaviour on these is inconsistent; some large vehicle events are detected as anomalies while others are classified as normal.
A large vehicle remains in view for only around 3 seconds, so missing these events lowers the AUC and raises the EER only slightly, because the duration is very short. In contrast, the model successfully detects the vehicle-stopping events by the highway shoulder in dataset versions 1 and 2. The whole sequence is detected as an anomaly, and its duration is long, approximately 30 to 90 seconds; detecting such long, consecutive events improves the AUC and EER considerably because a large number of frames is involved.
The model also triggers false alarms, detecting normal traffic scenes as anomalies. Inspection shows that these false alarms occur in normal traffic scenes that happen to contain fewer vehicles than usual. For dataset version 3, the sun rises during the second half of the video, causing the model to detect every frame after roughly the 30-minute mark as an anomaly and generating an uncountable number of false alarms.
With the overlapping-and-average data method, the number of false alarms decreases, but the number of anomalous events detected decreases as well. Averaging makes the reconstruction error smaller than under the normal approach, indirectly making the per-frame prediction more precise and accurate; this is also why the AUC increases and the EER decreases. With the overlapping approach the output is more precise: the major abnormal events, the vehicle-stopping events, are still detected in dataset versions 1 and 2, while the number of minor abnormal events detected, the large vehicle events, drops drastically in dataset versions 1, 2 and 3.
4.7.2 Model Summary
The predictions across the 3 datasets show that the model is capable of detecting anomalous events in a traffic scene. However, it is only able to detect abnormal events with an obvious object of interest, such as bus-stopping, taxi-stopping, sedan-stopping and large vehicle events. Abnormal events with a smaller object of interest, such as a human crossing the highway or standing and waiting by the highway shoulder, go undetected: because a human is much smaller than a vehicle, the model is unable to "see" that the human exists.
Testing on dataset version 3 also exposed the sun-ray reflection in the camera lens at sunrise. Reflection is a perfectly normal natural phenomenon; in everyday life humans see reflections everywhere and ignore them as harmless. The model, however, has no such heuristic knowledge and cannot understand that it is merely a reflection of light, so instead of treating it as a normal natural event, it detects it as abnormal. From this result, it can be inferred that the current model is unable to handle natural events such as sun-ray reflection, and likewise weather changes, rain, thunderstorms and so on. Training with scenes containing such natural events is required to handle them effectively.
CHAPTER 5
5 CONCLUSION AND RECOMMENDATIONS
5.1 Conclusion
In this report, we have presented and demonstrated that a data-driven, unsupervised learning approach can produce a traffic anomaly detector. The model is able to recognize, learn and detect abnormal events it has never seen before.
CCTV cameras are everywhere, recording large amounts of valuable video, but analysing that video and detecting events manually demands a great deal of human time and effort. With this approach, the model can analyse the footage and identify the meaningful events among all the events in the video, minimizing the human workload so that minimal or no human supervision is required.
The spatiotemporal autoencoder receives 3D input, allowing the model to learn the appearance of each pixel as well as the motion across the consecutive frames within one volume. Learning motion characteristics allows the model to understand more about the behaviour of an event.
During anomaly identification, a regularity score is calculated from each frame's predicted output to decide whether the frame is anomalous; frames whose regularity score falls below the benchmark are considered anomalies. However, the current training dataset only covers traffic scenes free of practical natural conditions such as night, sun reflection and weather change, so handling anomaly detection across varied conditions with the same model will be very challenging.
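As a concrete illustration, the per-frame regularity score can be computed by min-max normalising the reconstruction errors, following Hasan et al. (2016); the sketch below assumes that formulation, with the 0.4 benchmark matching the one applied in Chapter 4.

import numpy as np

def regularity_scores(errors, benchmark=0.4):
    # Min-max normalised regularity score per frame: a high reconstruction
    # error maps to a low regularity score.
    e = np.asarray(errors, dtype=float)
    s = 1.0 - (e - e.min()) / (e.max() - e.min())
    return s, s < benchmark  # frames scoring below the benchmark are anomalous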
Using the Plus Highway video dataset as the experimental use case, the data was pre-processed and used to train the autoencoder. The autoencoder learns the situational behaviour of the traffic; a traffic scene whose regularity score falls below the benchmark is considered behaviour never seen before, i.e. an anomalous event. The detection results show that the autoencoder is able to detect anomalous scenes, achieving an average AUC of around 0.7 across the different datasets.
5.2 Recommendations for future work
5.2.1 Deeper Neural Network for Autoencoder Model
The current model is only able to detect anomalous events with an obvious object of interest, not events with a smaller one. To overcome this, more layers should be added to both the spatial and the temporal parts of the autoencoder, allowing the model to reach deeper into the image and learn the features in every pixel more precisely and accurately. A sketch of such a deeper model follows.
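The following Keras sketch shows one way such a deeper spatiotemporal autoencoder could be arranged; the frame resolution, volume length, filter counts and kernel sizes are illustrative assumptions, not the architecture actually trained in this project.

from keras.models import Sequential
from keras.layers import TimeDistributed, Conv2D, Conv2DTranspose, ConvLSTM2D

model = Sequential([
    # spatial encoder, one layer deeper than a shallow two-layer baseline
    TimeDistributed(Conv2D(128, (11, 11), strides=4, padding='same',
                           activation='relu'),
                    input_shape=(8, 224, 224, 1)),  # 8-frame greyscale volumes
    TimeDistributed(Conv2D(64, (5, 5), strides=2, padding='same', activation='relu')),
    TimeDistributed(Conv2D(32, (3, 3), strides=1, padding='same', activation='relu')),
    # temporal encoder-decoder: extra ConvLSTM2D layers model motion more deeply
    ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True),
    ConvLSTM2D(16, (3, 3), padding='same', return_sequences=True),
    ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True),
    # spatial decoder mirrors the encoder to reconstruct the input volume
    TimeDistributed(Conv2DTranspose(64, (3, 3), strides=1, padding='same', activation='relu')),
    TimeDistributed(Conv2DTranspose(128, (5, 5), strides=2, padding='same', activation='relu')),
    TimeDistributed(Conv2DTranspose(1, (11, 11), strides=4, padding='same', activation='sigmoid')),
])
model.compile(optimizer='adam', loss='mse')  # reconstruction error drives the regularity score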
5.2.2 Train Model with Larger Dataset
The current training dataset consists of only around 30 minutes of normal-scene footage. To enhance the performance of the model, it should be trained on a larger and longer dataset containing more normal traffic scenes. Training with more data allows the model to learn a wider variety of normal traffic scenes and reduces the number of false alarms triggered.
5.2.3 Enhance Data Augmentation Technique
The current data augmentation technique applies temporal strides of 1, 2, 3 and 4 to increase the size of the dataset, but this does not increase the number of distinct image frames. To augment the frames themselves, the images can be blurred, and rotated and translated by around 1 to 2 degrees in every direction, as sketched below. Rotation and translation also cover situations such as strong wind, where a shaky camera shifts the view by approximately 1 to 2 degrees as well. Augmenting the image frames indirectly increases the number of volumes in the training dataset.
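A minimal per-frame sketch of these augmentations using SciPy is shown below; the blur strength and the interpretation of the 1-to-2-degree translation as a shift of a few pixels are assumptions for illustration.

import numpy as np
from scipy.ndimage import gaussian_filter, rotate, shift

def augment_frame(frame, rng):
    # Apply a mild blur, a small rotation and a small translation to one frame.
    out = gaussian_filter(frame, sigma=rng.uniform(0.0, 1.0))
    out = rotate(out, angle=rng.uniform(-2.0, 2.0), reshape=False, mode='nearest')
    out = shift(out, shift=(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)),
                mode='nearest')
    return out

rng = np.random.default_rng(0)
frame = np.zeros((224, 224))  # placeholder; real frames come from the video volumes
augmented = augment_frame(frame, rng)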
5.2.4 Dataset Expansion Covering Different Natural Situation
The current model can only handle normal traffic situations with minimal or no natural events. In nature, the sun rises and sets, and rain, thunderstorms and the like occur, producing morning, afternoon, night and weather-change scenarios. At night, drivers switch on their headlights, which strike the road surface, and the street lights come on as well, generating far more noise than the normal daytime situation.
Figure 5.1: Night Traffic Scene
In addition, on rainy days the camera view is filled with falling water and cannot "see" the situation clearly, generating even more noise. In the more unfortunate case where a water droplet lands on the camera lens, it leaves a watermark that blocks the camera view as well.
In short, the dataset should be expanded to include morning, afternoon, night and weather-change scenes in order to improve the model's capability to handle the various types of practical natural issues and events.
5.2.5 Training Approach for Handling Different Situation
To improve the model by training it on new and different situations, however, the training hyperparameters need to be reconfigured. In Keras, callback functions such as LearningRateScheduler and ReduceLROnPlateau can be used. With LearningRateScheduler, the initial learning rate decreases by a defined factor after each training epoch, whereas ReduceLROnPlateau only decreases the learning rate by a defined factor once the loss has plateaued and the model has stopped improving for n epochs. The learning rate should also be initialized smaller than in the original training run, as sketched below. Using a learning rate callback helps prevent the important knowledge learned previously from being overwritten by the new information.
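A minimal Keras sketch of both callbacks is shown below; the decay factor, patience and learning rate values are illustrative assumptions rather than tuned settings.

from keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

# Option 1: decay the learning rate by a fixed factor after every epoch,
# starting from a rate smaller than the one used for the original training.
step_decay = LearningRateScheduler(lambda epoch: 1e-5 * (0.9 ** epoch))

# Option 2: only reduce the learning rate once the loss has stopped
# improving for `patience` consecutive epochs.
plateau = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=5, min_lr=1e-7)

# Hypothetical fine-tuning call on volumes from the new scenario:
# model.fit(new_volumes, new_volumes, epochs=20, callbacks=[plateau])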
REFERENCES
Anon., 2016. CCTV Facial Recognition & Video Analytics Software Systems - Essex & UK. [Online] Available at: http://www.clearview-communications.com/cctv/facial-recognition-video-analytics [Accessed 4 July 2016].
Butler, Z., Corke, P., Peterson, R. and Rus, D., 2006. From Robots to Animals: Virtual
Fences for Controlling Cattle. USA, Computer Science and Artificial Intelligence
Laboratory, MIT, pp. 1-24.
Chollet, F., 2016. Building Autoencoder in Keras. [Online]
Available at: https://blog.keras.io/building-autoencoders-in-keras.html
[Accessed 4 August 2016].
Collier, A., 2015. Making Sense of Logarithmic Loss. [Online]
Available at: https://www.r-bloggers.com/making-sense-of-logarithmic-loss/
[Accessed 29 July 2016].
Dalli, A., n.d. Intelligent Traffic Scene Analysis. [Online]
Available at: http://traffiko.com/press/articles/intelligent-traffic-scene-analysis/
[Accessed 7 July 2016].
Hasan, M. et al., 2016. Learning Temporal Regularity in Video Sequences. UC Riverside, arXiv, pp. 1-40.
Honovich, J., 2008. Top 3 Problems Limiting the Use and Growth of Video Analytics.
[Online] Available at: https://ipvm.com/reports/top-3-problems-limiting-the-use-and-
growth-of-video-analytics [Accessed 1 July 2016].
Keras, 2016. [Online] Available at: keras.io [Accessed 11 August 2016].
Kohn, A., 2014. Brain Science: Focus-Can You Pay Attention?. [Online]
Available at: http://www.learningsolutionsmag.com/articles/1440/brain-science-
focuscan-you-pay-attention [Accessed 25 July 2016].
Libav, 2016. [Online] Available at: https://libav.org/avconv.html [Accessed 28 July
2016].
Narciso, G., 2014. The Evolution of Video Analytics: Past Failures to Accurate Crime
Preventing Tool. [Online]
Available at: http://avigilon.com/news/innovation/the-evolution-of-video-analytics-
past-failures-to-accurate-crime-preventing-tool/ [Accessed 2 July 2016].
Patraucean, V., Handa, A. and Cipolla, R., 2016. Spatio-Temporal Video Autoencoder with Differentiable Memory. [Online] Available at: https://arxiv.org/pdf/1511.06309.pdf [Accessed 2 August 2016].
Tran, D. et al., 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. [Online] Available at: https://arxiv.org/pdf/1604.04574v1.pdf [Accessed 2 January 2017].