
Security and Communication Networks, Volume 2018, Article ID 9327215, 24 pages. https://doi.org/10.1155/2018/9327215

Research Article
Deep Learning Approaches for Predictive Masquerade Detection

Wisam Elmasry,1 Akhan Akbulut,2 and Abdul Halim Zaim1

1Department of Computer Engineering, Istanbul Commerce University, Istanbul, Turkey
2Department of Computer Engineering, Istanbul Kultur University, Istanbul, Turkey

Correspondence should be addressed to Wisam Elmasry; wisamelmasry@istanbulticaret.edu.tr

Received 21 March 2018; Accepted 24 June 2018; Published 1 August 2018

Academic Editor: Mamoun Alazab

Copyright © 2018 Wisam Elmasry et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In computer security, masquerade detection is a special type of intrusion detection problem. Effective and early intrusion detection is a crucial factor for computer security. Although considerable work has been focused on masquerade detection for more than a decade, achieving a high level of accuracy and a comparatively low false alarm rate is still a big challenge. In this paper, we present a comprehensive empirical study in the area of anomaly-based masquerade detection using three deep learning models, namely, Deep Neural Networks (DNN), Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN), and Convolutional Neural Networks (CNN). In order to surpass previous studies on this subject, we used three UNIX command line-based datasets, with six variant data configurations implemented from them. Furthermore, static and dynamic masquerade detection approaches were utilized in this study. In the static approach, DNN and LSTM-RNN models are used along with a Particle Swarm Optimization-based algorithm for their hyperparameters selection. On the other hand, a CNN model is employed in the dynamic approach. Moreover, twelve well-known evaluation metrics are used to assess model performance in each of the data configurations. Finally, intensive quantitative and ROC curves analyses of results are provided at the end of this paper. The results not only show that deep learning models outperform all traditional machine learning methods in the literature but also prove their ability to enhance masquerade detection on the used datasets significantly.

1. Introduction

In the computer security domain, a masquerader is defined as an intruder seeking to mimic a genuine client. A masquerade attack takes place when a masquerader gets unauthorized access to a legitimate user's information by using his legitimate access credentials. These attacks are considered to be among the most serious threats to computer security. The most effective way to prevent such attacks is using intrusion detection systems (IDSs), which can provide monitoring for all users and search for any abnormal conduct [1].

Computer security design incorporates two common approaches of IDSs: signature-based detection and anomaly-based detection. Signature-based detection, also called misuse detection, is valuable when the masquerade attack signature is already known. Alternatively, anomaly-based detection can be used for either known or unknown masquerade attacks. This advantage makes the anomaly-based detection approach popular, and a vast amount of prior studies has been published on this topic in the last decade [2]. The main idea behind the anomaly-based detection approach is profiling the user behavior by collecting a variety of information about each user and then using this information to create a profile for each user depending on some characteristics. When the system is used, a security check occurs that compares the recent activities done by the user with the original profile. If the user behavior deviates from the normal existing profile, then the session is classified as a possible masquerade attack. There are many anomaly-based detection techniques in use, but among them, machine learning methods are the most commonly used approaches due to their ability to learn from data and then distinguish between normal and malicious users [3].

Despite the popularity of using traditional (shallow) machine learning methods for classification tasks, these methods have many deficiencies that need to be addressed, such as the perspective of full feature representation, the complexity of the problem, and the limitation to static classification applications [4]. In 2006, a new concept of representation learning based on Artificial Neural Networks, called deep learning, was put forward. Deep learning is considered a class of machine learning techniques that has, in hierarchical architectures, many layers of information processing stages for pattern recognition or classification. Besides overcoming the former deficiencies of shallow machine learning methods, it has recently achieved great success in many research fields. The main advantages of deep learning can be summarized as its practicability, its ability to perform unsupervised feature learning or extraction from datasets, and its strong self-learning capability [5]. There are four typical models of deep learning, namely, Autoencoder (AE), Deep Belief Networks (DBN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). Because of its success and stability, deep learning has been actively and continually used in a wide range of applications nowadays, such as computer vision, natural language processing, and intrusion detection systems [4, 5].

To our knowledge, none of the previous studies in the area of masquerade detection used deep learning to utilize its great capabilities and various learning models. The aim and contribution of this research are sixfold, as follows: (i) we performed a comprehensive empirical study which investigates the effectiveness of three binary classification deep learning models to detect masqueraders' attacks; (ii) this is the first study that uses three famous UNIX command line datasets with their different (six) data configurations and compares between them; (iii) we proposed a Particle Swarm Optimization-based algorithm for DNN hyperparameters selection; (iv) we carried out our experiments on all data configurations using both static and dynamic masquerade detection approaches; (v) we assessed the performance of the used deep learning models using twelve well-known evaluation metrics, Wilcoxon and Friedman statistical tests, and ROC analysis; (vi) we made comparisons between the deep learning models' results and the best results of the traditional machine learning methods that have been published in the literature in the field of masquerade detection.

The rest of this paper is organized as follows. Section 2 reviews the related work that has been published previously in the area of masquerade detection using traditional machine learning methods and UNIX command line datasets. Then, Section 3 describes the UNIX command line datasets and their data configurations in detail. Section 4 presents a Particle Swarm Optimization-based algorithm to select the hyperparameters of Deep Neural Networks (DNN). Section 5 shows how our experiments are established and which deep learning models are used, and Section 6 presents the evaluation metrics used and analyzes the obtained experimental results. Finally, Section 7 presents our conclusions and possible future work.

2. Related Work

Masquerade detection has been actively researched in the last decade due to its significance and the vulnerability it poses to the computer security area. For the sake of brevity and to restrict the scope of this study, we have principally focused on anomaly-based masquerade detection using machine learning approaches and well-known UNIX command line-based datasets in the literature.

Anomaly-based masquerade detection was first introduced by Schonlau et al. [7], who proposed a UNIX command line-based dataset called SEA. They also utilized various statistical methods on the SEA data configuration and compared the results. In a short time, the SEA dataset became very popular in the field of anomaly-based masquerade detection techniques. T. Okamoto et al. [8] presented an immunity-based Hidden Markov Model on the SEA data configuration, and they got a 60% Hit and a 1% False Alarm Rate (FAR). Naive Bayes is a famous classifier that works well with text classification tasks. It was first applied on the SEA data configuration by Roy A. Maxion and Tahlia N. Townsend in 2002 [9] with two models, one with updating the user's profile (Hit=61.5%, FAR=1.3%) and the other with no updating (Hit=66.2%, FAR=4.6%). Moreover, they proposed a new data configuration from the SEA dataset, named SEA 1v49, and also tested the Naive Bayes classifier with updating on the SEA 1v49 data configuration, where they had a 62.8% Hit and a 4.6% FAR. K. Wang et al. in [10] implemented on the SEA data configuration a Naive Bayes classifier (Hit=70%, FAR=2%) and a One-Class Support Vector Machine (OCSVM) model (Hit=70%, FAR=4%). In the study [11], K. H. Yung presented a Naive Bayes classifier with updating and feedback, which was applied to the SEA data configuration (Hit=76%, FAR=2%). He developed his previous work and proposed a self-consistent Naive Bayes model with updating on the SEA data configuration in 2004 [12]. He had better results and increased the Hit to 79%, but the FAR was still 2%.

Support Vector Machine (SVM) is also a well-known machine learning method that is used for both classification and regression. Chen and Aritsugi introduced an SVM-based method for masquerade detection with online updating using an Eigen Co-occurrence Matrix, which was applied to the SEA data configuration [13]. They tested their proposed method for One-Class (Hit=62.77%, FAR=6%) as well as for Two-Class (Hit=72.24%, FAR=3%) classification models. In 2006, Z. Li et al. extracted the user behavior's principal features from a Correlation Eigen Matrix using Principal Component Analysis (PCA); then they fed these features to an SVM-based masquerade detection system on the SEA data configuration [14]. They got a very good result, with Hit=82.6% and FAR=3%. H. S. Kim and S. D. Cha performed an empirical study in the field of masquerade detection using an SVM classifier with a voting engine [15]. They tested their SVM classifier on two UNIX command line-based datasets, namely, the SEA dataset and the Greenberg dataset [16], the latter of which was proposed by Greenberg in 1988. For the SEA dataset, they applied their SVM classifier on two different data configurations, namely, the SEA data configuration (Hit=80.1%, FAR=9.7%) and the SEA 1v49 data configuration (Hit=94.8%, FAR=0%). In addition to that, they applied their SVM classifier on two different data configurations for the Greenberg dataset, namely, the Greenberg Truncated and Greenberg Enriched data configurations, which were proposed by Maxion [17]. For the Greenberg Truncated data configuration they had Hit=71.1% and FAR=6%; meanwhile, they had Hit=87.3% and FAR=6.4% for the Greenberg Enriched data configuration.

Table 1: Best results of the related works.

Model                   | Dataset   | Configuration       | Hit (%) | FAR (%)
HMM                     | SEA       | SEA                 | 60      | 1
Naive Bayes             | SEA       | SEA                 | 79      | 2
                        | SEA       | SEA 1v49            | 62.8    | 4.6
                        | Greenberg | Greenberg Truncated | 70.9    | 4.7
                        | Greenberg | Greenberg Enriched  | 82.1    | 5.7
Conditional Naive Bayes | SEA       | SEA                 | 84      | 8.8
                        | SEA       | SEA 1v49            | 90.7    | 1
                        | Greenberg | Greenberg Enriched  | 84.13   | 9.4
                        | PU        | PU Enriched         | 84      | 8
SVM                     | SEA       | SEA                 | 82.6    | 3
                        | SEA       | SEA 1v49            | 94.8    | 0
                        | Greenberg | Greenberg Truncated | 71.1    | 6
                        | Greenberg | Greenberg Enriched  | 87.3    | 6.4
                        | PU        | PU Enriched         | 60      | 2
Tree-based              | PU        | PU Enriched         | 85      | 10

Table 2: Datasets and their characteristics.

Dataset   | Host's Platform | No. of Users | Audit Format  | Enriched | Contaminated | Sessions | Real Masquerades | Year
SEA       | Unix            | 50           | Unix Commands | No       | Yes          | No       | No               | 2001
Greenberg | Unix            | 168          | Unix Commands | Yes      | No           | Yes      | No               | 1988
PU        | Unix            | 8            | Unix Commands | Yes      | No           | Yes      | No               | 1997

In 2007, Yang et al. presented a One-Class SVM with a string kernel classifier to detect masquerade attacks [18]. They tested their classifier on two UNIX command line-based datasets, namely, the SEA dataset and the PU dataset [19], the latter of which was proposed by Lane and Brodley in 1997. For the SEA dataset, they applied their model on the SEA data configuration (Hit=62%, FAR=1.5%), and for the PU dataset, they applied their model on the PU Enriched data configuration (Hit=60%, FAR=2%), which was proposed in [19].

In the study [17], a Naive Bayes model with updating of users' profiles was introduced in 2003 on both the Greenberg Truncated and Greenberg Enriched data configurations, where the Greenberg Truncated data configuration gave a Hit=70.9% and a FAR=4.7%, and the Greenberg Enriched data configuration gave a Hit=82.1% and a FAR=5.7%. Gebski and Wong [20] presented a tree-based model for masquerade detection on the PU Enriched data configuration (Hit=85%, FAR=10%). Reddy et al. proposed a conditional Naive Bayes classifier to detect masquerades [21]. They tested their classifier on three different UNIX command line-based datasets, namely, the SEA, Greenberg, and PU datasets. For the SEA dataset, they applied their classifier on two data configurations, namely, the SEA data configuration (Hit=84%, FAR=8.8%) and the SEA 1v49 data configuration (Hit=90.7%, FAR=1%). For the Greenberg dataset, they applied their classifier on the Greenberg Enriched data configuration (Hit=84.13%, FAR=9.4%). Finally, they tested their classifier on the PU Enriched data configuration, and they got a Hit=84% and a FAR=8%. Table 1 presents a summarization of the best results of the previous works above in terms of Hit percentage for each dataset. As we can notice from Table 1, developing masquerade detection models with higher Accuracy and Hit as well as lower FAR values is still a big challenge.

3. Datasets and Configurations

This section describes the datasets that we used in our study, the data configurations, and the methodology of training and testing as well. Indeed, there are various mechanisms that could be used to collect information about each user to model his behavior and then build his normal profile, such as user command lines history, graphical user interface (GUI), user file system navigation, and system calls at the operating system level. In this paper, we selected three datasets based on the UNIX command line history of users, namely, SEA, Greenberg, and PU. Besides being free and publicly available on the Internet, they are the most commonly used datasets in the anomaly-based masquerade detection area, so our results can easily be compared to previous ones. Table 2 shows the datasets and their characteristics.


3.1. SEA Dataset. Recently published papers that focused on the masquerade detection area used this dataset. SEA (Schonlau Et Al.) is a free UNIX command line-based dataset [7]. The authors used the UNIX acct audit tool to collect commands from 50 different users for several months. The SEA dataset contains a set of 15,000 commands for every user, and these commands contain only the command names issued by that user. For each user, the set of 15,000 commands is divided into 150 blocks, each with 100 commands. The first 50 blocks for each user are considered genuine and used as a training set. The remaining 100 blocks of each user are considered as a test set. Some of the test blocks are contaminated randomly with data of other users, i.e., each user has a varying number of masquerader blocks in his test set, from 0 to 24 blocks. Two associated data configurations have been used with this dataset in the literature: SEA and SEA 1v49.

3.1.1. SEA. This data configuration was proposed in the study [7]. A separate classifier is built for each of the 50 users. We trained each classifier to build two profiles: one profile for self-behavior, using the first 50 blocks of the particular user, and the other profile for non-self-behavior, using the (49 × 50) training blocks of the other 49 users. The test set of each user will be the same as described in Section 3.1.

3.1.2. SEA 1v49. In this configuration, we followed the same methodology proposed in research [9]. A classifier is built for each user and trained only with the first 50 training blocks of its data. On the other hand, the test set for each user consists of the first 50 training blocks of each of the other 49 users, resulting in 2450 masquerade blocks, in addition to its original normal blocks, which vary between 76 and 100 blocks.
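To make the two SEA configurations concrete, the following minimal Python sketch outlines how the training and test blocks of one user could be assembled (the file layout, file names, and helper functions are our own illustrative assumptions, not part of the SEA distribution):

def load_blocks(path, block_size=100):
    # Read a user's command file (one command name per line) and split it
    # into consecutive blocks of `block_size` commands.
    with open(path) as f:
        commands = f.read().split()
    return [commands[i:i + block_size]
            for i in range(0, len(commands), block_size)]

def sea_sets(user_id, all_users):
    # SEA: self profile from the user's first 50 blocks, non-self profile
    # from the 49 other users' training blocks (49 x 50 blocks); the test
    # set is the user's remaining 100 (possibly contaminated) blocks.
    own = load_blocks("User%d" % user_id)
    self_train = own[:50]
    nonself_train = [b for u in all_users if u != user_id
                     for b in load_blocks("User%d" % u)[:50]]
    test = own[50:150]
    return self_train, nonself_train, test

def sea_1v49_sets(user_id, all_users):
    # SEA 1v49: train only on the user's own 50 blocks; the test set adds
    # the other 49 users' training blocks (2450 masquerade blocks) to the
    # user's own normal test blocks.
    own = load_blocks("User%d" % user_id)
    train = own[:50]
    masquerade_test = [b for u in all_users if u != user_id
                       for b in load_blocks("User%d" % u)[:50]]
    return train, masquerade_test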

3.2. Greenberg Dataset. This dataset was proposed in [16] and is widely used in previous works. It contains commands collected from 168 UNIX users that used the csh shell. Users of this dataset are considered to be members of one of the following four groups: novice programmers, experienced programmers, computer scientists, and nonprogrammers. This dataset is enriched, i.e., it has sessions for each user, including information about the start and end time of the session, the working directory, command names, command parameters, command aliases, and an error flag. Two associated data configurations have been used with this dataset in the literature: Greenberg Truncated and Greenberg Enriched.

3.2.1. Greenberg Truncated. In this configuration, we followed the same methodology conducted by [17]. First, we extracted the truncated command lines from the Greenberg dataset, which contain only the command names. Next, from the 168 users available in the Greenberg dataset, we randomly selected 50 users who have between 2000 and 5000 commands to act as normal users. Then, we divided the commands of each of the 50 users into blocks, each with 10 commands. The first 100 blocks of each user form his training set, whereas the next 100 blocks are used as a validation of self-behavior in his test set. After that, we randomly selected an additional 25 users from the remaining 118 users to act as masqueraders. Then, for each of the 50 normal users, we randomly selected 30 blocks from the masqueraders' data and inserted them at random positions in his test set, which results in a total of 130 blocks for testing.
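A minimal sketch of this test-set contamination step might look as follows (the input names and the 0/1 labeling of normal and masquerade blocks are our own illustrative assumptions):

import random

def greenberg_test_set(self_test_blocks, masquerader_blocks):
    # 100 self-test blocks (label 0) plus 30 randomly chosen masquerade
    # blocks (label 1) inserted at random positions: 130 blocks in total.
    test = [(b, 0) for b in self_test_blocks]
    for block in random.sample(masquerader_blocks, 30):
        test.insert(random.randrange(len(test) + 1), (block, 1))
    return test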

3.2.2. Greenberg Enriched. This configuration has the same methodology explained for Greenberg Truncated, with only one difference: for this data configuration, we extracted only the enriched command lines from the Greenberg dataset. An enriched command line means a concatenation of the command name and the command parameters entered by the user, together with any alias employed. As for the Greenberg Truncated data configuration described above, the Greenberg Enriched data configuration has, for each of the 50 normal users, 100 blocks for training and 130 blocks for testing.

3.3. PU Dataset. The Purdue University (PU) dataset was proposed in [19]. It contains sanitized commands collected from 8 different users at Purdue University over the course of up to 2 years. This dataset is enriched, which means that it contains, in addition to command names, command parameters, flags, and shell meta-characters. Furthermore, this dataset has sessions for each of the 8 users. In addition to that, the data of each user is processed into a token stream. A token here means either a command name or a command parameter. Two associated data configurations have been used with this dataset in the literature: PU Truncated and PU Enriched.

3.3.1. PU Truncated. For this configuration, we followed the same methodology used in [19]. First, we extracted only the truncated tokens from the PU dataset, i.e., the tokens that contain only command names. Next, for each of the 8 users available in the PU dataset, we divided his data into blocks, each of 10 tokens. Then, the first 150 blocks of each user are considered as his training set. After that, the next 50 blocks of each user are used as a validation of self-behavior in his test set. To simulate masquerade activities, we added to each user the other seven users' testing data (7 × 50), which results in a total of 400 blocks of testing for each of the 8 users.

3.3.2. PU Enriched. This configuration has the same methodology explained for PU Truncated, with only one difference: for the PU Enriched data configuration, we extracted here only the enriched tokens, i.e., all tokens from the PU dataset. As for the PU Truncated data configuration described in Section 3.3.1, the PU Enriched data configuration has, for each of the 8 users, 150 blocks for training and 400 blocks for testing. Table 3 summarizes all details about the data configurations.

4. DNN Hyperparameters Selection

In this section, we will present a Particle Swarm Optimization-based algorithm to select the hyperparameters of Deep Neural Networks (DNN). This algorithm will help us to proceed in our experiments to construct a DNN for masquerade detection, as will be explained in Section 5.1. A DNN is a multilayer Artificial Neural Network with many hidden layers. The weights of a DNN are fully connected, i.e., every neuron at any particular layer is connected to all neurons of the higher-order layer that is located adjacently to that particular layer [4].

Table 3: The structure of the used data configurations.

Characteristics                        | SEA    | SEA 1v49  | Greenberg Truncated | Greenberg Enriched | PU Truncated | PU Enriched
Number of users                        | 50     | 50        | 50    | 50    | 8    | 8
Block size                             | 100    | 100       | 10    | 10    | 10   | 10
Blocks per user: training set          | 2500   | 50        | 100   | 100   | 150  | 150
Blocks per user: test set              | 100    | 2526-2550 | 130   | 130   | 400  | 400
Blocks per user: total                 | 2600   | 2576-2600 | 230   | 230   | 550  | 550
Blocks for all users: training set     | 125000 | 2500      | 5000  | 5000  | 1200 | 1200
Blocks for all users: test set         | 5000   | 127269    | 6500  | 6500  | 3200 | 3200
Blocks for all users: total            | 130000 | 129769    | 11500 | 11500 | 4400 | 4400
Training set distribution: normal      | 2500   | 2500      | 5000  | 5000  | 1200 | 1200
Training set distribution: masquerader | 122500 | 0         | 0     | 0     | 0    | 0
Training set distribution: total       | 125000 | 2500      | 5000  | 5000  | 1200 | 1200
Test set distribution: normal          | 4769   | 4769      | 5000  | 5000  | 400  | 400
Test set distribution: masquerader     | 231    | 122500    | 1500  | 1500  | 2800 | 2800
Test set distribution: total           | 5000   | 127269    | 6500  | 6500  | 3200 | 3200

Figure 1: The basic structure of a typical DNN.

The information in a DNN is propagated in a feed-forward manner, that is, from inputs to outputs via hidden layers. Figure 1 depicts the basic structure of a typical DNN.

DNNs are widely used in various machine learning tasks. In addition to that, they have proved their ability to surpass most of the machine learning techniques in terms of performance [22]. However, the performance of any DNN relies on the selection of the values of its hyperparameters. DNN hyperparameters are defined as a set of critical parameters that control the architecture, behavior, and performance of that DNN in the underlying machine learning task. Indeed, there are two kinds of such hyperparameters: global parameters and layer-based parameters. The global parameters are those that define the general behavior of the DNN, such as the learning rate, number of epochs, batch size, number of layers, and the used optimizer. On the other hand, layer-based parameters' values are dependent on each layer in the DNN. Examples of layer-based parameters are, but not limited to, the type of layer, the weight initialization method, the activation function, and the number of neurons.

The problem is that these hyperparameters vary from task to task, and they must be set before the training process. One familiar solution to overcome this problem is to find an expert who is conversant with the underlying machine learning task to tune the DNN hyperparameters precisely. Unfortunately, such an expert is not available in all cases. Another possible solution is to adjust these hyperparameters manually in a trial-and-error manner. This can be handled by searching the space of hyperparameters by executing either a grid search or a random search [23, 24]. A grid search is performed upon defined ranges of hyperparameters, where those ranges are identified previously depending on prior knowledge of the underlying task. After that, the user picks values of hyperparameters from the predefined ranges consecutively and tests the performance of the DNN on the training set. When all possible combinations of hyperparameters values have been tested, the best combination is selected to configure the DNN and test it on the test set. Random search is similar to grid search, but instead of picking hyperparameters values in a methodical manner, the user selects hyperparameters values from those predefined ranges randomly. In 2012, Snoek et al. proposed a hyperparameters selection method based on Bayesian optimization [25]. In this method, the user improves his knowledge of selecting hyperparameters by using the information gained from any given experiment to decide how to adjust the hyperparameters for the next experiment. Despite the good results that have been obtained by the grid, random, and Bayesian optimization searches in some cases, in general, the complexity and large search space of the DNN hyperparameters values make such manual algorithms infeasible and too exhausting a searching process.

Evolutionary Algorithms (EAs) are metaheuristic algorithms which perform excellently at finding the global optima of a nonlinear function, especially when there are multiple local minima or maxima. EAs are considered very promising algorithms for solving the problem of DNN parameterization automatically. In the literature, a lot of studies have been proposed recently aiming at using EAs to optimize DNN hyperparameters in order to gain as high an accuracy value as possible. The Genetic Algorithm (GA), which is one of the most famous EAs, has been used to optimize the network parameters, and the Taguchi method is applied between the crossover and mutation operators, including initial weights definition [26]. GAs are also used in the pretraining step prior to the supervised step based on a multiclass classification task [27]. Another approach using GA to reduce the training time has been presented in [28]. The GA is used to enhance Deep Neural Networks by evolving a neural network's weights [29]. An automated GA-based approach has been proposed in [30] that optimizes DNN hyperparameters for malware classification tasks. Moreover, Particle Swarm Optimization is also one of the most well-known and popular EAs. Lorenzo et al. used PSO and proposed two approaches, the first sequential and the second parallel, to optimize the hyperparameters of any DNN [31, 32]. Then, Nalepa and Lorenzo formally proved the convergence abilities of the former two approaches and tested the sequential and parallel approaches on a single workstation and a cluster, respectively [33]. Finally, F. Ye proposed in 2017 an automatic PSO-based algorithm to select DNN hyperparameters in large scale and high dimensional data [34]. Thus, we decided to use PSO to enable us to select the hyperparameters of the DNN automatically. Then, in Section 5.1, we will explain how to adapt this algorithm for the static classification experiments used in a masquerade detection scenario. Section 4.1 introduces a necessary and brief preface reviewing how the standard PSO works. Then, the rest of this section presents our proposed PSO-based algorithm to optimize DNN hyperparameters.

4.1. Particle Swarm Optimization. Particle Swarm Optimization (PSO) is a metaheuristic algorithm for optimizing nonlinear functions in a continuous search space. It was proposed by Eberhart and Kennedy in 1995 [35]. PSO tries to mimic the social behavior of animals. The swarm concept is a set of many members, which are called particles. The number of particles in the swarm is an integer value denoted by S and called the swarm size. Every particle in the particular swarm has two vectors of length N, where N is the number of the problem's defined variables (dimensions). The first vector is called the position vector, denoted by P, which identifies the current position of that particle in the search space of the problem. Each position vector can be considered as a candidate solution of the problem. The second vector is called the velocity vector, denoted by V, which determines both the speed and direction of that particle in the search space of the problem at the next iteration. During the execution of PSO, another two vectors should be stored at every iteration. The first is called the personal best vector, denoted by P_best^i, which indicates the best position of the i-th particle in the swarm that has been explored so far. Each particle in the swarm has its personal best vector independent of the other particles, and it is updated at each iteration. The second vector is the global best vector, denoted by G_best, which indicates the best position that has been found over the swarm so far. There is a single global best vector for all particles in the swarm, and it is updated at every iteration. The personal best vector can be seen as the cognitive knowledge of the particle, whereas the global best vector represents the social knowledge of the swarm. Mathematically, for each particle i in the swarm S at each iteration t, the velocity V and position P vectors are updated to the next iteration t+1 according to (1) and (2), respectively:

$$V^{i}_{t+1} = W V^{i}_{t} + C_{1} r_{1}(t) \left(P^{i}_{best} - P^{i}_{t}\right) + C_{2} r_{2}(t) \left(G_{best} - P^{i}_{t}\right) \tag{1}$$

$$P^{i}_{t+1} = P^{i}_{t} + V^{i}_{t+1} \tag{2}$$

Here, W is the inertia weight constant, which controls the impact of the velocity of the particle at the current iteration on the next iteration, so that the speed and direction of the particle are adjusted in order not to let the particle get outside the search space of the problem. Meanwhile, C1 and C2 are constants known as acceleration coefficients, and r1 and r2 are random values uniformly distributed in [0, 1]. At the beginning of every iteration, new values of r1 and r2 are computed randomly, and they are constants for all particles in the swarm at that iteration. The goal of using the C1, C2, r1, and r2 constants is to scale both the cognitive knowledge of the particle and the social knowledge of the swarm on the velocity changes, so the new position vectors of all particles will approach the optimal solution of the problem accordingly. Figure 2 depicts the flowchart of the standard PSO.

Figure 2: The flowchart of the standard PSO.

In brief, the standard PSO works as follows. First, the user enters some required inputs like the swarm size (S), the dimensions of the particles (N), the acceleration constants (C1, C2), the inertia weight constant (W), the fitness function (F) to score particle performance in the problem domain, and the maximum number of iterations (tmax). Next, PSO randomly initializes the position and velocity vectors with the specified dimensions for all particles in the swarm. Then, PSO initializes the personal best vector for each particle in the swarm with the specified dimensions and sets them to a very small value. Furthermore, PSO initializes the global best vector of the swarm with the specified dimensions and sets it to a very small value. PSO computes the fitness score for each particle using the fitness function and updates the personal best vectors for all particles and the global best vector of the swarm. After that, PSO starts the first iteration by computing r1 and r2 randomly and then updates the velocity and position vectors for each particle according to (1) and (2), respectively. In addition to that, PSO computes again the fitness score for each particle according to the given fitness function and updates the personal best vector for each particle if the fitness score of that particle at this iteration is bigger than the fitness score of the personal best vector of that particle (F(P_t^i) > F(P_best^i)).


Also, PSO updates the global best vector of the swarm if any of the fitness scores of the personal best vectors of the particles is bigger than the fitness score of the global best vector of the swarm (F(P_best^i) > F(G_best), i = 1 to S). Then, PSO checks the stop criterion, and if it is satisfied, PSO will output the global best vector as the optimal solution and terminate. Otherwise, PSO will proceed to the next iteration and repeat the same procedure described for the first iteration above until the stop criterion is reached.

The stop criterion is satisfied when either the training error is smaller than a predefined value (ε) or the maximum number of iterations is reached. Finally, PSO performs better than GA in terms of simplicity and generality [36]. PSO is simpler than GA because it contains only one operator and is easy to implement. Also, the generality of PSO means that PSO does not need any modifications to be applied to any optimization problem; in addition, it converges faster to the optimal solution, which decreases the computations and saves resources.

4.2. DNN Hyperparameters Selection Using PSO. The selection of the hyperparameters of a DNN can be interpreted as an optimization task; hence, the main objective is to minimize the loss function L(M, T), where M is the DNN model and T is the training set. To achieve this goal, we selected PSO to be our optimization algorithm, which outputs the vector of the optimized hyperparameters H that minimizes the loss function L after constructing the DNN model M, which is tuned by the hyperparameters H and trained on the training set T. The fitness function of our PSO-based algorithm is a function F*: R^N → R that maps a real-valued vector of hyperparameters of length N to a real-valued accuracy value of the trained DNN that is tuned by that hyperparameters vector and tested on the test set Z. In other words, our PSO-based algorithm finds the optimal hyperparameters vector among all possible combinations of hyperparameters, which yields the maximum accuracy of the trained DNN on the test set. Furthermore, to ensure the generality of our PSO-based algorithm, meaning that it is independent of the DNN that will be optimized and can be adapted easily to any classification task using a DNN, we allow the user to select which hyperparameters to use in his work. Therefore, the user of our algorithm is responsible for defining the number of the hyperparameters as well as the type and domain of each parameter. The domain of a parameter is the set of all possible values of that parameter. After that, our PSO-based algorithm will use a special built-in generator that depends on the number and domains of the defined parameters to initialize all the particles (hyperparameters vectors) in the swarm.

During the execution of the proposed algorithm, and at each iteration, a validation process is involved to validate that the updated position and velocity vectors remain appropriate to the predefined ranges of the parameters. Finally, in order to reduce computations and converge faster, two different stop conditions are checked simultaneously at the end of each iteration. The first occurs when the fitness score of the global best vector has increased by less than a threshold ε, which is specified by the user. The aim of the former condition is to guarantee that the global best vector cannot be improved further, even if the maximum number of iterations has not been reached yet. The second condition happens when the maximum number of iterations has been carried out. When either the first or the second condition is satisfied, the proposed algorithm outputs the global best vector as the optimal solution H and terminates the search process. Figure 3 shows the flowchart of our PSO-based DNN hyperparameters selection algorithm.

Figure 3: The flowchart of the proposed algorithm.

4.3. Algorithm Steps

Inputs: Number of hyperparameters (N), swarm size (S), acceleration constants (C1, C2), inertia constant (W), maximum value of velocity (Vmax), minimum value of velocity (Vmin), maximum number of iterations (tmax), evolution threshold (ε), training set (T), and test set (Z).
Output: The optimal solution H.

Procedure:

Step 1. For k ← 1 to N
    Let h_k be the k-th hyperparameter
    If the domain of h_k is continuous then
        Let B_low^k be the lower bound of h_k and B_up^k be the upper bound of h_k
        Let the user enter the lower and upper bounds of the hyperparameter h_k
    End of if

8 Security and Communication Networks

(4) Initialize P and V vectors of Sparticles each of N length

(8) For all S particles

(12) For all S particles(16) Output

Yes

Terminate

Start User

(2) Define Domains for hk

(3) Create Hyper-parameters amp velocity generator

(1) Preprocessing Phase (2) Initialization Phase (3) Evolution Phase (4) Finishing Phase

No (15) Check Stop conditions

satisfied

(1) Input N S Vmin Vmax

klarr1 to N

(5) Input T Z C1 C2 W tmax

(6) Pibest larrminusinfin i larr1 to S(7) Gbest larr minusinfin

Compute Flowast(P) and update Pibest

(9) Update Gbest

(10) tlarr1

Compute V P Flowast(P) and Pibest

(13) Update Gbest

(14) tlarrt+1

(11) Compute r1(t) and r2(t)H larr Gbest

Figure 3 The flowchart of the proposed algorithm

    Else
        Let Y_k be the set of all possible values of h_k
        Let the user enter all elements of the set Y_k
    End of else
End of for

Step 2. Let F* be the fitness function, which constructs a DNN tuned with the given hyperparameters, then trains the DNN on T and tests it on Z. Finally, F* computes the accuracy of the DNN as output.

Step 3. Let G_best be the global best vector of the swarm, of length N.
    Let GS be the best fitness score of the swarm.
    GS ← −∞

Step 4. For i ← 1 to S
    Let P_i be the position vector of the i-th particle, of length N
    Let V_i be the velocity vector of the i-th particle, of length N
    Let P_best^i be the personal best vector of the i-th particle, of length N
    Let PS_i be the fitness score of the personal best vector of the i-th particle
    For j ← 1 to N
        If the domain of h_j is continuous then
            Select h_j uniformly distributed: P_i[j] ← U(B_low^j, B_up^j)
        End of if
        Else
            Select h_j randomly: P_i[j] ← RAND(Y_j)
        End of else
        V_i[j] ← U(Vmin, Vmax)
    End of for
    P_best^i ← P_i
    Let FS_i be the fitness score of the i-th particle
    FS_i ← F*(P_i)
    PS_i ← FS_i
    If FS_i > GS then
        G_best ← P_i
        GS ← FS_i
    End of if
End of for

Step 5. Let GS_prv be the previous best fitness score of the swarm.
    GS_prv ← GS
    Let r1 and r2 be the random values in PSO
    Let t be the current iteration
    For t ← 1 to tmax
        r1 ← U(0, 1)
        r2 ← U(0, 1)
        For i ← 1 to S
            Update V_i according to (1)
            Update P_i according to (2)
            FS_i ← F*(P_i)
            If FS_i > PS_i then
                P_best^i ← P_i
                PS_i ← FS_i
            End of if
            If PS_i > GS then
                G_best ← P_best^i
                GS ← PS_i
            End of if
        End of for
        If GS − GS_prv < ε then
            Go to Step 6
        End of if

        GS_prv ← GS
    End of for

Step 6. Let H be the optimal hyperparameters vector.
    H ← G_best
    Return H and Terminate.
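The mixed handling of continuous and discrete hyperparameter domains in Steps 1 and 4, together with the validation process mentioned in Section 4.2, can be sketched in Python as follows (the domain entries follow Table 5; the helper names and the nearest-value rounding of discrete parameters are our own assumptions):

import random

domains = [
    ("continuous", (0.01, 0.9)),         # learning rate
    ("continuous", (0.1, 0.9)),          # momentum
    ("discrete",   list(range(1, 11))),  # number of hidden layers
    ("discrete",   list(range(1, 7))),   # optimizer index (1..6)
]

def init_particle(domains):
    # Step 4: sample each hyperparameter uniformly from its own domain.
    return [random.uniform(dom[0], dom[1]) if kind == "continuous"
            else random.choice(dom)
            for kind, dom in domains]

def validate(position, domains):
    # After a velocity update, clamp continuous parameters back into
    # [B_low, B_up] and round discrete parameters to the nearest valid choice.
    fixed = []
    for x, (kind, dom) in zip(position, domains):
        if kind == "continuous":
            fixed.append(min(max(x, dom[0]), dom[1]))
        else:
            fixed.append(min(dom, key=lambda v: abs(v - x)))
    return fixed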

4.4. PSO Parameters. The selection of the values of the PSO parameters (S, Vmax, Vmin, C1, C2, W, tmax, ε) is a very complex process. Fortunately, many empirical and theoretical studies have been published to solve this problem [37–40]. They introduced some recommended values of the PSO parameters which can be adopted. Table 4 shows every PSO parameter and the corresponding recommended value or range. Thus, for those parameters which have recommended ranges, we can select a value for each parameter from its range randomly and fix it as a constant during the execution of PSO.

Table 4: PSO parameters' recommended values or ranges.

Parameter | Value/Range
S         | [5, 20]
Vmin      | 0
Vmax      | 1
C1        | 2
C2        | 2
W         | [0.4, 0.9]
tmax      | [30, 50]
ε         | 0.0001

5. Experimental Setup and Models

This section explains the methodology of performing our empirical experiments as well as the description of the deep learning models which we used to detect masquerades. As mentioned in Section 3, we selected three UNIX command line-based datasets (SEA, Greenberg, PU). Each of these datasets is a collection of text files in which each text file represents a user. The text file of each user in the particular dataset contains a set of UNIX commands that are issued by that user. This reflects the fact that these datasets do not contain any real masqueraders. However, to simulate masqueraders and to use these datasets in masquerade detection, special data configurations must be implemented prior to proceeding in our experiments. According to Section 3 and its subsections, each dataset has two different types of data configurations. Therefore, we obtained six data configurations, each of which is observed separately, which yields six independent experiments for each model. Finally, masquerade detection can be applied to these data configurations by following two different main approaches, namely, static classification and dynamic classification. The two subsequent subsections present the difference between them as well as which deep learning models are exploited for each one.

5.1. Static Classification Approach. In the static classification approach, the classification task is carried out using a dataset of samples which are represented by a set of static features [30]. These static features are defined according to the nature of the task where the classification will be applied. In addition to that, the dataset samples, also called observations, are collected manually by some experts working in the field of that classification task. After that, these samples are split into two independent sets, known as the training and test sets, to train and test the selected model, respectively. The static classification approach has pros and cons as well. Although it provides a faster and easier solution, it requires a ready-to-use dataset with static features. Such a dataset might not be available in some complex classification tasks; hence, the attempt to create a dataset with static features would be a hard mission. In our work, we decided to utilize the existence of three famous UNIX command line-based datasets to implement six different data configurations. Each user in the particular data configuration has a specific number of blocks which are represented by a set of static features. Indeed, these features are the user's UNIX commands, in charge of describing the behavior of that user and later helping the classifier to detect masquerades. We decided to use two well-known deep learning models, namely, Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN), to accomplish the static masquerade detection task on the implemented six data configurations.

5.1.1. Deep Neural Networks. In Section 4, we explained in detail the DNN structure and the problem of the selection of its hyperparameters. We also proposed a PSO-based algorithm to obtain the optimal hyperparameters vector that maximizes the accuracy of the DNN on the given training and test sets. In this subsection, we describe how we utilized the proposed PSO-based algorithm and the DNN in the static masquerade detection task using the six data configurations, which are SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched. Every data configuration has its own structure and a specific number of users, as described in Section 3. So we will have six separate DNN-experiments, and each experiment will be on one of the data configurations.

The methodology of our DNN-experiments consists of four consecutive stages, which are the initialization, optimization, results extraction, and finishing stages. The first stage is to initialize all required operating parameters as well as to prepare the particular data configuration's files, in which each file represents a user in that data configuration. The user file consists of the training set followed by the test set of that user. We set all PSO parameters for all DNN-experiments as follows: S=20, Vmin=0, Vmax=1, C1=C2=2, W=0.9, tmax=30, and ε=10^-4. Then, the last step in the initialization stage is to define the hyperparameters of the DNN and their domains. We used twelve different DNN hyperparameters (N=12). Table 5 shows each DNN hyperparameter and its corresponding defined domain. All the used hyperparameters are numerical, except that the Optimizer, Layer type, Initialization function, and Activation function hyperparameters are categorical. In this case, a list of all possible values is indexed to a sequence-numbered range from 1 to the length of that list. The Optimizer list includes the elements Adagrad, Nadam, Adam, Adamax, RMSprop, and SGD.

Table 5: The used DNN hyperparameters and their domains.

Hyperparameter                      | Domain        | Description
Learning rate                       | [0.01, 0.9]   | Continuous
Momentum                            | [0.1, 0.9]    | Continuous
Decay                               | [0.001, 0.01] | Continuous
Dropout rate                        | [0.1, 0.9]    | Continuous
Number of hidden layers             | [1, 10]       | Discrete with step=1
Numbers of neurons of hidden layers | [1, 100]      | Discrete with step=1
Number of epochs                    | [5, 20]       | Discrete with step=5
Batch size                          | [100, 1000]   | Discrete with step=50
Optimizer                           | [1, 6]        | Discrete with step=1
Initialization function             | [1, 8]        | Discrete with step=1
Layer type                          | [1, 2]        | Discrete with step=1
Activation function                 | [1, 8]        | Discrete with step=1

The Layer type list contains two elements, which are Dropout and Dense. The Initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the Activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
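As an illustration of how the fitness function F* might decode such an indexed hyperparameter vector into a concrete Keras model, consider the following sketch (the output layer, the loss function, and applying the learning rate, momentum, and decay only to SGD are our own assumptions; the paper's exact construction may differ):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD

OPTIMIZERS = ["adagrad", "nadam", "adam", "adamax", "rmsprop", "sgd"]
INITS = ["zeros", "normal", "lecun_uniform", "uniform",
         "glorot_uniform", "glorot_normal", "he_uniform", "he_normal"]
ACTIVATIONS = ["linear", "softmax", "relu", "sigmoid",
               "tanh", "hard_sigmoid", "softsign", "softplus"]

def fitness(h, train_x, train_y, test_x, test_y):
    # h = (learning rate, momentum, decay, dropout rate, number of hidden
    # layers, neurons per hidden layer, epochs, batch size, optimizer index,
    # initialization index, layer-type index, activation index), per Table 5.
    (lr, momentum, decay, dropout, n_layers, n_neurons,
     epochs, batch, opt, init, layer_type, act) = h
    init_fn = INITS[int(init) - 1]
    act_fn = ACTIVATIONS[int(act) - 1]
    model = Sequential()
    model.add(Dense(int(n_neurons), input_dim=train_x.shape[1],
                    kernel_initializer=init_fn, activation=act_fn))
    for _ in range(int(n_layers) - 1):
        if int(layer_type) == 1:               # Dropout layer type selected
            model.add(Dropout(dropout))
        model.add(Dense(int(n_neurons), kernel_initializer=init_fn,
                        activation=act_fn))
    model.add(Dense(1, activation="sigmoid"))  # binary: normal vs masquerader
    optimizer = OPTIMIZERS[int(opt) - 1]
    if optimizer == "sgd":                     # lr/momentum/decay used here
        optimizer = SGD(lr=lr, momentum=momentum, decay=decay)
    model.compile(loss="binary_crossentropy", optimizer=optimizer,
                  metrics=["accuracy"])
    model.fit(train_x, train_y, epochs=int(epochs), batch_size=int(batch),
              verbose=0)
    return model.evaluate(test_x, test_y, verbose=0)[1]   # accuracy on Z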

The optimization and results extraction stages are performed once for each user in the particular data configuration; that is, they are repeated for each user U_i, i = 1, 2, ..., M, where M is the number of users in the particular data configuration D. The optimization stage starts by splitting the data of the user U_i into two independent sets T_i and Z_i, which are the training and test sets of the i-th user, respectively. The splitting process follows the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in [0, 1]. After that, we supply these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector H_i for the i-th user. In addition to that, we save a copy of the H_i values in a database in order to save time and use them again in the RNN-experiment of that particular data configuration D, as will be presented in Section 5.1.2. The results extraction stage takes place by constructing the DNN that is tuned by H_i, training the DNN on T_i, and testing the DNN on Z_i. The values of the classification outcomes True Positive (TP_i), False Positive (FP_i), True Negative (TN_i), and False Negative (FN_i) for the i-th user in the particular data configuration D are extracted and saved for further processing later.

Then, the next user is observed, and the same procedure of the optimization and results extraction stages is performed until the last user in the particular data configuration D is reached. Finally, when all users in the particular data configuration are completed, the last stage (finishing stage) is executed. The finishing stage computes the summation of all obtained TPs of all users in the particular data configuration D, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively:

$$TP = \sum_{i=1}^{M} TP_{i} \tag{3}$$

$$FP = \sum_{i=1}^{M} FP_{i} \tag{4}$$

$$TN = \sum_{i=1}^{M} TN_{i} \tag{5}$$

$$FN = \sum_{i=1}^{M} FN_{i} \tag{6}$$

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration D. The former outcomes are used to compute ten well-known evaluation metrics to assess the performance of the DNN on the particular data configuration D, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.

Figure 4: The flowchart of the DNN-experiments.
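The four stages can be summarized by the following sketch of the per-configuration experiment loop (split_user, pso_optimize, and evaluate_dnn are illustrative stand-ins for the procedures described above):

def run_dnn_experiment(users):
    # Optimization and results extraction stages, repeated per user; the
    # finishing stage sums the outcomes over all users as in (3)-(6).
    TP = FP = TN = FN = 0
    for user in users:
        T_i, Z_i = split_user(user)        # training / test sets of user i
        H_i = pso_optimize(T_i, Z_i)       # optimal hyperparameters vector
        tp, fp, tn, fn = evaluate_dnn(H_i, T_i, Z_i)
        TP, FP, TN, FN = TP + tp, FP + fp, TN + tn, FN + fn
    return TP, FP, TN, FN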

5.1.2. Recurrent Neural Networks. The Recurrent Neural Network (RNN) is a special type of the traditional feed-forward Artificial Neural Network. Unlike the traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to the other neurons of the same hidden layer. Therefore, the output of an RNN's hidden layer at any time step (t) depends on the current inputs and on the output of the hidden layer at the previous time step (t-1). In the RNN, these directed cycles allow information to circulate in the network and make the hidden layers the storage unit of the whole network [41]. The important characteristics of the RNN are the capability to have memory and to generate periodical sequences.

Despite that, the conventional RNN structure described above has a serious problem, especially when the RNN is trained using the back-propagation technique.



The problem is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops. On the other hand, the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem of the conventional RNN limited its use to only short-term memory tasks. To solve this problem, a new architecture of RNN was proposed by Hochreiter and Schmidhuber [43], known as Long Short-Term Memory (LSTM). LSTM uses a new structure called a memory cell that is composed of four parts, which are an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter into the memory cell or block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell, to alter the current state of the memory cell, or to prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Besides overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].

Figure 5: The structure of an LSTM cell [6].
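A minimal Keras sketch of such an LSTM-RNN binary classifier, with LSTM cells in place of plain hidden neurons, might look as follows (the layer sizes and optimizer shown here are illustrative; in the experiments they are taken from the stored hyperparameters vector H_i):

from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_lstm_rnn(block_size, n_hidden_layers=2, n_cells=64):
    # Each command block is treated as a sequence of `block_size` encoded
    # commands with one feature per time step.
    model = Sequential()
    model.add(LSTM(n_cells, input_shape=(block_size, 1),
                   return_sequences=(n_hidden_layers > 1)))
    for k in range(1, n_hidden_layers):
        model.add(LSTM(n_cells, return_sequences=(k < n_hidden_layers - 1)))
    model.add(Dense(1, activation="sigmoid"))   # normal vs masquerader
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model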

In this study, we used the LSTM-RNN model to perform a static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we will have six separate LSTM-RNN-experiments, and each experiment will be on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all the given data configuration's files by converting all blocks from text to numerical values and then normalizing them in [0, 1]. Next, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets T_i and Z_i, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. After that, we retrieved the stored optimized hyperparameters vector of the ith user (H_i) from the database created in the previous DNN-experiments. Then we constructed the RNN model that is tuned by H_i. In order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced with an LSTM memory cell. The constructed LSTM-RNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.
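As an illustration only, a per-user LSTM-RNN of the kind described could be assembled in Keras (the library the experiments below report using) roughly as follows; the layer sizes and the hyperparameter fields are hypothetical placeholders, since the actual values come from each user's PSO-optimized vector H_i:

```python
# Minimal sketch, not the authors' code: builds an LSTM-RNN for one user,
# assuming H_i supplies the PSO-selected hyperparameters as a dict.
from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_lstm_rnn(H_i, block_len):
    """Build an LSTM-RNN tuned by the (assumed) hyperparameter dict H_i."""
    units_per_layer = H_i["hidden_units"]       # e.g., [64, 64]; hypothetical field
    model = Sequential()
    for j, units in enumerate(units_per_layer):
        last = (j == len(units_per_layer) - 1)
        if j == 0:
            model.add(LSTM(units, return_sequences=not last,
                           input_shape=(block_len, 1)))
        else:
            model.add(LSTM(units, return_sequences=not last))
    model.add(Dense(1, activation="sigmoid"))   # 0 = normal user, 1 = masquerader
    model.compile(optimizer=H_i.get("optimizer", "adam"),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```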

Figure 6: The flowchart of the LSTM-RNN-experiments.

5.2. Dynamic Classification Approach. In contrast to the static classification approach, the dynamic classification approach does not need a ready-to-use dataset with static features [30]. It deals directly with raw data sources, such as text, image, video, sound, and signal files, and extracts features from them dynamically. The models that use this approach try to learn and represent features in an unsupervised manner. Then these models train themselves using the extracted features to be able to classify unseen data. Deep learning models fit this approach very well because the main objectives of deep learning models are the strong ability of automatic feature extraction and self-learning. Beyond that, dynamic classification models overcome the problem of the lack of datasets and perform more efficiently than static classification models. Despite these advantages, the dynamic classification approach also has drawbacks. Dynamic classification models are slower and take a long time to train compared with static classification models, due to the complex deep structure of these models as well as the huge amount of computations that are required to execute them. Furthermore, dynamic classification models require a very large number of input samples to reach high accuracy values.

In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection on these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of two classes, either a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Furthermore, sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. According to the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among the other machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of research has been carried out in the literature in the field of text classification using deep learning models. It was started by LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN) known as the LeNet family and used it efficiently in text classification [44]. Then various studies were published to introduce text classification algorithms as well as the factors that impact their performance [45-47]. In the study [48], the CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchical approach called Hierarchical Deep Learning for Text classification (HDLTex) is proposed, and three deep structures, which are DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced in [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification with a multitask learning framework [53]. The study [54] proposed a new model called the hierarchical attention network for document classification, which is tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the most used deep learning model for text classification tasks. So we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model which is biologically inspired by the animal visual cortex. The CNN can be considered a special type of the traditional feed-forward Artificial Neural Network. The major difference between the ANN and the CNN is that, instead of the fully connected architecture of the ANN, the individual neurons in the CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data is entered into the CNN. The first convolutional layer in the CNN consists of individual neurons that are each connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters; each filter is applied to the specified subset of their preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input. The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of their preceding layer's output. There are many pooling nonlinear functions that can be used, but among them max-pooling is the most used; it selects the maximum value in the given pooling window. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked convolutional layer and max-pooling layer pairs to extract features from the entire input and then map these features to their next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to hidden layers in the DNN. This means that neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of the CNN proposed in the literature, but the LeNet structure, proposed by LeCun et al. [44], is the most common approach, used in many applications of computer vision and text classification.

Figure 7: The architecture of the used CNN model.

Regarding its stability and high efficiency in text classification, we selected the CNN model proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place, where the used model encodes all letters in the input text file using a one-hot representation from a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the dropout type, with dropout probability equal to 0.5. In addition, the two fully connected layers in the used CNN model have a Sigmoid nonlinear activation function, and they have the same size of 2048 neurons each. The output layer in the used CNN model is of the dense type; it has a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate = 0.01, epochs = 30, and batch size = 64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].
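To make the shape of such a network concrete, here is a rough Keras sketch of a character-level CNN of the kind described (six convolution/max-pooling pairs, two 2048-unit sigmoid fully connected dropout layers, and a two-neuron softmax output). The input length, filter counts, kernel sizes, and pooling widths are illustrative assumptions, not the paper's exact settings:

```python
# Illustrative sketch only: a character-level CNN in the spirit of [50].
# The 70-channel one-hot alphabet, the two 2048-unit sigmoid dropout layers,
# and the 2-neuron softmax output follow the text; MAX_LEN, filter counts,
# kernel sizes, and pooling widths are assumed values.
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras.optimizers import SGD

ALPHABET_LEN, MAX_LEN = 70, 1014

model = Sequential()
model.add(Conv1D(256, 7, activation="relu",
                 input_shape=(MAX_LEN, ALPHABET_LEN)))
model.add(MaxPooling1D(2))
for kernel in (7, 3, 3, 3, 3):                  # the remaining five conv/pool pairs
    model.add(Conv1D(256, kernel, activation="relu"))
    model.add(MaxPooling1D(2))
model.add(Flatten())
for _ in range(2):                              # two fully connected dropout layers
    model.add(Dense(2048, activation="sigmoid"))
    model.add(Dropout(0.5))
model.add(Dense(2, activation="softmax"))       # 0 = normal, 1 = masquerader
model.compile(optimizer=SGD(lr=0.01),           # learning rate from the paper
              loss="categorical_crossentropy", metrics=["accuracy"])
```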

In our work, we used a CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we will have six separate CNN-experiments, and each experiment will be on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all the given data configuration's text files such that each file of them represents the training and test sets of a user in D. Next, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets T_i and Z_i, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we also moved each block in the training and test sets of the user U_i to a separate text file. This means that each of the training and test sets of the user U_i consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.
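Since each block becomes its own text file, the quantization step that turns one such file into the network's one-hot input can be sketched as follows; the alphabet string reproduces the 70-character set published in [50] (it contains a duplicated dash), while the truncation length is an assumption:

```python
# Sketch of character quantization into one-hot vectors (cf. [50]).
# The alphabet below is the 70-character set of Zhang et al.; MAX_LEN is
# an assumed truncation length, not a value stated in the paper.
import numpy as np

ALPHABET = ("abcdefghijklmnopqrstuvwxyz0123456789"
            "-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n")
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
MAX_LEN = 1014

def quantize_block(text, max_len=MAX_LEN):
    """Encode one block of UNIX commands as a (max_len, 70) one-hot matrix."""
    x = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:                # unknown characters stay all-zero
            x[pos, idx] = 1.0
    return x
```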

6. Results and Discussion

We carried out three major empirical experiments, which are DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.

Figure 8: The flowchart of the CNN-experiments.

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class  | Predicted: Normal User | Predicted: Masquerader
Normal User   | TN                     | FP
Masquerader   | FN                     | TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with TensorFlow 1.6 [61, 62], with a backend over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes for that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The used evaluation metrics' definitions and their corresponding equations are as follows (a short code sketch after the list illustrates how they can be computed from the four outcomes):

(i) Accuracy shows the rate of true detection over all test sets:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

(ii) Precision shows the rate of correctly classified masqueraders over all blocks in the test set that are classified as masqueraders:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

$$\text{F1-Score} = \frac{2}{1/P + 1/R} \tag{10}$$

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

$$\text{Hit Rate} = \frac{TP}{TP + FN} \tag{11}$$

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 - Hit); i.e., it shows the rate of masquerade blocks that are misclassified as a normal user over all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

$$\text{Miss Rate} = \frac{FN}{FN + TP} \tag{12}$$


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

$$\text{False Alarm Rate} = \frac{FP}{FP + TN} \tag{13}$$

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

$$\text{Cost} = MR + 6 \times FAR \tag{14}$$

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the base rate of incidence (Base-Rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

$$\text{BDR} = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^*) \times P(A \mid I^*)} \tag{15}$$

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior really indicates a normal user [65]. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

$$\text{BTNR} = P(I^* \mid A^*) = \frac{P(I^*) \times P(A^* \mid I^*)}{P(I^*) \times P(A^* \mid I^*) + P(I) \times P(A^* \mid I)} \tag{16}$$

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 - FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and true positive rate at one specific threshold, where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

$$g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \tag{17}$$

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of -1 to 1, where -1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

$$\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \tag{18}$$
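As referenced before the list, here is a minimal Python sketch (an illustration, not the authors' code) that computes these twelve metrics from the four outcome counts, following (7)-(18); zero-division guards are omitted for brevity:

```python
# Minimal sketch: the twelve evaluation metrics of (7)-(18), computed
# from the four outcome counts; values are returned as fractions.
from math import sqrt

def evaluation_metrics(TP, FP, TN, FN):
    P = TP + FN                      # masquerader blocks in the test set
    N = TN + FP                      # normal blocks in the test set
    hit  = TP / (TP + FN)            # (11); equals Recall (9)
    miss = FN / (FN + TP)            # (12)
    far  = FP / (FP + TN)            # (13)
    prec = TP / (TP + FP)            # (8)
    base_m, base_n = P / (P + N), N / (P + N)
    return {
        "Accuracy":  (TP + TN) / (TP + TN + FP + FN),           # (7)
        "Precision": prec,
        "Recall":    hit,
        "F1-Score":  2 / (1 / prec + 1 / hit),                  # (10)
        "Hit": hit, "Miss": miss, "FAR": far,
        "Cost": miss + 6 * far,                                 # (14)
        "BDR":  base_m * hit / (base_m * hit + base_n * far),   # (15)
        "BTNR": base_n * (1 - far)
                / (base_n * (1 - far) + base_m * miss),         # (16)
        "g-mean": sqrt(TP * TN / ((TP + FN) * (TN + FP))),      # (17)
        "MCC": (TP * TN - FP * FN)
               / sqrt((TP + FN) * (TP + FP) * (TN + FP) * (TN + FN)),  # (18)
    }
```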

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model in detecting masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1, as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. The rows labeled DNN and LSTM-RNN in Table 7 show the results of static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show the results of dynamic masquerade detection using the CNN model. Furthermore, the bold rows represent the best results within the same data configuration, whereas the underlined values are the best across all data configurations.

Table 7: The results of our experiments. Evaluation metrics (%).

Dataset   | Data Configuration  | Model    | Accuracy | Precision | Recall | F1-Score | Hit   | Miss  | FAR  | Cost  | BDR   | BTNR  | g-mean | MCC
SEA       | SEA                 | DNN      | 98.08    | 76.26     | 84.85  | 80.33    | 84.85 | 15.15 | 1.28 | 22.83 | 76.25 | 99.26 | 91.52  | 79.45
SEA       | SEA                 | LSTM-RNN | 98.52    | 82.30     | 86.58  | 84.39    | 86.58 | 13.42 | 0.90 | 18.83 | 82.33 | 99.34 | 92.63  | 83.64
SEA       | SEA                 | CNN      | 98.84    | 87.77     | 87.01  | 87.39    | 87.01 | 12.99 | 0.59 | 16.51 | 87.72 | 99.37 | 93     | 86.78
SEA       | SEA 1v49            | DNN      | 96.54    | 99.98     | 96.43  | 98.17    | 96.43 | 3.57  | 0.48 | 6.47  | 99.98 | 52.04 | 97.96  | 70.64
SEA       | SEA 1v49            | LSTM-RNN | 97.86    | 99.98     | 97.79  | 98.87    | 97.79 | 2.21  | 0.38 | 4.48  | 99.98 | 63.70 | 98.7   | 78.74
SEA       | SEA 1v49            | CNN      | 98.78    | 99.99     | 98.74  | 99.36    | 98.74 | 1.26  | 0.19 | 2.40  | 99.99 | 75.51 | 99.27  | 86.22
Greenberg | Greenberg Truncated | DNN      | 93.97    | 92.23     | 80.67  | 86.06    | 80.67 | 19.33 | 2.04 | 31.57 | 92.22 | 94.41 | 88.89  | 82.53
Greenberg | Greenberg Truncated | LSTM-RNN | 94.72    | 94.88     | 81.53  | 87.70    | 81.53 | 18.47 | 1.32 | 26.39 | 94.87 | 94.68 | 89.7   | 84.76
Greenberg | Greenberg Truncated | CNN      | 95.43    | 96.16     | 83.53  | 89.40    | 83.53 | 16.47 | 1.0  | 22.47 | 96.16 | 95.24 | 90.94  | 86.86
Greenberg | Greenberg Enriched  | DNN      | 97.57    | 96.92     | 92.40  | 94.61    | 92.40 | 7.60  | 0.88 | 12.88 | 96.92 | 97.75 | 95.7   | 93.08
Greenberg | Greenberg Enriched  | LSTM-RNN | 97.98    | 97.57     | 93.60  | 95.54    | 93.60 | 6.40  | 0.70 | 10.60 | 97.56 | 98.10 | 96.41  | 94.28
Greenberg | Greenberg Enriched  | CNN      | 98.60    | 98.55     | 95.33  | 96.92    | 95.33 | 4.67  | 0.42 | 7.19  | 98.55 | 98.61 | 97.43  | 96.03
PU        | PU Truncated        | DNN      | 81.0     | 99.59     | 78.61  | 87.86    | 78.61 | 21.39 | 2.25 | 34.89 | 99.59 | 39.49 | 87.66  | 54.63
PU        | PU Truncated        | LSTM-RNN | 82.19    | 99.69     | 79.89  | 88.70    | 79.89 | 20.11 | 1.75 | 30.61 | 99.68 | 41.10 | 88.6   | 56.46
PU        | PU Truncated        | CNN      | 83.75    | 99.74     | 81.64  | 89.79    | 81.64 | 18.36 | 1.50 | 27.36 | 99.73 | 43.38 | 89.68  | 58.79
PU        | PU Enriched         | DNN      | 90.44    | 99.84     | 89.21  | 94.23    | 89.21 | 10.79 | 1.0  | 16.79 | 99.84 | 56.72 | 93.98  | 70.64
PU        | PU Enriched         | LSTM-RNN | 91.31    | 99.88     | 90.18  | 94.78    | 90.18 | 9.82  | 0.75 | 14.32 | 99.88 | 59.08 | 94.61  | 72.61
PU        | PU Enriched         | CNN      | 93.75    | 99.92     | 92.93  | 96.30    | 92.93 | 7.07  | 0.50 | 10.07 | 99.92 | 66.78 | 96.16  | 78.52

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of the DNN hyperparameters to maximize accuracy, which means that the sum of the TP and TN outcomes will be increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN will definitely lead to an increase in the value of Hit as well as a decrease in the value of FAR. Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than the corresponding values of the SEA data configuration, the Hit values are dramatically increased in SEA 1v49 for all models, by 10-14% over those of the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49, compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models, in terms of all used evaluation metrics, over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened in the PU dataset, where its PU Enriched data configuration has better results, regarding all models, than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, regarding all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Because of that, enriched data configurations help models build the user's behavior profile more accurately than truncated data configurations do. Regarding all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted, for all used models, from higher to lower according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset, as well as across all datasets.

Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient.

By taking an inspective look at Figures 9 and 10, we can notice the stability of the deep learning models, in such a way that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain that, we will discuss the obtained results from the perspective of the static and dynamic masquerade detection techniques. We used the DNN and LSTM-RNN models to perform a static masquerade detection task on data configurations with static numeric features. The DNN, as well as the LSTM-RNN, is supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Given the importance of the former fact, our DNN and LSTM-RNN models output masquerade detection outcomes that are as good as they can reach for every user in the particular data configuration. Accordingly, their performance will be enhanced significantly on that particular data configuration. This enhancement of their performance will also be affected by the structure of the data configuration, which differs from one to another. Anyway, LSTM-RNN performed better than DNN in terms of all used evaluation metrics regarding all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small, being between 1% and 3% for Hit and Accuracy and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is command text files for each user in the particular data configuration. The obtained results show clearly that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then presented to its fully connected layers to train itself to build the user's normal profile, which will be used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, such as Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, the dynamic masquerade detection technique is better than the static one. This gives the impression that dynamic masquerade detection is the best choice for masquerade detection regarding UNIX command line-based datasets, due to the fact that these datasets are originally textual datasets, and converting them to static numeric datasets may lose a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Figure 10: Evaluation metrics comparison for the average performance of the models on datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient.

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR will increase as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR will increase as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used deep learning models got high g-mean percentages for all data configurations. The same thing happened with the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

Table 8: The results of statistical tests.

Measurement | Friedman Test: FS | FC | Wilcoxon Test: p1 (W, P value) | p2 (W, P value) | p3 (W, P value)
TP          | 12                | 7  | 0, 0.0025                      | 0, 0.0025       | 0, 0.0025
FP          | 12                | 7  | 0, 0.0025                      | 0, 0.0025       | 0, 0.0025
TN          | 12                | 7  | 0, 0.0025                      | 0, 0.0025       | 0, 0.0025
FN          | 12                | 7  | 0, 0.0025                      | 0, 0.0025       | 0, 0.0025

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume your data comes from a particular distribution. In our case, we have three repeated treatments (k = 3), one for each of the used deep learning models, and six subjects (N = 6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k = 2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N = 6) in every treatment and three paired groups, namely, p1 = (DNN, LSTM-RNN), p2 = (DNN, CNN), and p3 = (LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (alpha). We selected alpha = 0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
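Tests of this kind can be reproduced with SciPy, whose friedmanchisquare and wilcoxon routines implement the two tests; the paper does not state which tooling it used, so the following is only an illustrative sketch with hypothetical per-configuration scores:

```python
# Illustrative sketch using SciPy (not necessarily the authors' tooling):
# the Friedman test across the three models and pairwise Wilcoxon tests.
# dnn, lstm, cnn each hold one score per data configuration (N = 6);
# the numbers below are hypothetical placeholders.
from scipy.stats import friedmanchisquare, wilcoxon

dnn  = [231, 118, 1200, 480, 390, 260]
lstm = [240, 121, 1250, 500, 400, 270]
cnn  = [250, 125, 1300, 520, 410, 280]

fs, p_friedman = friedmanchisquare(dnn, lstm, cnn)
print("Friedman statistic:", fs, "p-value:", p_friedman)

for name, (a, b) in {"p1": (dnn, lstm), "p2": (dnn, cnn),
                     "p3": (lstm, cnn)}.items():
    w, p = wilcoxon(a, b)
    print(name, "W:", w, "p-value:", p)
```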

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases, because FS > FC. This means that the scores of the used deep learning models for each measurement are different. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models. In our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test, because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different. Finally, the reason for the identical results across all measurements is that the models, in order (CNN, LSTM-RNN, DNN), have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations.

Figures 12(a)-12(e) show a comparison between the performance of traditional machine learning models and the used deep learning models, in terms of Hit and FAR percentages, for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, owing to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN, as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased Hit percentages, by 2-10%, and decreased FAR percentages, by 1-10%, compared with the traditional machine learning models in most cases.

Figure 12: Models performance comparison for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, which means that 50% of the performance is achieved. The top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations. The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, have the most effective masquerade detection performance over all data configurations. However, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing quantitatively between various ROC curves [73]. The AUC value of a ROC curve should be between 0 and 1; the ideal classifier will have an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the used three deep learning models, which are plotted in Figure 13. We can notice clearly that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
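For completeness, curves and AUC values of this kind are commonly produced with scikit-learn's roc_curve and auc functions (an assumption about tooling, since the paper does not say how its curves were drawn); y_true and y_score below are hypothetical stand-ins for pooled test labels and model scores:

```python
# Illustrative sketch: plotting a ROC curve and computing its AUC with
# scikit-learn; the label/score lists are hypothetical placeholders.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                  # 1 = masquerader block
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]  # model scores

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="reference")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```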

Table 9: AUC values of the ROC curves of the used models.

Model    | AUC
DNN      | 0.9246
LSTM-RNN | 0.9385
CNN      | 0.9617

Figure 13: ROC curves of the average performance of the used models over all data configurations.

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Even though various research studies have focused on masquerade detection for more than one decade, deep studies in this field utilizing deep learning models are seldom found. In this paper, we presented an extensive empirical study of masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most used in the literature. In addition, we implemented six different data configurations from these datasets. The masquerade detection on these data configurations was carried out using two approaches: the first is static and the second is dynamic. The static approach was performed using the DNN and LSTM-RNN models, which were applied on data configurations with static numeric features, and the dynamic approach was performed using the CNN model, which extracted features from users' command text files dynamically. In order to solve the problem of hyperparameters selection, as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and was used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models achieved success in masquerade detection regarding the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static masquerade detection. However, the results analyses proved the effectiveness of all used models in masquerade detection, in such a way that they increased Accuracy and Hit as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools that can be used in the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon request from its owner, at https://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.
[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.
[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion detection: detecting masquerade attacks using UNIX command lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.
[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.
[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.
[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long short term memory recurrent neural network classifier for intrusion detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.
[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.
[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.
[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219-228, USA, June 2002.
[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.
[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.
[12] K. H. Yung, "Using self-consistent naive-Bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.
[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.
[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625-628, China, October 2006.
[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.
[16] S. Greenberg, "Using Unix: collected traces of 168 users," Tech. Rep. 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.
[17] R. A. Maxion, "Masquerade detection using enriched command lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.
[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676-3679, China, September 2007.
[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.
[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.
[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-Bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.
[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based acceleration of deep neural networks using high level method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824-827, Poland, November 2015.
[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for hyper-parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.
[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.
[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951-2959, USA, December 2012.
[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the multilayer feed-forward artificial neural networks architecture and training parameters using genetic algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.
[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep multi-task learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141-146, Belgium, April 2016.
[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: a new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69-74, China, August 2016.
[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.
[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving deep neural networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659-1666, Spain, June 2017.
[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481-488, New York, NY, USA, July 2017.
[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864-1871, New York, NY, USA, July 2017.
[33] J. Nalepa and P. R. Lorenzo, "Convergence analysis of PSO for hyper-parameter selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.
[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.
[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.
[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.
[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.
[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945-1950, 1999.
[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.
[40] M. Clerc and J. Kennedy, "The particle swarm: explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.
[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.
[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.
[45] X. Zhang and Y. LeCun, "Text understanding from scratch," https://arxiv.org/abs/1502.01710v5.
[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.
[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.
[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.
[49] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.
[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.
[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: hierarchical deep learning for text classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.
[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent convolutional neural networks for text classification," AAAI, vol. 333, pp. 2267-2273, 2015.
[53] P. Liu, X. Qiu, and X. Huang, "Recurrent neural network for text classification with multi-task learning," https://arxiv.org/abs/1605.05101v1.
[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.
[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.
[56] S. Albelwi and A. Mahmood, "A framework for designing the architectures of deep convolutional neural networks," Entropy, vol. 19, no. 6, p. 242, 2017.
[57] "Python," https://www.python.org.
[58] "NumPy," http://www.numpy.org.
[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.
[60] "Keras," https://keras.io.
[61] M. Abadi, A. Agarwal, P. Barham, et al., "TensorFlow: large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.
[62] "TensorFlow," https://www.tensorflow.org.
[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.
[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.
[65] S. Axelsson, "The base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.
[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.
[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.
[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.
[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.
[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.
[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.
[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.
[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.


problem, and the limitation to static classification applications [4]. In 2006, a new concept of representation learning based on Artificial Neural Networks, called deep learning, was put forward. Deep learning is considered a class of machine learning techniques that has, in hierarchical architectures, many layers of information processing stages for pattern recognition or classification. Besides overcoming the former deficiencies of shallow machine learning methods, it has recently achieved great success in many research fields. The main advantages of deep learning can be summarized as its practicability, its ability to perform unsupervised feature learning or extraction from datasets, and its strong self-learning capability [5]. There are four typical models of deep learning, namely, Autoencoder (AE), Deep Belief Networks (DBN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). Because of its success and stability, deep learning has been actively and continually used in a wide range of applications nowadays, such as computer vision, natural language processing, and intrusion detection systems [4, 5].

To our knowledge, none of the previous studies in the area of masquerade detection used deep learning to utilize its great capabilities and various learning models. The aim and contribution of this research are sixfold, as follows: (i) we performed a comprehensive empirical study which investigates the effectiveness of three binary classification deep learning models to detect masqueraders' attacks; (ii) it is the first study that uses three famous UNIX command line datasets with their different (six) data configurations and compares between them; (iii) we proposed a Particle Swarm Optimization-based algorithm for DNN hyperparameters selection; (iv) we carried out our experiments on all data configurations using both static and dynamic masquerade detection approaches; (v) we assessed the performance of the used deep learning models using twelve well-known evaluation metrics, the Wilcoxon and Friedman statistical tests, and ROC analysis; (vi) we made comparisons between the deep learning models' results and the best results of the traditional machine learning methods that have been published in the literature in the field of masquerade detection.

The rest of this paper is organized as follows. Section 2 reviews the related work that has been published previously in the area of masquerade detection using traditional machine learning methods and UNIX command line datasets. Then Section 3 describes the UNIX command line datasets and their data configurations in detail. Section 4 presents a Particle Swarm Optimization-based algorithm to select the hyperparameters of Deep Neural Networks (DNN). Section 5 shows how our experiments are established and which deep learning models are used, and Section 6 presents the evaluation metrics that are used as well as an analysis of the obtained experimental results. Finally, Section 7 presents our conclusions and possible future work.

2. Related Work

Masquerade detection has been actively researched in the last decade due to its significance to the computer security area. For the sake of brevity and to restrict the scope of this study, we have principally focused on anomaly-based masquerade detection using machine learning approaches and the well-known UNIX command line-based datasets in the literature.

This line of research was first introduced by Schonlau et al. [7], who proposed a UNIX command line-based dataset called SEA. They also utilized various statistical methods on the SEA data configuration and compared the results. In a short time, the SEA dataset became very popular in the field of anomaly-based masquerade detection. T. Okamoto et al. [8] presented an immunity-based Hidden Markov Model on the SEA data configuration, and they obtained a 60% Hit and a 1% False Alarm Rate (FAR). Naive Bayes is a famous classifier that works well with text classification tasks. It was first applied on the SEA data configuration by Roy A. Maxion and Tahlia N. Townsend in 2002 [9], with two models: one with updating of the user's profile (Hit=61.5%, FAR=1.3%) and the other with no updating (Hit=66.2%, FAR=4.6%). Moreover, they proposed a new data configuration derived from the SEA dataset, named SEA 1v49, and also tested the Naive Bayes classifier with updating on the SEA 1v49 data configuration, where they obtained a 62.8% Hit and a 4.6% FAR. K. Wang et al. in [10] implemented on the SEA data configuration a Naive Bayes classifier (Hit=70%, FAR=2%) and a One-Class Support Vector Machine (OCSVM) model (Hit=70%, FAR=4%). In the study [11], K. H. Yung presented a Naive Bayes classifier with updating and feedback, which was applied to the SEA data configuration (Hit=76%, FAR=2%). He developed his previous work and proposed a self-consistent Naive Bayes model with updating on the SEA data configuration in 2004 [12]. He obtained better results and increased the Hit to 79%, but the FAR remained 2%.

The Support Vector Machine (SVM) is also a well-known machine learning method that is used for both classification and regression. Chen and Aritsugi introduced an SVM-based method for masquerade detection with online updating using an Eigen Co-occurrence Matrix, which was applied to the SEA data configuration [13]. They tested their proposed method for One-Class (Hit=62.77%, FAR=6%) as well as for Two-Class (Hit=72.24%, FAR=3%) classification models. In 2006, Z. Li et al. extracted the principal features of the user's behavior from a Correlation Eigen Matrix using Principal Component Analysis (PCA), and then fed these features to an SVM-based masquerade detection system on the SEA data configuration [14]. They obtained a very good result, with Hit=82.6% and FAR=3%. H. S. Kim and S. D. Cha performed an empirical study in the field of masquerade detection using an SVM classifier with a voting engine [15]. They tested their SVM classifier on two UNIX command line-based datasets, namely, the SEA dataset and the Greenberg dataset [16], the latter proposed by Greenberg in 1988. For the SEA dataset, they applied their SVM classifier on two different data configurations, namely, the SEA data configuration (Hit=80.1%, FAR=9.7%) and the SEA 1v49 data configuration (Hit=94.8%, FAR=0%). In addition, they applied their SVM classifier on two different data configurations of the Greenberg dataset, namely, the Greenberg Truncated and Greenberg Enriched data configurations, which were proposed by Maxion [17]. For the Greenberg Truncated data configuration they had Hit=71.1% and FAR=6%; meanwhile, they had Hit=87.3% and FAR=6.4% for the Greenberg


Table 1: Best results of the related works.

Model                   | Dataset   | Configuration       | Hit (%) | FAR (%)
HMM                     | SEA       | SEA                 | 60      | 1
Naive Bayes             | SEA       | SEA                 | 79      | 2
                        | SEA       | SEA 1v49            | 62.8    | 4.6
                        | Greenberg | Greenberg Truncated | 70.9    | 4.7
                        | Greenberg | Greenberg Enriched  | 82.1    | 5.7
Conditional Naive Bayes | SEA       | SEA                 | 84      | 8.8
                        | SEA       | SEA 1v49            | 90.7    | 1
                        | Greenberg | Greenberg Enriched  | 84.13   | 9.4
                        | PU        | PU Enriched         | 84      | 8
SVM                     | SEA       | SEA                 | 82.6    | 3
                        | SEA       | SEA 1v49            | 94.8    | 0
                        | Greenberg | Greenberg Truncated | 71.1    | 6
                        | Greenberg | Greenberg Enriched  | 87.3    | 6.4
                        | PU        | PU Enriched         | 60      | 2
Tree-based              | PU        | PU Enriched         | 85      | 10

Table 2: Datasets and their characteristics.

Dataset Name | Hosts Platform | No. of Users | Audit Format  | Enriched | Contaminated | Sessions | Real Masquerades | Year
SEA          | Unix           | 50           | Unix Commands | No       | Yes          | No       | No               | 2001
Greenberg    | Unix           | 168          | Unix Commands | Yes      | No           | Yes      | No               | 1988
PU           | Unix           | 8            | Unix Commands | Yes      | No           | Yes      | No               | 1997

Enriched data configuration. In 2007, Yang et al. presented a One-Class SVM with a string kernel classifier to detect masquerade attacks [18]. They tested their classifier on two UNIX command line-based datasets, namely, the SEA dataset and the PU dataset [19], the latter proposed by Lane and Brodley in 1997. For the SEA dataset, they applied their model on the SEA data configuration (Hit=62%, FAR=1.5%), and for the PU dataset, they applied their model on the PU Enriched data configuration (Hit=60%, FAR=2%), which is proposed in [19].

In the study [17], a Naive Bayes model with updating of the users' profiles was introduced in 2003 on both the Greenberg Truncated and Greenberg Enriched data configurations, where the Greenberg Truncated data configuration gave a Hit=70.9% and a FAR=4.7%, and the Greenberg Enriched data configuration gave a Hit=82.1% and a FAR=5.7%. Gebski and Wong [20] presented a tree-based model for masquerade detection on the PU Enriched data configuration (Hit=85%, FAR=10%). Reddy et al. proposed a conditional Naive Bayes classifier to detect masquerades [21]. They tested their classifier on three different UNIX command line-based datasets, namely, the SEA, Greenberg, and PU datasets. For the SEA dataset, they applied their classifier on two data configurations, namely, the SEA data configuration (Hit=84%, FAR=8.8%) and the SEA 1v49 data configuration (Hit=90.7%, FAR=1%). For the Greenberg dataset, they applied their classifier on the Greenberg Enriched data configuration (Hit=84.13%, FAR=9.4%). Finally, they tested their classifier on the PU Enriched data configuration, and they obtained a Hit=84% and a FAR=8%. Table 1 presents a summarization of the best results of the previous works above in terms of Hit percentage for each dataset. As we can notice from Table 1, developing masquerade detection models with higher Accuracy and Hit as well as lower FAR values is still a big challenge.

3. Datasets and Configurations

This section describes the datasets that we used in our study, the data configurations, and the methodology of training and testing as well. Indeed, there are various mechanisms that could be used to collect information about each user to model his behavior and then build his normal profile, such as the user's command line history, graphical user interface (GUI) usage, user file system navigation, and system calls at the operating system level. In this paper, we selected three datasets based on the UNIX command line history of users, namely, SEA, Greenberg, and PU. Besides being free and publicly available on the Internet, they are the most commonly used datasets in the anomaly-based masquerade detection area, so our results can easily be compared to previous ones. Table 2 shows the datasets and their characteristics.


3.1. SEA Dataset. Recently published papers that focus on the masquerade detection area have used this dataset. SEA (Schonlau Et Al.) is a free UNIX command line-based dataset [7]. The authors used the UNIX acct audit tool to collect commands from 50 different users for several months. The SEA dataset contains a set of 15000 commands for every user, and these commands contain only the command names issued by that user. For each user, the set of 15000 commands is divided into 150 blocks, each with 100 commands. The first 50 blocks of each user are considered genuine and used as a training set. The remaining 100 blocks of each user are considered as a test set. Some of the test blocks are contaminated randomly with the data of other users; i.e., each user has a varying number of masquerader blocks in his test set, from 0 to 24 blocks. Two associated data configurations have been used with this dataset in the literature: SEA and SEA 1v49.
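As a concrete illustration, the per-user block structure described above can be produced with a few lines of code. The sketch below is our own minimal example, not the authors' code; the file name User7 and the one-command-per-line layout are assumptions matching the SEA distribution:

```python
def segment_into_blocks(commands, block_size=100):
    """Split a flat list of command names into consecutive fixed-size blocks."""
    n_full = len(commands) // block_size * block_size
    return [commands[i:i + block_size] for i in range(0, n_full, block_size)]

# Hypothetical SEA per-user file: one command name per line.
with open("User7") as f:
    commands = [line.strip() for line in f]

blocks = segment_into_blocks(commands, block_size=100)  # 150 blocks for SEA
train_blocks = blocks[:50]     # first 50 blocks: genuine behavior (training)
test_blocks = blocks[50:150]   # remaining 100 blocks: test set, possibly contaminated
```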

3.1.1. SEA. This data configuration was proposed in the study [7]. A separate classifier is built for each of the 50 users. We trained each classifier to build two profiles: one profile for self-behavior, using the first 50 blocks of the particular user, and the other profile for non-self-behavior, using the (49 × 50) training blocks of the other 49 users. The test set of each user is the same as described in Section 3.1.

3.1.2. SEA 1v49. In this configuration, we followed the same methodology proposed in the research [9]. A classifier is built for each user and trained only with the first 50 training blocks of its data. On the other hand, the test set for each user consists of the first 50 training blocks of each of the other 49 users, resulting in 2450 masquerade blocks, in addition to his original normal blocks, which vary between 76 and 100 blocks.
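To make the SEA 1v49 construction concrete, the following minimal sketch (ours, not the paper's code) assembles the training and test sets for one user; blocks_of and genuine_test_blocks are hypothetical helpers, the first returning a user's 150 blocks and the second his uncontaminated test blocks:

```python
def sea_1v49_split(user, all_users, blocks_of, genuine_test_blocks):
    """Build the SEA 1v49 train/test sets for `user` (labels: 0=normal, 1=masquerader)."""
    train = blocks_of(user)[:50]                 # the user's own 50 training blocks
    test_normal = genuine_test_blocks(user)      # his 76-100 uncontaminated test blocks
    test_masq = [b for other in all_users if other != user
                 for b in blocks_of(other)[:50]] # 49 x 50 = 2450 masquerade blocks
    labels = [0] * len(test_normal) + [1] * len(test_masq)
    return train, test_normal + test_masq, labels
```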

3.2. Greenberg Dataset. This dataset was proposed in [16] and is widely used in previous works. It contains commands collected from 168 UNIX users that used the csh shell. Users of this dataset are considered to be members of one of the following four groups: novice programmers, experienced programmers, computer scientists, and nonprogrammers. This dataset is enriched; i.e., it has sessions for each user, including information about the start and end time of the session, the working directory, command names, command parameters, command aliases, and an error flag. Two associated data configurations have been used with this dataset in the literature: Greenberg Truncated and Greenberg Enriched.

3.2.1. Greenberg Truncated. In this configuration, we followed the same methodology conducted by [17]. First, we extracted the truncated command lines from the Greenberg dataset, which contain only the command names. Next, from the 168 users available in the Greenberg dataset, we randomly selected 50 users who have between 2000 and 5000 commands to act as normal users. Then we divided the commands of each of the 50 users into blocks, each with 10 commands. The first 100 blocks of each user form his training set, whereas the next 100 blocks are used as a validation of self-behavior in his test set. After that, we randomly selected an additional 25 users from the remaining 118 users to act as masqueraders. Then, for each of the 50 normal users, we randomly selected 30 blocks from the masqueraders' data and inserted them at random positions in his test set, which results in a total of 130 blocks for testing.

3.2.2. Greenberg Enriched. This follows the same methodology explained for Greenberg Truncated, with only one difference: for this data configuration, we extracted only the enriched command lines from the Greenberg dataset. An enriched command line means a concatenation of the command name and the command parameters entered by the user, together with any alias employed. As for the Greenberg Truncated data configuration described above, the Greenberg Enriched data configuration has, for each of the 50 normal users, 100 blocks for training and 130 blocks for testing.

3.3. PU Dataset. The Purdue University (PU) dataset was proposed in [19]. It contains sanitized commands collected from 8 different users at Purdue University over the course of up to 2 years. This dataset is enriched, which means that it contains, in addition to command names, command parameters, flags, and shell meta-characters. Furthermore, this dataset has sessions for each of the 8 users. In addition, the data of each user is processed into a token stream; a token here means either a command name or a command parameter. Two associated data configurations have been used with this dataset in the literature: PU Truncated and PU Enriched.

3.3.1. PU Truncated. For this configuration, we followed the same methodology used in [19]. First, we extracted only the truncated tokens from the PU dataset, i.e., the tokens that contain only command names. Next, for each of the 8 users available in the PU dataset, we divided his data into blocks, each of 10 tokens. The first 150 blocks of each user are considered his training set. The next 50 blocks of each user are used as a validation of self-behavior in his test set. To simulate masquerade activities, we added, for each user, the testing data of the other seven users (7 × 50), which results in a total of 400 blocks of testing for each of the 8 users.

3.3.2. PU Enriched. This follows the same methodology explained for PU Truncated, with only one difference: for the PU Enriched data configuration, we extracted only the enriched tokens, i.e., all tokens from the PU dataset. As for the PU Truncated data configuration described in Section 3.3.1, the PU Enriched data configuration has, for each of the 8 users, 150 blocks for training and 400 blocks for testing. Table 3 summarizes all details about the data configurations.

4. DNN Hyperparameters Selection

In this section, we present a Particle Swarm Optimization-based algorithm to select the hyperparameters of Deep Neural Networks (DNN). This algorithm will help us to proceed in our experiments to construct a DNN for masquerade detection, as will be explained in Section 5.1. A DNN is a multilayer Artificial Neural Network with many hidden layers. The weights of a DNN are fully connected, i.e., every neuron at any particular layer is connected to all neurons of the higher-order layer that is located adjacently


Table 3: The structure of the used data configurations.

Characteristics                        | SEA    | SEA 1v49  | Greenberg Truncated | Greenberg Enriched | PU Truncated | PU Enriched
Number of users                        | 50     | 50        | 50                  | 50                 | 8            | 8
Block size                             | 100    | 100       | 10                  | 10                 | 10           | 10
Blocks per user: training set          | 2500   | 50        | 100                 | 100                | 150          | 150
Blocks per user: test set              | 100    | 2526~2550 | 130                 | 130                | 400          | 400
Blocks per user: total                 | 2600   | 2576~2600 | 230                 | 230                | 550          | 550
Blocks for all users: training set     | 125000 | 2500      | 5000                | 5000               | 1200         | 1200
Blocks for all users: test set         | 5000   | 127269    | 6500                | 6500               | 3200         | 3200
Blocks for all users: total            | 130000 | 129769    | 11500               | 11500              | 4400         | 4400
Training set distribution: normal      | 2500   | 2500      | 5000                | 5000               | 1200         | 1200
Training set distribution: masquerader | 122500 | 0         | 0                   | 0                  | 0            | 0
Training set distribution: total       | 125000 | 2500      | 5000                | 5000               | 1200         | 1200
Test set distribution: normal          | 4769   | 4769      | 5000                | 5000               | 400          | 400
Test set distribution: masquerader     | 231    | 122500    | 1500                | 1500               | 2800         | 2800
Test set distribution: total           | 5000   | 127269    | 6500                | 6500               | 3200         | 3200

Figure 1: The basic structure of a typical DNN (input layer I1..Im, stacked hidden layers, and output layer O1..On).

to that particular layer [4]. The information in a DNN is propagated in a feed-forward manner, that is, from inputs to outputs via the hidden layers. Figure 1 depicts the basic structure of a typical DNN.

DNNs are widely used in various machine learning tasks. In addition, they have proved their ability to surpass most machine learning techniques in terms of performance [22]. However, the performance of any DNN relies on the selection of the values of its hyperparameters. DNN hyperparameters are defined as a set of critical parameters that control the architecture, behavior, and performance of that DNN in the underlying machine learning task. Indeed, there are two kinds of such hyperparameters: global parameters and layer-based parameters. The global parameters are those that define the general behavior of the DNN, such as the learning rate, the number of epochs, the batch size, the number of layers, and the used optimizer. On the other hand, the values of the layer-based parameters are dependent on each layer in the DNN. Examples of layer-based parameters are, but not limited to, the type of layer, the weight initialization method, the activation function, and the number of neurons.

The problem is that these hyperparameters vary from task to task, and they must be set before the training process. One familiar solution to overcome this problem is to find an expert who is conversant with the underlying machine learning task to tune the DNN hyperparameters precisely. Unfortunately, such an expert is not available in all cases. Another possible solution is to adjust these hyperparameters manually in a trial-and-error manner. This can be handled by searching the space of hyperparameters by executing either a grid search or a random search [23, 24]. A grid search is performed over defined ranges of hyperparameters, where those ranges are identified previously depending on prior knowledge of the underlying task. After that, the user picks values of the hyperparameters from the predefined ranges consecutively and tests the performance of the DNN on the training set. When all possible combinations of hyperparameters values have been tested, the best combination is selected to configure the DNN and test it on the test set. A random search is similar to a grid search, but instead of picking hyperparameters values in a methodical manner, the user selects hyperparameters values from those predefined ranges randomly. In 2012, Snoek et al. proposed a hyperparameters selection method based on Bayesian optimization [25]. In this method, the user improves his knowledge of selecting hyperparameters by using the information gained from any given experiment to decide how to adjust the hyperparameters for the next experiment. Despite the good results that have been obtained by grid, random, and Bayesian optimization searches in some cases, in general, the complexity and the large search space of the DNN hyperparameters values make such manual algorithms infeasible and an exhausting searching process.

Evolutionary Algorithms (EAs) are metaheuristic algorithms which perform excellently in finding the global optima of a nonlinear function, especially when there are multiple local minima or maxima. EAs are considered very promising algorithms for solving the problem of DNN parameterization automatically. In the literature, many studies have been proposed recently that aim at using EAs to optimize DNN hyperparameters in order to gain as high an accuracy value as possible. The Genetic Algorithm (GA), which is one of the most famous EAs, has been used to optimize the network parameters, and the Taguchi method is applied between the crossover and mutation operators, including the definition of the initial weights [26]. GAs are also used in the pretraining step, prior to the supervised step, based on a multiclass classification task [27]. Another approach using GA to reduce the training time has been presented in [28]. The GA is used to enhance Deep Neural Networks by evolving a neural network's weights [29]. An automated GA-based approach has been proposed in [30] that optimizes DNN hyperparameters for malware classification tasks. Moreover, Particle Swarm Optimization is also one of the most well-known and popular EAs. Lorenzo et al. used PSO and proposed two approaches, the first sequential and the second parallel, to optimize the hyperparameters of any DNN [31, 32]. Then Nalepa and Lorenzo formally proved the convergence abilities of the former two approaches and tested them separately on a single workstation and on a cluster, for the sequential and parallel approaches, respectively [33]. Finally, F. Ye proposed in 2017 an automatic PSO-based algorithm to select DNN hyperparameters for large-scale and high-dimensional data [34]. Thus, we decided to use PSO to enable us to select the hyperparameters of the DNN automatically. In Section 5.1, we will explain how to adapt this algorithm for the static classification experiments used in a masquerade detection scenario. Section 4.1 introduces a necessary and brief preface reviewing how the standard PSO works. The rest of this section presents our proposed PSO-based algorithm to optimize DNN hyperparameters.

4.1. Particle Swarm Optimization. Particle Swarm Optimization (PSO) is a metaheuristic algorithm for optimizing nonlinear functions in continuous search space. It was proposed by Eberhart and Kennedy in 1995 [35]. PSO tries to mimic the social behavior of animals. The swarm concept is a set of many members, which are called particles. The number of particles in the swarm is an integer value denoted by S and called the swarm size. Every particle in the particular swarm has two vectors of length N, where N is the size of the problem's defined variables (dimensions). The first vector is called the position vector, denoted by P, which identifies the current position of that particle in the search space of the problem. Each position vector can be considered a candidate solution of the problem. The second vector is called the velocity vector, denoted by V, which determines both the speed and the direction of that particle in the search space at the next iteration. During the execution of PSO, another two vectors should be stored at every iteration. The first is called the personal best vector, denoted by $P_i^{best}$, which indicates the best position of the i-th particle in the swarm that has been explored so far. Each particle has its own personal best vector, independent from the other particles, and it is updated at each iteration. The second vector is the global best vector, denoted by $G_{best}$, which indicates the best position that has been found over the swarm so far. There is a single global best vector for all particles in the swarm, and it is updated at every iteration. The personal best vector can be viewed as the cognitive knowledge of the particle, whereas the global best vector represents the social knowledge of the swarm. Mathematically, for each particle i in the swarm S at each iteration t, the velocity V and position P vectors are updated to the next iteration t+1 according to (1) and (2), respectively:

$$V_i^{t+1} = W\,V_i^{t} + C_1 r_1(t)\,(P_i^{best} - P_i^{t}) + C_2 r_2(t)\,(G_{best} - P_i^{t}) \quad (1)$$

$$P_i^{t+1} = P_i^{t} + V_i^{t+1} \quad (2)$$

W is the inertia weight constant, which controls the impact of the velocity of the particle at the current iteration on the next iteration, so that the speed and direction of the particle are adjusted in order not to let the particle get outside the search space of the problem. Meanwhile, C1 and C2 are constants known as acceleration coefficients, and r1 and r2 are random values uniformly distributed in [0, 1]. At the beginning of every iteration, new values of r1 and r2 are computed randomly, and they are constant for all particles in the swarm at that iteration. The goal of using the C1, C2, r1, and r2 constants is to scale both the cognitive knowledge of the particle and the social knowledge of the swarm on the velocity changes, so that the new position vectors of all particles approach the optimal solution of the problem accordingly. Figure 2 depicts the flowchart of the standard PSO.
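For clarity, one PSO iteration implementing (1) and (2) can be written in a few vectorized lines. This is a generic sketch of the standard update, not the authors' implementation:

```python
import numpy as np

def pso_step(P, V, P_best, G_best, W=0.9, C1=2.0, C2=2.0, rng=np.random):
    """One velocity/position update for all S particles at once.
    P, V, and P_best are (S, N) arrays; G_best is an (N,) array."""
    r1 = rng.uniform(0.0, 1.0)  # r1 and r2 are drawn once per iteration and
    r2 = rng.uniform(0.0, 1.0)  # shared by all particles, as described above
    V_new = W * V + C1 * r1 * (P_best - P) + C2 * r2 * (G_best - P)  # Eq. (1)
    P_new = P + V_new                                                # Eq. (2)
    return P_new, V_new
```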

In brief, the standard PSO works as follows. First, the user enters some required inputs, like the swarm size (S), the dimensions of the particles (N), the acceleration constants (C1, C2), the inertia weight constant (W), the fitness function (F) to score particle performance in the problem domain, and the maximum number of iterations (t_max). Next, PSO randomly initializes the position and velocity vectors with the specified dimensions for all particles in the swarm. Then PSO initializes the personal best vector of each particle in the swarm with the specified dimensions and sets it to a very small value. Furthermore, PSO initializes the global best vector of the swarm with the specified dimensions and sets it to a very small value. PSO computes the fitness score of each particle using the fitness function and updates the personal best vectors of all particles and the global best vector of the swarm. After that, PSO starts the first iteration by computing r1 and r2 randomly, and then it updates the velocity and position vectors of each particle according to (1) and (2), respectively. In addition, PSO computes again the fitness score of each particle according to the given fitness function and updates the personal best vector of each particle if the fitness score of that particle at this iteration is bigger than the fitness

Figure 2: The flowchart of the standard PSO.

score of the personal best vector of that particle ($F(P_i^t) > F(P_i^{best})$). Also, PSO updates the global best vector of the swarm if any of the fitness scores of the personal best vectors of the particles is bigger than the fitness score of the global best vector of the swarm ($F(P_i^{best}) > F(G_{best})$, i = 1 to S). Then PSO checks the stop criterion; if one is satisfied, PSO outputs the global best vector as the optimal solution and terminates. Otherwise, PSO proceeds to the next iteration and repeats the same procedure described for the first iteration, until a stop criterion is reached.

The stop criterion is satisfied when either the training error is smaller than a predefined value (ε) or the maximum number of iterations is reached. Finally, PSO performs better than GA in terms of simplicity and generality [36]. PSO is simpler than GA because it contains only one operator and is easy to implement. The generality of PSO means that it does not need any modifications to be applied to any optimization problem; it also converges faster to the optimal solution, which decreases the computations and saves resources.

4.2. DNN Hyperparameters Selection Using PSO. The selection of the hyperparameters of a DNN can be interpreted as an optimization task; hence the main objective is to minimize the loss function L(M, T), where M is the DNN model and T is the training set. To achieve this goal, we selected PSO as our optimization algorithm, which outputs the vector of the optimized hyperparameters H that minimizes the loss function L, after constructing the DNN model M, which is tuned by the hyperparameters H and trained on the training set T. The fitness function of our PSO-based algorithm is a function $F^*: R^N \rightarrow R$ that maps a real-valued vector of hyperparameters of length N to the real-valued accuracy of the trained DNN that is tuned by that hyperparameters vector and tested on the test set Z. In other words, our PSO-based algorithm finds the optimal hyperparameters vector among all possible combinations of hyperparameters, which maximizes the accuracy of the trained DNN on the test set. Furthermore, to ensure the generality of our PSO-based algorithm, meaning that it is independent of the DNN to be optimized and can easily be adapted to any classification task using a DNN, we allow the user to select which hyperparameters to use in his work. Therefore, the user is responsible, when using our algorithm, for defining the number of hyperparameters as well as the type and domain of each parameter. The domain of a parameter is the set of all possible values of that parameter. After that, our PSO-based algorithm uses a special built-in generator that depends on the number and domains of the defined parameters to initialize all the particles (hyperparameters vectors) in the swarm.

During the execution of the proposed algorithm, at each iteration, a validation process is involved to validate that the updated position and velocity vectors fit the predefined ranges of the parameters. Finally, in order to reduce computations and converge faster, two different stop conditions are checked simultaneously at the end of each iteration. The first occurs when the fitness score of the global best vector has increased by less than a threshold ε, which is specified by the user. The aim of this condition is to guarantee that the global best vector cannot be improved further, even if the maximum number of iterations has not been reached yet. The second condition happens when the maximum number of iterations has been carried out. When either the first or the second condition is satisfied, the proposed algorithm outputs the global best vector as the optimal solution H and terminates the search process. Figure 3 shows the flowchart of our PSO-based DNN hyperparameters selection algorithm.

4.3. Algorithm Steps

Inputs: Number of hyperparameters (N), swarm size (S), acceleration constants (C1, C2), inertia constant (W), maximum velocity value (V_max), minimum velocity value (V_min), maximum number of iterations (t_max), evolution threshold (ε), training set (T), and test set (Z).
Output: The optimal solution H.
Procedure:

Step 1. For k ← 1 to N:
    Let h_k be the k-th hyperparameter.
    If the domain of h_k is continuous, then:
        let B_k^low be the lower bound of h_k and B_k^up be the upper bound of h_k;
        let the user enter the lower and upper bounds of the hyperparameter h_k.
    End of if

Figure 3: The flowchart of the proposed algorithm (preprocessing, initialization, evolution, and finishing phases).

    Else:
        Let Y_k be the set of all possible values of h_k.
        Let the user enter all elements of the set Y_k.
    End of else
End of for

Step 2. Let F* be the fitness function, which constructs a DNN tuned with the given hyperparameters, then trains the DNN on T and tests it on Z. Finally, F* computes the accuracy of the DNN as its output.

Step 3. Let G_best be the global best vector of the swarm, of length N.
    Let GS be the best fitness score of the swarm.
    GS ← −∞

Step 4. For i ← 1 to S:
    Let P_i be the position vector of the i-th particle, of length N.
    Let V_i be the velocity vector of the i-th particle, of length N.
    Let P_i^best be the personal best vector of the i-th particle, of length N.
    Let PS_i be the fitness score of the personal best vector of the i-th particle.
    For j ← 1 to N:
        If the domain of h_j is continuous, then select h_j uniformly distributed: P_i[j] ← U(B_j^low, B_j^up). End of if
        Else select h_j randomly: P_i[j] ← RAND(Y_j). End of else
        V_i[j] ← U(V_min, V_max)
    End of for
    P_i^best ← P_i
    Let FS_i be the fitness score of the i-th particle.
    FS_i ← F*(P_i)
    PS_i ← FS_i
    If FS_i > GS, then:
        G_best ← P_i
        GS ← FS_i
    End of if
End of for

Step 5. Let GS_prv be the previous best fitness score of the swarm.
GS_prv ← GS
Let r1 and r2 be the random values of PSO, and let t be the current iteration.
For t ← 1 to t_max:
    r1 ← U(0, 1)
    r2 ← U(0, 1)
    For i ← 1 to S:
        Update V_i according to (1).
        Update P_i according to (2).
        FS_i ← F*(P_i)
        If FS_i > PS_i, then:
            P_i^best ← P_i
            PS_i ← FS_i
        End of if
        If PS_i > GS, then:
            G_best ← P_i^best
            GS ← PS_i
        End of if
    End of for
    If GS − GS_prv < ε, then go to Step 6. End of if

    GS_prv ← GS
End of for

Step 6. Let H be the optimal hyperparameters vector.
H ← G_best
Return H and terminate.

Table 4: PSO parameters recommended values or ranges.

Parameter | Value/Range
S         | [5, 20]
V_min     | 0
V_max     | 1
C1        | 2
C2        | 2
W         | [0.4, 0.9]
t_max     | [30, 50]
ε         | 0.0001

4.4. PSO Parameters. The selection of the values of the PSO parameters (S, V_max, V_min, C1, C2, W, t_max, ε) is a very complex process. Fortunately, many empirical and theoretical studies have been published to solve this problem [37–40]. They introduced recommended values of the PSO parameters that can be adopted. Table 4 shows every PSO parameter and the corresponding recommended value or range. Thus, for those parameters which have recommended ranges, we can select a value for each parameter from its range randomly and fix it as a constant during the execution of PSO.

5. Experimental Setup and Models

This section explains the methodology of performing our empirical experiments as well as the description of the deep learning models which we used to detect masquerades. As mentioned in Section 3, we selected three UNIX command line-based datasets (SEA, Greenberg, PU). Each of these datasets is a collection of text files, in which each text file represents a user. The text file of each user in the particular dataset contains a set of UNIX commands that were issued by that user. This reflects the fact that these datasets do not contain any real masqueraders. However, to simulate masqueraders and to use these datasets in masquerade detection, special data configurations must be implemented prior to proceeding with our experiments. According to Section 3 and its subsections, each dataset has two different types of data configurations. Therefore, we obtained six data configurations, each of which will be observed separately, which yields six independent experiments for each model. Finally, masquerade detection can be applied to these data configurations by following two different main approaches, namely, static classification and dynamic classification. The two subsequent subsections present the difference between them as well as which deep learning models are exploited for each one.

5.1. Static Classification Approach. In the static classification approach, the classification task is carried out using a dataset of samples which are represented by a set of static features [30]. These static features are defined according to the nature of the task where the classification will be applied. In addition, the dataset samples, also called observations, are collected manually by some experts working in the field of that classification task. After that, these samples are split into two independent sets, known as the training and test sets, to train and test the selected model, respectively. The static classification approach has pros and cons as well. Although it provides a faster and easier solution, it requires a ready-to-use dataset with static features. Such a dataset might not be available for some complex classification tasks; hence the attempt to create a dataset with static features would be a hard mission. In our work, we decided to utilize the existence of three famous UNIX command line-based datasets to implement six different data configurations. Each user in the particular data configuration has a specific number of blocks, which are represented by a set of static features. Indeed, these features are the user's UNIX commands, in charge of describing the behavior of that user and later helping the classifier to detect masquerades. We decided to use two well-known deep learning models, namely, Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN), to accomplish the static masquerade detection task on the implemented six data configurations.

5.1.1. Deep Neural Networks. In Section 4, we explained in detail the DNN structure and the problem of the selection of its hyperparameters. We also proposed a PSO-based algorithm to obtain the optimal hyperparameters vector that maximizes the accuracy of the DNN on the given training and test sets. In this subsection, we describe how we utilized the proposed PSO-based algorithm and the DNN in the static masquerade detection task using the six data configurations, which are SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched. Every data configuration has its own structure and a specific number of users, as described in Section 3. So we will have six separate DNN-experiments, and each experiment will be on one of the data configurations.

The methodology of our DNN-experiments consists of four consecutive stages, which are the initialization, optimization, results extraction, and finishing stages. The first stage is to initialize all required operating parameters as well as to prepare the particular data configuration's files, in which each file represents a user in that data configuration. The user file consists of the training set followed by the test set of that user. We set all PSO parameters for all DNN-experiments as follows: S=20, V_min=0, V_max=1, C1=C2=2, W=0.9, t_max=30, and ε=10^-4. The last step in the initialization stage is to define the hyperparameters of the DNN and their domains. We used twelve different DNN hyperparameters (N=12). Table 5 shows each DNN hyperparameter and its corresponding defined domain. All the used hyperparameters are numerical, except that the Optimizer, Layer type, Initialization function, and Activation function hyperparameters are categorical. In this case, a list of all possible values is indexed to a sequence-numbered range from 1 to the length of that list. The Optimizer list includes the elements Adagrad, Nadam, Adam, Adamax,


Table 5: The used DNN hyperparameters and their domains.

Hyperparameter                     | Domain        | Description
Learning rate                      | [0.01, 0.9]   | Continuous
Momentum                           | [0.1, 0.9]    | Continuous
Decay                              | [0.001, 0.01] | Continuous
Dropout rate                       | [0.1, 0.9]    | Continuous
Number of hidden layers            | [1, 10]       | Discrete with step = 1
Number of neurons per hidden layer | [1, 100]      | Discrete with step = 1
Number of epochs                   | [5, 20]       | Discrete with step = 5
Batch size                         | [100, 1000]   | Discrete with step = 50
Optimizer                          | [1, 6]        | Discrete with step = 1
Initialization function            | [1, 8]        | Discrete with step = 1
Layer type                         | [1, 2]        | Discrete with step = 1
Activation function                | [1, 8]        | Discrete with step = 1

RMSprop, and SGD. The Layer type list contains two elements, which are Dropout and Dense. The Initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the Activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
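To illustrate how a particle's position vector can be decoded into a concrete network, the following is a minimal Keras sketch of the fitness function F* over a simplified subset of the Table 5 hyperparameters (the full algorithm uses all twelve); the vector layout and helper names are our own assumptions, not the authors' exact implementation:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adagrad, Nadam, Adam, Adamax, RMSprop, SGD

OPTIMIZERS = [Adagrad, Nadam, Adam, Adamax, RMSprop, SGD]          # indexed 1..6
ACTIVATIONS = ['linear', 'softmax', 'relu', 'sigmoid', 'tanh',
               'hard_sigmoid', 'softsign', 'softplus']             # indexed 1..8

def fitness(h, T_x, T_y, Z_x, Z_y):
    """F*: build a DNN tuned by the hyperparameter vector h, train it on T,
    and return its accuracy on Z. Simplified layout of h:
    (learning_rate, dropout, n_hidden, n_neurons, epochs, batch, opt_i, act_i)."""
    lr, dropout, n_hidden, n_neurons, epochs, batch, opt_i, act_i = h
    act = ACTIVATIONS[int(act_i) - 1]
    model = Sequential()
    model.add(Dense(int(n_neurons), activation=act, input_dim=T_x.shape[1]))
    model.add(Dropout(dropout))
    for _ in range(int(n_hidden) - 1):
        model.add(Dense(int(n_neurons), activation=act))
        model.add(Dropout(dropout))
    model.add(Dense(1, activation='sigmoid'))   # binary: normal vs. masquerader
    model.compile(optimizer=OPTIMIZERS[int(opt_i) - 1](lr=lr),
                  loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(T_x, T_y, epochs=int(epochs), batch_size=int(batch), verbose=0)
    return model.evaluate(Z_x, Z_y, verbose=0)[1]   # accuracy on the test set
```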

The optimization and results extraction stages are performed once for each user in the particular data configuration; that is, they are repeated for each user U_i, i = 1, 2, ..., M, where M is the number of users in the particular data configuration D. The optimization stage starts by splitting the data of the user U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user, respectively. The splitting process follows the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in [0, 1]. After that, we supplied these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector H_i for the i-th user. In addition, we save a copy of the H_i values in a database in order to save time and use them again in the RNN-experiment of that particular data configuration D, as will be presented in Section 5.1.2. The results extraction stage takes place by constructing the DNN that is tuned by H_i, training the DNN on T_i, and testing the DNN on Z_i. The values of the classification outcomes True Positive (TP_i), False Positive (FP_i), True Negative (TN_i), and False Negative (FN_i) for the i-th user in the particular data configuration D are extracted and saved for further processing later.
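The text-to-numeric conversion and [0, 1] normalization step can be pictured as follows; the index-based encoding here is our own reading of that step, since the paper does not spell out the exact mapping:

```python
import numpy as np

def encode_blocks(blocks):
    """Map each distinct command name to an integer index and scale into [0, 1]."""
    vocab = {c: i for i, c in enumerate(sorted({c for b in blocks for c in b}))}
    X = np.array([[vocab[c] for c in block] for block in blocks], dtype=float)
    return X / max(len(vocab) - 1, 1)   # normalize indices into [0, 1]
```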

Then the next user is observed, and the same procedure of the optimization and results extraction stages is performed, until the last user in the particular data configuration D is reached. Finally, when all users in the particular data configuration are completed, the last stage (the finishing stage) is executed. The finishing stage computes the summation of all obtained TPs of all users in the particular data configuration D, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively:

$$TP = \sum_{i=1}^{M} TP_i \quad (3)$$

$$FP = \sum_{i=1}^{M} FP_i \quad (4)$$

$$TN = \sum_{i=1}^{M} TN_i \quad (5)$$

$$FN = \sum_{i=1}^{M} FN_i \quad (6)$$

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration D. The former outcomes will be used to compute ten well-known evaluation metrics to assess the performance of the DNN on the particular data configuration D, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.
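The finishing stage of (3)-(6) amounts to summing the per-user confusion matrices. A minimal sketch follows; per_user_results is a hypothetical list of (y_true, y_pred) label/prediction pairs, one per user:

```python
from sklearn.metrics import confusion_matrix

def finishing_stage(per_user_results):
    """Sum the per-user outcomes over all M users of a data configuration (Eqs. (3)-(6))."""
    TP = FP = TN = FN = 0
    for y_true, y_pred in per_user_results:
        tn_i, fp_i, fn_i, tp_i = confusion_matrix(y_true, y_pred,
                                                  labels=[0, 1]).ravel()
        TP, FP, TN, FN = TP + tp_i, FP + fp_i, TN + tn_i, FN + fn_i
    return TP, FP, TN, FN
```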

5.1.2. Recurrent Neural Networks. The Recurrent Neural Network is a special type of the traditional feed-forward Artificial Neural Network. Unlike the traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to the other neurons of the same hidden layer. Therefore, the output of the RNN's hidden layer at any time step (t) depends on the current inputs and on the output of the hidden layer at the previous time step (t-1). In the RNN, these directed cycles allow information to circulate in the network and make the hidden layers the storage unit of the whole network [41]. The important characteristics of the RNN are the capability to have memory and to generate periodical sequences. Despite these strengths, the conventional RNN structure described above has a serious problem, especially when the RNN is trained using the back-propagation technique.

Figure 4: The flowchart of the DNN-experiments.

Figure 5: The structure of an LSTM cell [6] (input x_t; input, forget, and output gates i_t, f_t, o_t; cell state c_t; hidden output h_t).

The problem is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops. On the other hand, the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem of the conventional RNN limits its use to short-term memory tasks only. To solve this problem, a new RNN architecture was proposed by Hochreiter and Schmidhuber [43], known as Long Short-Term Memory (LSTM). LSTM uses a new structure called a memory cell that is composed of four parts, which are an input gate, a neuron with a self-recurrent connection, a forget gate, and the output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter into the memory cell or block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell to alter the current state of the memory cell or prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Besides overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].

In this study, we used the LSTM-RNN model to perform the static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate LSTM-RNN-experiments, and each experiment is on one of the data configurations. The methodology of all of these experiments is the same and is as follows. For the given data configuration D, we first prepared all of the data configuration's files by converting all blocks from text to numerical values and then normalizing them in [0, 1]. Next, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps. We split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. After that, we retrieved the stored optimized hyperparameters vector of the i-th user (H_i) from the database that was created in the previous DNN-experiments. Then we constructed the RNN model that is tuned by H_i. In order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced with an LSTM memory cell. The constructed LSTM-RNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the i-th user in D. Then we proceed to the next user in D and do the same previous steps, until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.
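As an illustration of the neuron-to-LSTM-cell replacement, a minimal Keras sketch is given below; it reuses the hidden-layer count and sizes that H_i would supply, while the remaining details (the optimizer string and the sequence shape) are our own simplifying assumptions:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_lstm_rnn(n_hidden, n_neurons, block_size, n_features):
    """Stacked LSTM layers in place of the DNN's dense hidden layers."""
    model = Sequential()
    model.add(LSTM(n_neurons, input_shape=(block_size, n_features),
                   return_sequences=(n_hidden > 1)))
    for k in range(1, n_hidden):
        # intermediate layers emit full sequences; the last one emits a vector
        model.add(LSTM(n_neurons, return_sequences=(k < n_hidden - 1)))
    model.add(Dense(1, activation='sigmoid'))   # normal (0) vs. masquerader (1)
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```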

5.2. Dynamic Classification Approach. In contrast to the static classification approach, the dynamic classification approach does not need a ready-to-use dataset with static features [30]. It deals directly with raw data sources, such as text, image, video, sound, and signal files, and extracts features from them dynamically. The models that use this approach try to learn and represent features in an unsupervised manner. Then these models train themselves using the extracted features to be able to classify unseen data. Deep learning models fit very well with this approach, because the main strengths of deep learning models are the strong ability of automatic feature extraction and self-learning. Besides overcoming the problem of the lack of datasets, dynamic classification models often perform more efficiently than static classification models. Despite these advantages, the dynamic classification approach also has drawbacks. Dynamic classification models are slower and take a long time to train compared with static classification models, due to the complex deep structure of these models as well as the huge amount of computations that are required to execute them.

Figure 6: The flowchart of the LSTM-RNN-experiments.

Furthermore, dynamic classification models require a very large amount of input samples to gain high accuracy values.

In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection on these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of the two classes, which will be either a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Furthermore, sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. According to the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among the machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of research has been accomplished in the literature in the field of text classification using deep learning models. It started with LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN), known as the LeNet family, and used it in text classification efficiently [44]. Then various studies were published to introduce text classification algorithms as well as the factors that impact the performance [45–47]. In the study [48], the CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchical approach called Hierarchical Deep Learning for Text classification (HDLTex) is proposed, and three deep structures, which are DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced in [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification with a multitask learning framework [53]. The study [54] proposed a new model called the hierarchical attention network for document classification, and it is tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the most used deep learning model for text classification tasks. So we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model which is biologically inspired by the animal visual cortex. The CNN can be considered a special type of the traditional feed-forward Artificial Neural Network. The major difference between ANN and CNN is that, instead of the fully connected architecture of ANN, the individual neurons in CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data is entered into the CNN. The first convolutional layer in the CNN consists of individual neurons that are each connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters; each filter is applied to the specified subset of their preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input.

Figure 7: The architecture of the used CNN model (user's command text files, quantization, input layer, six convolution/max-pooling pairs C1/P1 through C6/P6, two fully connected dropout layers of 2048 sigmoid neurons each, two softmax neurons in the output dense layer; output: 0 (normal) / 1 (masquerader)).

The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of their preceding layer's output. There are many pooling nonlinear functions that can be used, but among them max-pooling is the mostly used; it selects the maximum value in the given pooling window. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked convolutional layer and max-pooling layer pairs to extract features from the entire input and then map these features to their next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to hidden layers in the DNN. This means that neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of CNN proposed in the literature, but the LeNet structure, which is proposed by LeCun et al. [44], is the most common approach used in many applications of computer vision and text classification.

Regarding its stability and high efficiency in text classification, we selected the CNN model which is proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place: the used model encodes all letters in the input text file using a one-hot representation from a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the dropout type with a dropout probability equal to 0.5. In addition to that, the two fully connected layers have a Sigmoid nonlinear activation function as well as the same size of 2048 neurons each. The output layer in the used CNN model is of the dense type; it has a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the

used CNN model: learning rate = 0.01, epochs = 30, and batch size = 64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model; it is reproduced from Zhang et al. (2015) [50], under the Creative Commons Attribution License (public domain).
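To make the described architecture concrete, the following is a minimal Keras sketch of such a character-level CNN. The 70-character one-hot alphabet, the six convolution/max-pooling pairs, the two 2048-neuron sigmoid dropout layers, the two-neuron softmax output, and the SGD settings come from the text above; the filter counts, kernel sizes, and input length are our assumptions, since the paper does not list them.

```python
# A minimal sketch of the described character-level CNN; filter counts,
# kernel sizes, and MAX_LEN are assumptions not stated in the paper.
from tensorflow.keras import layers, models, optimizers

ALPHABET_SIZE = 70   # one-hot alphabet used in the quantization step
MAX_LEN = 1014       # assumed maximum input length in characters

def build_char_cnn():
    model = models.Sequential()
    model.add(layers.InputLayer(input_shape=(MAX_LEN, ALPHABET_SIZE)))
    # Six convolution + max-pooling pairs, all with ReLU activations.
    for filters, kernel in [(256, 7), (256, 7), (256, 3),
                            (256, 3), (256, 3), (256, 3)]:
        model.add(layers.Conv1D(filters, kernel, activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.Flatten())
    # Two fully connected dropout layers of 2048 sigmoid neurons each (p = 0.5).
    for _ in range(2):
        model.add(layers.Dense(2048, activation="sigmoid"))
        model.add(layers.Dropout(0.5))
    # Output dense layer: two softmax neurons (normal user vs. masquerader).
    model.add(layers.Dense(2, activation="softmax"))
    model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_char_cnn()
# model.fit(x_train, y_train, epochs=30, batch_size=64)  # settings from the text
```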

In our work, we used a CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So, we have six separate CNN-experiments, and each experiment is run on one of the data configurations. The methodology of all of these experiments is the same and is as follows. For the given data configuration D, we firstly prepared all the data configuration's text files such that each file represents the training and test sets of a user in D. Next to that, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we also moved each block in the training and test sets of the user U_i to a separate text file. This means that each of the training and test sets of the user U_i consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then, we proceed to the next user in D and repeat the same steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.
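As a small illustration of the aggregation step at the end of this loop, the sketch below sums hypothetical per-user outcomes into the overall outcomes of a data configuration; summing is our reading of (3)-(6), which appear in an earlier section of the paper.

```python
# Aggregating per-user outcomes TP_i, FP_i, TN_i, FN_i into the overall
# TP, FP, TN, FN of a data configuration D. The per-user values below are
# hypothetical; summing them is our reading of (3)-(6).
per_user_outcomes = [
    {"TP": 27, "FP": 1, "TN": 98, "FN": 4},   # user U_1 (hypothetical)
    {"TP": 22, "FP": 0, "TN": 99, "FN": 9},   # user U_2 (hypothetical)
    {"TP": 30, "FP": 2, "TN": 97, "FN": 1},   # user U_3 (hypothetical)
]

overall = {key: sum(user[key] for user in per_user_outcomes)
           for key in ("TP", "FP", "TN", "FN")}
print(overall)   # {'TP': 79, 'FP': 3, 'TN': 294, 'FN': 14}
```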

6 Results and Discussion

We carried out three major empirical experiments, which are the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.


Figure 8: The flowchart of the CNN-experiments.

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class     Predicted Class
                 Normal User    Masquerader
Normal User      TN             FP
Masquerader      FN             TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with a TensorFlow 1.6 [61, 62] backend over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition to that, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes for that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that usually are used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The definitions of the used evaluation metrics and their corresponding equations are as follows.

(i) Accuracy shows the rate of true detection over the whole test set:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7} \]

(ii) Precision shows the rate of correctly classified masqueraders among all blocks in the test set that are classified as masqueraders:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{8} \]

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{9} \]

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

\[ F_1\text{-}Score = \frac{2}{1/P + 1/R} \tag{10} \]

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

\[ \text{Hit Rate} = \frac{TP}{TP + FN} \tag{11} \]

(vi) Miss Rate is the complement of the Hit Rate (Miss = 100 - Hit); that is, it shows the rate of masquerade blocks that are misclassified as a normal user among all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

\[ \text{Miss Rate} = \frac{FN}{FN + TP} \tag{12} \]


(vii) False Alarm Rate (FAR) gives the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

\[ \text{False Alarm Rate} = \frac{FP}{FP + TN} \tag{13} \]

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

\[ \text{Cost} = MR + 6 \times FAR \tag{14} \]

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (the base rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

\[ \text{BDR} = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^{*}) \times P(A \mid I^{*})} \tag{15} \]

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior really indicates a normal user [65]. With I, I*, A, and A* defined as above, BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

\[ \text{BTNR} = P(I^{*} \mid A^{*}) = \frac{P(I^{*}) \times P(A^{*} \mid I^{*})}{P(I^{*}) \times P(A^{*} \mid I^{*}) + P(I) \times P(A^{*} \mid I)} \tag{16} \]

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 - FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

\[ g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \tag{17} \]

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of -1 to 1, where -1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

\[ MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \tag{18} \]
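Because all twelve metrics are functions of the four outcomes, they can be computed in one place. The following is a minimal Python sketch mirroring (7)-(18); the rates are kept in [0, 1] rather than percentages, and the example outcome values at the bottom are hypothetical.

```python
# Computing the twelve evaluation metrics of (7)-(18) from TP, FP, TN, FN.
# All rates are in [0, 1]; the example values at the bottom are hypothetical.
import math

def evaluation_metrics(TP, FP, TN, FN):
    P = TP / (TP + FP)                     # Precision (8)
    R = TP / (TP + FN)                     # Recall (9) == Hit Rate (11)
    miss = FN / (FN + TP)                  # Miss Rate (12)
    far = FP / (FP + TN)                   # False Alarm Rate (13)
    p_i = (TP + FN) / (TP + FP + TN + FN)  # base rate of masquerader blocks
    p_n = 1 - p_i                          # base rate of normal blocks
    return {
        "Accuracy": (TP + TN) / (TP + TN + FP + FN),               # (7)
        "Precision": P, "Recall": R, "Hit": R, "Miss": miss,
        "F1-Score": 2 / (1 / P + 1 / R),                           # (10)
        "FAR": far,
        "Cost": miss + 6 * far,                                    # (14)
        "BDR": p_i * R / (p_i * R + p_n * far),                    # (15)
        "BTNR": p_n * (1 - far) / (p_n * (1 - far) + p_i * miss),  # (16)
        "g-mean": math.sqrt(TP * TN / ((TP + FN) * (TN + FP))),    # (17)
        "MCC": (TP * TN - FP * FN) / math.sqrt(                    # (18)
            (TP + FN) * (TP + FP) * (TN + FP) * (TN + FN)),
    }

print(evaluation_metrics(TP=8701, FP=59, TN=9900, FN=1299))  # hypothetical outcomes
```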

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model to detect masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1 as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. The rows labeled DNN and LSTM-RNN in Table 7 show results of the static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show results of the dynamic masquerade detection using the CNN model. Furthermore, within each data configuration, the best results are those of the CNN model.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of DNN hyperparameters so as to maximize the accuracy, which means that the sum of the TP and TN outcomes will be increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN leads to an increase of the value of Hit as well as a decrease of the value of FAR. Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than


Table 7: The results of our experiments (all evaluation metrics in %).

Dataset            Data Configuration   Model      Accuracy  Precision  Recall  F1-Score  Hit    Miss   FAR   Cost   BDR    BTNR   g-mean  MCC
SEA Dataset        SEA                  DNN        98.08     76.26      84.85   80.33     84.85  15.15  1.28  22.83  76.25  99.26  91.52   79.45
                                        LSTM-RNN   98.52     82.30      86.58   84.39     86.58  13.42  0.90  18.83  82.33  99.34  92.63   83.64
                                        CNN        98.84     87.77      87.01   87.39     87.01  12.99  0.59  16.51  87.72  99.37  93      86.78
                   SEA 1v49             DNN        96.54     99.98      96.43   98.17     96.43  3.57   0.48  6.47   99.98  52.04  97.96   70.64
                                        LSTM-RNN   97.86     99.98      97.79   98.87     97.79  2.21   0.38  4.48   99.98  63.70  98.7    78.74
                                        CNN        98.78     99.99      98.74   99.36     98.74  1.26   0.19  2.40   99.99  75.51  99.27   86.22
Greenberg Dataset  Greenberg Truncated  DNN        93.97     92.23      80.67   86.06     80.67  19.33  2.04  31.57  92.22  94.41  88.89   82.53
                                        LSTM-RNN   94.72     94.88      81.53   87.70     81.53  18.47  1.32  26.39  94.87  94.68  89.7    84.76
                                        CNN        95.43     96.16      83.53   89.40     83.53  16.47  1.0   22.47  96.16  95.24  90.94   86.86
                   Greenberg Enriched   DNN        97.57     96.92      92.40   94.61     92.40  7.60   0.88  12.88  96.92  97.75  95.7    93.08
                                        LSTM-RNN   97.98     97.57      93.60   95.54     93.60  6.40   0.70  10.60  97.56  98.10  96.41   94.28
                                        CNN        98.60     98.55      95.33   96.92     95.33  4.67   0.42  7.19   98.55  98.61  97.43   96.03
PU Dataset         PU Truncated         DNN        81.0      99.59      78.61   87.86     78.61  21.39  2.25  34.89  99.59  39.49  87.66   54.63
                                        LSTM-RNN   82.19     99.69      79.89   88.70     79.89  20.11  1.75  30.61  99.68  41.10  88.6    56.46
                                        CNN        83.75     99.74      81.64   89.79     81.64  18.36  1.50  27.36  99.73  43.38  89.68   58.79
                   PU Enriched          DNN        90.44     99.84      89.21   94.23     89.21  10.79  1.0   16.79  99.84  56.72  93.98   70.64
                                        LSTM-RNN   91.31     99.88      90.18   94.78     90.18  9.82   0.75  14.32  99.88  59.08  94.61   72.61
                                        CNN        93.75     99.92      92.93   96.30     92.93  7.07   0.50  10.07  99.92  66.78  96.16   78.52

the corresponding values of the SEA data configuration, the Hit values in SEA 1v49 are dramatically increased for all models, by 10%-14% over those in the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49 compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models in terms of all used evaluation metrics over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared with only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened in the PU dataset, where its PU Enriched data configuration has better results for all models than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, for all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Due to that, enriched data configurations help models build a user's behavior profile more accurately than truncated data configurations do. For all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted for all used models from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset as well as across all datasets.

Taking a close look at Figures 9 and 10, we can notice the stability of the deep learning models, in the sense that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain that, we will discuss the obtained results from the perspective

Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient.

of static and dynamic masquerade detection techniques. We used the DNN and LSTM-RNN models to perform a static masquerade detection task on data configurations with static numeric features. Both the DNN and the LSTM-RNN are supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Given this fact, our DNN and LSTM-RNN models output masquerade detection outcomes that are as good as they can reach for every user in the particular data configuration. Accordingly, their performance is enhanced significantly on that data configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. In any case, LSTM-RNN performed better than DNN in terms of all used evaluation metrics on all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is the command text files of each user in the particular data configuration. The obtained results show clearly that the CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model which extracts and learns features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then presented to its fully connected layers, which train themselves to build the user's normal profile; this profile is used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, such as Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, the dynamic masquerade detection technique is better than the static one. This gives the impression that dynamic masquerade detection is the best choice for masquerade detection on UNIX command line-based datasets, due to the fact that these datasets are originally textual, and converting them to static numeric datasets may lose a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used

Figure 10: Evaluation metrics comparison for the average performance of the models on datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient.


Table 8: The results of statistical tests.

             Friedman Test     Wilcoxon Test
Measurement  FS      FC        p1: W, P-value    p2: W, P-value    p3: W, P-value
TP           12      7         0, 0.0025         0, 0.0025         0, 0.0025
FP           12      7         0, 0.0025         0, 0.0025         0, 0.0025
TN           12      7         0, 0.0025         0, 0.0025         0, 0.0025
FN           12      7         0, 0.0025         0, 0.0025         0, 0.0025

deep learning models got high g-mean percentages for all data configurations. The same holds for the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume your data comes from a particular distribution. In our case, we have three repeated treatments (k=3), one for each of the used deep learning models, and six subjects (N=6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k=2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N=6) in every treatment and three paired groups, namely, p1=(DNN, LSTM-RNN), p2=(DNN, CNN), and p3=(LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α=0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
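As an illustration, both tests can be run with SciPy as in the sketch below; the per-configuration scores are hypothetical stand-ins for the models' measurement values over the six data configurations.

```python
# A minimal sketch of the Friedman and pairwise Wilcoxon tests described above.
# The six per-configuration scores per model are hypothetical stand-ins.
from scipy.stats import friedmanchisquare, wilcoxon

dnn      = [8485, 9643, 8067, 9240, 7861, 8921]   # hypothetical scores
lstm_rnn = [8658, 9779, 8153, 9360, 7989, 9018]
cnn      = [8701, 9874, 8353, 9533, 8164, 9293]

# Friedman test over the three repeated treatments (k=3, N=6).
fs, p = friedmanchisquare(dnn, lstm_rnn, cnn)
print(f"Friedman statistic = {fs:.2f}, p = {p:.4f}")

# Pairwise Wilcoxon signed-rank tests for p1, p2, and p3.
pairs = {"p1": (dnn, lstm_rnn), "p2": (dnn, cnn), "p3": (lstm_rnn, cnn)}
for name, (a, b) in pairs.items():
    w, p = wilcoxon(a, b)
    print(f"{name}: W = {w}, p = {p:.4f}")
```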

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS>FC. This means that the scores of the used deep learning models for each measurement are different. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025<0.05) in all cases. Thus, we have statistically significant evidence that the medians of every paired group are different. Finally, the reason for the identical results across all measurements is that the models, in the order (CNN, LSTM-RNN, DNN), have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.
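For reference, the CD value can be checked with the Nemenyi post-hoc formula from Demšar [72], CD = q_alpha * sqrt(k(k+1)/(6N)); the small sketch below does this, with the q value taken from the standard table.

```python
# A small check of the Critical Difference value used for Figure 11, following
# Demsar's Nemenyi post-hoc formula CD = q_alpha * sqrt(k*(k+1)/(6*N)).
# The tiny gap to the reported 1.3533 comes from rounding of q.
import math

k, N = 3, 6          # three models, six data configurations
q_005 = 2.343        # studentized-range-based critical value for k=3, alpha=0.05
cd = q_005 * math.sqrt(k * (k + 1) / (6 * N))
print(f"CD = {cd:.4f}")   # -> CD = 1.3527
```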

Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations.

Figures 12(a)-12(e) show a comparison between the performance of the traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN as well as the feature learning mechanism used with CNN. In addition to that, deep learning models have deeper structures than traditional machine learning models. In most cases, the used deep learning models considerably increased Hit percentages by 2%-10% and decreased FAR percentages by 1%-10% compared with the traditional machine learning models.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, which means that 50% of performance is achieved. The top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.

Figure 12: Models performance comparison for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched.

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, have effective masquerade detection performance over all data configurations. Moreover, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure to compare quantitatively between various ROC curves [73]. The AUC value of a ROC curve should be between 0 and 1, and the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the used three deep learning models, which are plotted in Figure 13.
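For reference, a ROC curve and its AUC can be computed with scikit-learn as in the minimal sketch below; the block labels and classifier scores are synthetic placeholders, not the paper's data.

```python
# A minimal sketch of the ROC/AUC computation described above, using
# scikit-learn; y_true and y_score are synthetic per-block labels and scores.
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                               # 0 = normal, 1 = masquerader
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.3, 1000), 0, 1)   # synthetic scores

fpr, tpr, _ = roc_curve(y_true, y_score)   # False/True Positive Rates per threshold
print(f"AUC = {auc(fpr, tpr):.4f}")        # area under the ROC curve
```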

We can notice clearly that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.

7 Conclusions

Masquerade detection is one of the most important issues in the computer security field. Although various research studies have focused on masquerade detection for more than one


Table 9: AUC values of ROC curves of the used models.

Model        AUC
DNN          0.9246
LSTM-RNN     0.9385
CNN          0.9617

Figure 13: ROC curves of the average performance of the used models over all data configurations.

decade, deep studies in that field utilizing deep learning models are seldom found. In this paper, we presented an extensive empirical study for masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most commonly used in the literature, and implemented six different data configurations from these datasets. The masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed by using the DNN and LSTM-RNN models, which are applied on data configurations with static numeric features, whereas the dynamic approach is performed by using the CNN model, which extracts features from a user's command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models performed well in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static detection. However, the results analyses proved the effectiveness of all used models in masquerade detection, in the sense that they increased Accuracy and Hit as well as decreased FAR percentages by 1%-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon request from its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219-228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An svm-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625-628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.

[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676-3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824-827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951-2959, USA, December 2012.

[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141-146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69-74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659-1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481-488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864-1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945-1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.

[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267-2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org.

[58] "NumPy," http://www.numpy.org.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io.

[61] M. Abadi, A. Agarwal, P. Barham, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] "TensorFlow," https://www.tensorflow.org.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, JSTOR, vol. 1, no. 6, pp. 80-83, 1945.

[72] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.


Page 3: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

Security and Communication Networks 3

Table 1 Best results of the related works

Model Dataset Configuration Hit () FAR ()HMM SEA SEA 60 1

Naive BayesSEA SEA 79 2

SEA 1v49 628 46

Greenberg Greenberg Truncated 709 47Greenberg Enriched 821 57

Conditional NaiveBayes

SEA SEA 84 88SEA 1v49 907 1

Greenberg Greenberg Enriched 8413 94PU PU Enriched 84 8

SVM

SEA SEA 826 3SEA 1v49 948 0

Greenberg Greenberg Truncated 711 6Greenberg Enriched 873 64

PU PU Enriched 60 2Tree-based PU PU Enriched 85 10

Table 2 Datasets and their characteristics

Dataset Name HostsPlatform

No ofUsers

AuditFormat Enriched Contaminated Sessions Real

Masquerades Year

SEA Unix 50 UnixCommands No Yes No No 2001

Greenberg Unix 168 UnixCommands Yes No Yes No 1988

PU Unix 8 UnixCommands Yes No Yes No 1997

Enriched data configuration In 2007 Yang et al presenteda One-Class SVM with string kernel classifier to detectmasquerade attacks [18] They tested their classifier on twoUNIX command line-based datasets namely SEA datasetand PU dataset [19] which latter is proposed by Lane andBrodley in 1997 For SEA dataset they applied their modelon SEA data configuration (Hit=62 FAR=15) and forPU dataset they applied their model on PU Enriched dataconfiguration (Hit=60 FAR=2) which is proposed [19]

In the study [17] a Naive Bayes model with updatingusers profile is introduced in 2003 on both Greenberg Trun-cated and Greenberg Enriched data configurations whereasGreenberg Truncated data configuration gave a Hit=709and a FAR=47 andGreenberg Enriched data configurationgave a Hit=821 and a FAR=57 Gebski and Wong [20]presented a tree-based model for masquerade detectionon PU Enriched data configuration (Hit=85 FAR=10)REDDY et al proposed a conditional Naive Bayes classifierto detect masquerades [21] They tested their classifier onthree different UNIX command line-based datasets namelySEA Greenberg and PU datasets For SEA dataset theyapplied their classifier on two data configurations namelySEA data configuration (Hit=84 FAR=88) and SEA 1v49data configuration (Hit=907 FAR=1) For Greenbergdataset they applied their classifier on Greenberg Enriched

data configuration (Hit=8413 FAR=94) Finally theytested their classifier on PU Enriched data configurationand they got a Hit=84 and a FAR=8 Table 1 presents asummarization of the best results of the previous works abovein terms of Hit percentage for each dataset As we can noticefrom Table 1 developing a masquerade detection models forhigher Accuracy and Hit as well as lower FAR values is still abig challenge

3 Datasets and Configurations

This section describes the datasets that we used in our studydata configurations and the methodology of training andtesting as well Indeed there are various mechanisms thatcould be used to collect information about each user tomodel his behavior and then build his normal profile such asuser command lines history graphical user interface (GUI)user file system navigation and system calls at the operatingsystem level In this paper we selected three datasets basedon UNIX command line history of users namely SEAGreenberg and PU Rather than being free and publiclyavailable on Internet they are the most commonly useddatasets in anomaly-based masquerade detection area soour results will be easily compared to previous ones Table 2shows datasets and their characteristics

4 Security and Communication Networks

31 SEA Dataset Recently published papers that focused onmasquerade detection area used this dataset SEA (SchonlauEt Al) is a free UNIX command line-based dataset [7] Theyused UNIX acct audit tool to collect commands from 50different users for several months SEA dataset contains aset of 15000 commands for every user and these commandscontain only command names issued by that user For eachuser the set of 15000 commands is divided into 150 blockseach with 100 commands The first 50 blocks for each userare considered genuine and used as a training set Theremaining 100 blocks of each user are considered as a testset Some of the test blocks are contaminated randomly withdata of other users ie each user has varying masqueraderblocks in his test set from 0 to 24 blocks Two associateddata configurations have been used with this dataset in theliterature SEA and SEA 1v49

3.1.1. SEA. This data configuration was proposed in the study [7]. A separate classifier is built for each of the 50 users. We trained each classifier to build two profiles: one profile for self-behavior, using the first 50 blocks of the particular user, and the other profile for non-self-behavior, using the (49 × 50) training blocks of the other 49 users. The test set of each user will be the same as described in Section 3.1.

3.1.2. SEA 1v49. In this configuration, we followed the same methodology proposed in research [9]. A classifier is built for each user and trained only with the first 50 training blocks of its data. On the other hand, the test set for each user consists of the first 50 training blocks of each of the other 49 users, resulting in 2450 masquerade blocks, in addition to its original normal blocks, which vary between 76 and 100 blocks.

3.2. Greenberg Dataset. This dataset was proposed in [16] and has been widely used in previous works. It contains commands collected from 168 UNIX users who used the csh shell. Users of this dataset are considered to be members of one of the following four groups: novice programmers, experienced programmers, computer scientists, and nonprogrammers. This dataset is enriched; i.e., it has sessions for each user, including information about the start and end time of the session, working directory, command names, command parameters, command aliases, and an error flag. Two associated data configurations have been used with this dataset in the literature: Greenberg Truncated and Greenberg Enriched.

3.2.1. Greenberg Truncated. In this configuration, we followed the same methodology conducted by [17]. First, we extracted the truncated command lines from the Greenberg dataset, which contain only the command names. Next, from the 168 users available in the Greenberg dataset, we randomly selected 50 users who have between 2000 and 5000 commands to act as normal users. Then, we divided the commands of each of the 50 users into blocks, each with 10 commands. The first 100 blocks of each user will be his training set, whereas the next 100 blocks will be used as a validation of self-behavior in his test set. After that, we randomly selected an additional 25 users from the remaining 118 users to act as masqueraders. Then, for each of the 50 normal users, we randomly selected 30 blocks from the masqueraders' data and inserted them at random positions in his test set, which results in a total of 130 blocks for testing.

3.2.2. Greenberg Enriched. It has the same methodology explained for Greenberg Truncated but with only one difference: for this data configuration, we extracted only the enriched command lines from the Greenberg dataset. An enriched command line means a concatenation of the command name and the command parameters entered by the user, together with any alias employed. As for the Greenberg Truncated data configuration described above, the Greenberg Enriched data configuration has, for each of the 50 normal users, 100 blocks for training and 130 blocks for testing.

3.3. PU Dataset. The Purdue University (PU) dataset was proposed in [19]. It contains sanitized commands collected from 8 different users at Purdue University over the course of up to 2 years. This dataset is enriched, which means that it contains, in addition to command names, command parameters, flags, and shell metacharacters. Furthermore, this dataset has sessions for each of the 8 users. In addition to that, the data of each user is processed into a token stream; a token here means either a command name or a command parameter. Two associated data configurations have been used with this dataset in the literature: PU Truncated and PU Enriched.

3.3.1. PU Truncated. For this configuration, we followed the same methodology used in [19]. First, we extracted only the truncated tokens from the PU dataset, i.e., the tokens that contain only command names. Next, for each of the 8 users available in the PU dataset, we divided his data into blocks, each of 10 tokens. Then, the first 150 blocks of each user are considered as his training set. After that, the next 50 blocks of each user are used as a validation of self-behavior in his test set. To simulate masquerade activities, we added to each user the other seven users' testing data (7 × 50 blocks), which results in a total of 400 blocks of testing for each of the 8 users.

3.3.2. PU Enriched. It has the same methodology explained for PU Truncated but with only one difference: for the PU Enriched data configuration, we extracted here only the enriched tokens, i.e., all tokens from the PU dataset. As for the PU Truncated data configuration described in Section 3.3.1, the PU Enriched data configuration has, for each of the 8 users, 150 blocks for training and 400 blocks for testing. Table 3 summarizes all details about the data configurations.
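To make the block construction above concrete, the following minimal Python sketch (our own illustration, not the authors' published code) segments a user's command stream into fixed-size blocks and splits them into training and self-test sets following the Greenberg-style configuration; the toy command stream and counts are assumptions.

```python
# Segment a user's command stream into non-overlapping blocks and split it
# Greenberg-style (block size 10, 100 training blocks, 100 self-test blocks).

def make_blocks(commands, block_size):
    """Group a list of commands into consecutive, non-overlapping blocks."""
    return [commands[i:i + block_size]
            for i in range(0, len(commands) - block_size + 1, block_size)]

def split_user(commands, block_size=10, n_train=100, n_self_test=100):
    blocks = make_blocks(commands, block_size)
    train = blocks[:n_train]                            # normal profile
    self_test = blocks[n_train:n_train + n_self_test]   # self-behavior validation
    return train, self_test

# Toy usage with a synthetic command stream of 2400 commands:
cmds = ["ls", "cd", "vi", "make"] * 600
train, self_test = split_user(cmds)
print(len(train), len(self_test))   # -> 100 100
```

Masquerader blocks drawn from other users would then be injected at random positions into the test set, as described for each configuration above.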

4. DNN Hyperparameters Selection

In this section, we present a Particle Swarm Optimization-based algorithm to select the hyperparameters of Deep Neural Networks (DNN). This algorithm will help us to proceed in our experiments to construct DNNs for masquerade detection, as will be explained in Section 5.1. A DNN is a multilayer Artificial Neural Network with many hidden layers. The weights of a DNN are fully connected; i.e., every neuron at any particular layer is connected to all neurons of the adjacently located higher-order layer [4].


Table 3: The structure of the used data configurations.

Characteristic                        SEA      SEA 1v49     Greenberg   Greenberg   PU          PU
                                                            Truncated   Enriched    Truncated   Enriched
Number of users                       50       50           50          50          8           8
Block size                            100      100          10          10          10          10
Blocks per user: training set         2500     50           100         100         150         150
Blocks per user: test set             100      2526~2550    130         130         400         400
Blocks per user: total                2600     2576~2600    230         230         550         550
Blocks for all users: training set    125000   2500         5000        5000        1200        1200
Blocks for all users: test set        5000     127269       6500        6500        3200        3200
Blocks for all users: total           130000   129769       11500       11500       4400        4400
Training set: normal                  2500     2500         5000        5000        1200        1200
Training set: masquerader             122500   0            0           0           0           0
Training set: total                   125000   2500         5000        5000        1200        1200
Test set: normal                      4769     4769         5000        5000        400         400
Test set: masquerader                 231      122500       1500        1500        2800        2800
Test set: total                       5000     127269       6500        6500        3200        3200

Figure 1: The basic structure of a typical DNN (input layer I1...Im, hidden layers 1...h, and output layer O1...On).

The information in a DNN is propagated in a feed-forward manner, that is, from inputs to outputs via the hidden layers. Figure 1 depicts the basic structure of a typical DNN.
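For concreteness, a DNN of the kind sketched in Figure 1 can be assembled in a few lines of Keras (the library used in the experiments below); the layer sizes and activations here are illustrative placeholders, not the optimized values selected later by the PSO-based algorithm.

```python
# A minimal fully connected feed-forward network in the spirit of Figure 1.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_shape=(100,)),  # first hidden layer
    Dense(64, activation="relu"),                      # second hidden layer
    Dense(1, activation="sigmoid"),                    # output layer
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```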

DNNs are widely used in various machine learning tasks. In addition to that, they have proved their ability to surpass most of the machine learning techniques in terms of performance [22]. However, the performance of any DNN relies on the selection of the values of its hyperparameters. DNN hyperparameters are defined as a set of critical parameters that control the architecture, behavior, and performance of that DNN in the underlying machine learning task. Indeed, there are two kinds of such hyperparameters: global parameters and layer-based parameters. The global parameters are those that define the general behavior of the DNN, such as the learning rate, number of epochs, batch size, number of layers, and the used optimizer. On the other hand, the values of layer-based parameters depend on each layer in the DNN. Examples of layer-based parameters are, but are not limited to, the type of layer, weight initialization method, activation function, and number of neurons.

The problem is that these hyperparameters vary from task to task, and they must be set before the training process. One familiar solution to this problem is to find an expert who is conversant with the underlying machine learning task to tune the DNN hyperparameters precisely. Unfortunately, such an expert is not available in all cases. Another possible solution is to adjust these hyperparameters manually in a trial-and-error manner. This can be handled by searching the space of hyperparameters by executing either grid search or random search [23, 24]. A grid search is performed upon defined ranges of hyperparameters, where those ranges are identified beforehand based on prior knowledge of the underlying task. After that, the user picks values of hyperparameters from the predefined ranges consecutively and tests the performance of the DNN on the training set. When all possible combinations of hyperparameter values are tested, the best combination is selected to configure the DNN and test it on the test set. Random search is similar to grid search, but instead of picking hyperparameter values in a methodical manner, the user selects hyperparameter values from those predefined ranges randomly. In 2012, Snoek et al. proposed a hyperparameters selection method based on Bayesian optimization [25]. In this method, the user improves his knowledge of selecting hyperparameters by using the information gained from any given experiment to decide how to adjust the hyperparameters for the next experiment. Despite the good results that have been obtained in some cases by the grid, random, and Bayesian optimization searches, in general the complexity and the large search


space of the DNN hyperparameter values make such manual algorithms infeasible and the searching process too exhausting.

Evolutionary Algorithms (EAs) are metaheuristic algorithms which perform excellently in finding the global optima of a nonlinear function, especially when there are multiple local minima or maxima. EAs are considered very promising algorithms for solving the problem of DNN parameterization automatically. In the literature, many studies have been proposed recently aiming at using EAs to optimize DNN hyperparameters in order to gain as high an accuracy value as possible. The Genetic Algorithm (GA), which is one of the most famous EAs, has been used to optimize the network parameters, and the Taguchi method is applied between the crossover and mutation operators, including initial weights definition [26]. GAs are also used in the pretraining step prior to the supervised step based on a multiclass classification task [27]. Another approach using GA to reduce the training time has been presented in [28]. The GA is used to enhance Deep Neural Networks by evolving a neural network's weights [29]. An automated GA-based approach has been proposed in [30] that optimized DNN hyperparameters for malware classification tasks. Moreover, Particle Swarm Optimization (PSO) is also one of the most well-known and popular EAs. Lorenzo et al. used PSO and proposed two approaches, the first sequential and the second parallel, to optimize the hyperparameters of any DNN [31, 32]. Then, Nalepa and Lorenzo formally proved the convergence abilities of the former two approaches and tested them separately on a single workstation and a cluster, for the sequential and parallel approaches, respectively [33]. Finally, F. Ye proposed in 2017 an automatic PSO-based algorithm to select DNN hyperparameters for large scale and high dimensional data [34]. Thus, we decided to use PSO to enable us to select the hyperparameters for the DNN automatically. Then, in Section 5.1, we will explain how to adapt this algorithm for the static classification experiments used in a masquerade detection scenario. Section 4.1 introduces a necessary and brief preface reviewing how standard PSO works; the rest of this section then presents our proposed PSO-based algorithm to optimize DNN hyperparameters.

4.1. Particle Swarm Optimization. Particle Swarm Optimization (PSO) is a metaheuristic algorithm for optimizing nonlinear functions in continuous search space. It was proposed by Eberhart and Kennedy in 1995 [35]. PSO tries to mimic the social behavior of animals. The swarm concept is a set of many members, which are called particles. The number of particles in the swarm is an integer value denoted by S and called the swarm size. Every particle in the particular swarm has two vectors of length N, where N is the number of the problem-defined variables (dimensions). The first vector is called the position vector, denoted by P, which identifies the current position of that particle in the search space of the problem. Each position vector can be considered a candidate solution of the problem. The second vector is called the velocity vector, denoted by V, which determines both the speed and direction of that particle in the search space of the problem at the next iteration. During the execution of PSO, another two vectors should be stored at every iteration. The first is called the personal best vector, denoted by P^i_best, which indicates the best position of the i-th particle in the swarm that has been explored so far. Each particle in the swarm has its own personal best vector, independent from the other particles, and it is updated at each iteration. The second vector is the global best vector, denoted by G_best, which indicates the best position that has been found over the swarm so far. There is a single global best vector for all particles in the swarm, and it is updated at every iteration. The personal best vector can be seen as the cognitive knowledge of the particle, whereas the global best vector represents the social knowledge of the swarm. Mathematically, for each particle i in the swarm S at each iteration t, the velocity V and position P vectors are updated to the next iteration t+1 according to (1) and (2), respectively.

$$V_{t+1}^{i} = W V_{t}^{i} + C_{1} r_{1}(t) \left(P_{best}^{i} - P_{t}^{i}\right) + C_{2} r_{2}(t) \left(G_{best} - P_{t}^{i}\right) \quad (1)$$

$$P_{t+1}^{i} = P_{t}^{i} + V_{t+1}^{i} \quad (2)$$

W is the inertia weight constant, which controls the impact of the velocity of the particle at the current iteration on the next iteration, so that the speed and direction of the particle are adjusted in order not to let the particle get outside the search space of the problem. Meanwhile, C_1 and C_2 are constants known as acceleration coefficients, and r_1 and r_2 are random values uniformly distributed in [0, 1]. At the beginning of every iteration, new values of r_1 and r_2 are computed randomly, and they are constant for all particles in the swarm at that iteration. The goal of using the C_1, C_2, r_1, and r_2 constants is to scale both the cognitive knowledge of the particle and the social knowledge of the swarm on the velocity changes, so the new position vectors of all particles approach the optimal solution of the problem accordingly. Figure 2 depicts the flowchart of the standard PSO.

In brief, the standard PSO works as follows. First, the user enters some required inputs, such as the swarm size (S), dimensions of the particles (N), acceleration constants (C_1, C_2), inertia weight constant (W), fitness function (F) to score particle performance in the problem domain, and the maximum number of iterations (t_max). Next, PSO randomly initializes the position and velocity vectors with the specified dimensions for all particles in the swarm. Then, PSO initializes the personal best vector of each particle in the swarm with the specified dimensions and sets it to a very small value; likewise, PSO initializes the global best vector of the swarm with the specified dimensions and sets it to a very small value. PSO computes the fitness score for each particle using the fitness function and updates the personal best vectors of all particles and the global best vector of the swarm. After that, PSO starts the first iteration by computing r_1 and r_2 randomly and then updates the velocity and position vectors of each particle according to (1) and (2), respectively. In addition to that, PSO computes the fitness score of each particle again according to the given fitness function and updates the personal best vector of each particle if the fitness score of that particle at this iteration is bigger than the fitness score of its personal best vector (F(P^i_t) > F(P^i_best)).


Figure 2: The flowchart of the standard PSO.

Also, PSO updates the global best vector of the swarm if any of the fitness scores of the personal best vectors of the particles is bigger than the fitness score of the global best vector of the swarm (F(P^i_best) > F(G_best), i = 1 to S). Then, PSO checks the stop criterion, and if one is satisfied, PSO outputs the global best vector as the optimal solution and terminates; otherwise, PSO proceeds to the next iteration and repeats the same procedure described for the first iteration above until the stop criterion is reached.

The stop criterion is satisfied when either the training error is smaller than a predefined value (ε) or the maximum number of iterations is reached. Finally, PSO performs better than GA in terms of simplicity and generality [36]. PSO is simpler than GA because it contains only one operator and is easy to implement. Also, the generality of PSO means that PSO does not need any modifications to be applied to any optimization problem; as well, it is faster to converge to the optimal solution, which decreases the computations and saves resources.
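A compact NumPy sketch of the standard PSO loop, implementing the update rules (1) and (2) with the per-iteration shared r_1 and r_2 described above, is given below; the stand-in fitness function and the search bounds are our own assumptions.

```python
# Standard PSO over a continuous search space, maximizing a fitness F.
import numpy as np

def pso(F, N, S=20, W=0.9, C1=2.0, C2=2.0, t_max=30, bounds=(-5.0, 5.0)):
    lo, up = bounds
    P = np.random.uniform(lo, up, (S, N))        # position vectors
    V = np.random.uniform(-1.0, 1.0, (S, N))     # velocity vectors
    Pbest = P.copy()
    Pbest_score = np.array([F(p) for p in P])
    g = np.argmax(Pbest_score)
    Gbest, Gbest_score = Pbest[g].copy(), Pbest_score[g]
    for t in range(t_max):
        r1, r2 = np.random.rand(), np.random.rand()   # shared by all particles
        V = W * V + C1 * r1 * (Pbest - P) + C2 * r2 * (Gbest - P)   # eq. (1)
        P = np.clip(P + V, lo, up)                                   # eq. (2)
        scores = np.array([F(p) for p in P])
        improved = scores > Pbest_score
        Pbest[improved], Pbest_score[improved] = P[improved], scores[improved]
        if Pbest_score.max() > Gbest_score:
            g = np.argmax(Pbest_score)
            Gbest, Gbest_score = Pbest[g].copy(), Pbest_score[g]
    return Gbest, Gbest_score

# Toy usage: maximize -||x||^2, whose optimum lies at the origin.
best, score = pso(lambda x: -np.sum(x ** 2), N=3)
```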

4.2. DNN Hyperparameters Selection Using PSO. The selection of the hyperparameters of a DNN can be interpreted as an optimization task; hence, the main objective is to minimize the loss function L(M, T), where M is the DNN model and T is the training set. To achieve this goal, we selected PSO as our optimization algorithm, which outputs the vector of optimized hyperparameters H that minimizes the loss function L after constructing the DNN model M, which is tuned by the hyperparameters H and trained on the training set T. The fitness function of our PSO-based algorithm is a function F*: R^N → R that maps a real-valued vector of hyperparameters of length N to the real-valued accuracy of the trained DNN that is tuned by that hyperparameters vector and tested on the test set Z. In other words, our PSO-based algorithm finds the optimal hyperparameters vector among all possible combinations of hyperparameters, which maximizes the accuracy of the trained DNN on the test set. Furthermore, to ensure the generality of our PSO-based algorithm, which means being independent of the DNN that will be optimized and easily adaptable to any classification task using a DNN, we allow the user to select which hyperparameters he wants to use in his work. Therefore, the user is responsible, when using our algorithm, for defining the number of hyperparameters as well as the type and domain of each parameter. The domain of a parameter is the set of all possible values of that parameter. After that, our PSO-based algorithm uses a special built-in generator that depends on the number and domains of the defined parameters to initialize all the particles (hyperparameters vectors) in the swarm.

During the execution of the proposed algorithm, and at each iteration, a validation process is involved to validate that the updated position and velocity vectors remain within the predefined ranges of the parameters. Finally, in order to reduce computations and converge faster, two different stop conditions are checked simultaneously at the end of each iteration. The first occurs when the fitness score of the global best vector has increased by less than a threshold ε, which is specified by the user. The aim of this condition is to guarantee that the global best vector cannot be improved further, even if the maximum number of iterations has not been reached yet. The second condition happens when the maximum number of iterations is carried out. When either the first or the second condition is satisfied, the proposed algorithm outputs the global best vector as the optimal solution H and terminates the search process. Figure 3 shows the flowchart of our PSO-based DNN hyperparameters selection algorithm.
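The fitness function F* can be sketched as follows, assuming a decoded hyperparameter vector and binary-labeled training/test arrays; only a few of the twelve hyperparameters of Table 5 are decoded here for brevity, and the index positions are illustrative, not the exact encoding used in the paper.

```python
# A hedged sketch of F*: build a DNN from a hyperparameter vector h, train
# it on (T_x, T_y), and return its accuracy on the test set (Z_x, Z_y).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

def fitness(h, T_x, T_y, Z_x, Z_y):
    lr, n_layers, n_neurons, epochs = h[0], int(h[1]), int(h[2]), int(h[3])
    model = Sequential()
    model.add(Dense(n_neurons, activation="relu", input_shape=(T_x.shape[1],)))
    for _ in range(n_layers - 1):          # remaining hidden layers
        model.add(Dense(n_neurons, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer=SGD(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(T_x, T_y, epochs=epochs, batch_size=128, verbose=0)
    _, acc = model.evaluate(Z_x, Z_y, verbose=0)
    return acc   # the PSO maximizes this score
```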

4.3. Algorithm Steps

Inputs: Number of hyperparameters (N), swarm size (S), acceleration constants (C_1, C_2), inertia constant (W), maximum value of velocity (V_max), minimum value of velocity (V_min), maximum number of iterations (t_max), evolution threshold (ε), training set (T), and test set (Z).
Output: The optimal solution H.
Procedure:

Step 1. For k ← 1 to N:
    Let h_k be the k-th hyperparameter.
    If the domain of h_k is continuous, then
        let B^k_low be the lower bound of h_k and B^k_up be the upper bound of h_k;
        let the user enter the lower and upper bounds of the hyperparameter h_k.
    End of if.
    Else
        Let Y_k be the set of all possible values of h_k.
        Let the user enter all elements of the set Y_k.
    End of else.
End of for.

Figure 3: The flowchart of the proposed algorithm, consisting of four phases: (1) preprocessing, (2) initialization, (3) evolution, and (4) finishing.

Step 2. Let F* be the fitness function, which constructs a DNN tuned with the given hyperparameters, then trains the DNN on T and tests it on Z. Finally, F* computes the accuracy of the DNN as output.

Step 3. Let G_best be the global best vector of the swarm, of length N.
Let GS be the best fitness score of the swarm; GS ← −∞.

Step 4. For i ← 1 to S:
    Let P^i be the position vector of the i-th particle, of length N.
    Let V^i be the velocity vector of the i-th particle, of length N.
    Let P^i_best be the personal best vector of the i-th particle, of length N.
    Let PS_i be the fitness score of the personal best vector of the i-th particle.
    For j ← 1 to N:
        If the domain of h_j is continuous, then
            select h_j uniformly distributed: P^i[j] ← U(B^j_low, B^j_up).
        End of if.
        Else
            select h_j randomly: P^i[j] ← RAND(Y_j).
        End of else.
        V^i[j] ← U(V_min, V_max).
    End of for.
    P^i_best ← P^i.
    Let FS_i be the fitness score of the i-th particle.
    FS_i ← F*(P^i).
    PS_i ← FS_i.
    If FS_i > GS, then
        G_best ← P^i.
        GS ← FS_i.
    End of if.
End of for.

Step 5. Let GS_prv be the previous best fitness score of the swarm.
GS_prv ← GS.
Let r_1 and r_2 be the random values in PSO.
Let t be the current iteration.
For t ← 1 to t_max:
    r_1 ← U(0, 1).
    r_2 ← U(0, 1).
    For i ← 1 to S:
        Update V^i according to (1).
        Update P^i according to (2).
        FS_i ← F*(P^i).
        If FS_i > PS_i, then
            P^i_best ← P^i.
            PS_i ← FS_i.
        End of if.
        If PS_i > GS, then
            G_best ← P^i_best.
            GS ← PS_i.
        End of if.
    End of for.
    If GS − GS_prv < ε, then
        go to Step 6.
    End of if.
    GS_prv ← GS.
End of for.

Step 6. Let H be the optimal hyperparameters vector.
H ← G_best.
Return H and terminate.

Table 4: PSO parameters recommended values or ranges.

Parameter   Value/Range
S           [5, 20]
V_min       0
V_max       1
C_1         2
C_2         2
W           [0.4, 0.9]
t_max       [30, 50]
ε           0.0001

4.4. PSO Parameters. The selection of the values of the PSO parameters (S, V_max, V_min, C_1, C_2, W, t_max, ε) is a very complex process. Fortunately, many empirical and theoretical previous studies have been published to solve this problem [37–40]. They introduced some recommended values of the PSO parameters, which can be adopted. Table 4 shows every PSO parameter and the corresponding recommended value or range. Thus, for those parameters which have recommended ranges, we can select a value for each parameter from its range randomly and fix it as a constant during the execution of PSO.

5. Experimental Setup and Models

This section explains the methodology of performing our empirical experiments as well as the description of the deep learning models which we used to detect masquerades. As mentioned in Section 3, we selected three UNIX command line-based datasets (SEA, Greenberg, PU). Each of these datasets is a collection of text files in which each text file represents a user. The text file of each user in the particular dataset contains a set of UNIX commands that are issued by that user. This reflects the fact that these datasets do not contain any real masqueraders. However, to simulate masqueraders and to use these datasets in masquerade detection, special data configurations must be implemented prior to proceeding with our experiments. According to Section 3 and its subsections, each dataset has two different types of data configurations. Therefore, we obtained six data configurations, each of which will be observed separately, which yields as a result six independent experiments for each model. Finally, masquerade detection can be applied to these data configurations by following two different main approaches, namely, static classification and dynamic classification. The two subsequent subsections present the difference between them as well as which deep learning models are exploited for each one.

5.1. Static Classification Approach. In the static classification approach, the classification task is carried out using a dataset of samples which are represented by a set of static features [30]. These static features are defined according to the nature of the task where the classification will be applied. In addition to that, the dataset samples, also called observations, are collected manually by experts working in the field of that classification task. After that, these samples are split into two independent sets, known as the training and test sets, to train and test the selected model, respectively. The static classification approach has pros and cons as well. Although it provides a faster and easier solution, it requires a ready-to-use dataset with static features. Such a dataset might not be available for some complex classification tasks; hence, the attempt to create a dataset with static features would be a hard mission. In our work, we decided to utilize the existence of three famous UNIX command line-based datasets to implement six different data configurations. Each user in the particular data configuration has a specific number of blocks which are represented by a set of static features. Indeed, these features are the user's UNIX commands, in charge of describing the behavior of that user and later helping the classifier to detect masquerades. We decided to use two well-known deep learning models, namely, Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN), to accomplish the static masquerade detection task on the implemented six data configurations.

5.1.1. Deep Neural Networks. In Section 4, we explained in detail the DNN structure and the problem of the selection of its hyperparameters. We also proposed a PSO-based algorithm to obtain the optimal hyperparameters vector that maximizes the accuracy of the DNN on the given training and test sets. In this subsection, we describe how we utilized the proposed PSO-based algorithm and the DNN in the static masquerade detection task using the six data configurations, which are SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched. Every data configuration has its own structure and a specific number of users, as described in Section 3. So we will have six separate DNN-experiments, and each experiment will be on one of the data configurations.

The methodology of our DNN-experiments consists of four consecutive stages, which are the initialization, optimization, results extraction, and finishing stages. The first stage is to initialize all required operating parameters as well as to prepare the particular data configuration's files, in which each file represents a user in that data configuration. The user file consists of the training set followed by the test set of that user. We set all PSO parameters for all DNN-experiments as follows: S=20, V_min=0, V_max=1, C_1=C_2=2, W=0.9, t_max=30, and ε=10^-4. Then, the last step in the initialization stage is to define the hyperparameters of the DNN and their domains. We used twelve different DNN hyperparameters (N=12). Table 5 shows each DNN hyperparameter and its corresponding defined domain. All the used hyperparameters are numerical, except that the Optimizer, Layer type, Initialization function, and Activation function hyperparameters are categorical. In this case, a list of all possible values is indexed to a sequence-numbered range from 1 to the length of that list. The Optimizer list includes the elements Adagrad, Nadam, Adam, Adamax, RMSprop, and SGD.


Table 5: The used DNN hyperparameters and their domains.

Hyperparameter                          Domain         Description
Learning rate                           [0.01, 0.9]    Continuous
Momentum                                [0.1, 0.9]     Continuous
Decay                                   [0.001, 0.01]  Continuous
Dropout rate                            [0.1, 0.9]     Continuous
Number of hidden layers                 [1, 10]        Discrete with step=1
Numbers of neurons of hidden layers     [1, 100]       Discrete with step=1
Number of epochs                        [5, 20]        Discrete with step=5
Batch size                              [100, 1000]    Discrete with step=50
Optimizer                               [1, 6]         Discrete with step=1
Initialization function                 [1, 8]         Discrete with step=1
Layer type                              [1, 2]         Discrete with step=1
Activation function                     [1, 8]         Discrete with step=1

The Layer type list contains two elements, which are Dropout and Dense. The Initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the Activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
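The domains of Table 5 and the categorical index lists can be encoded as plain Python structures for the built-in particle generator; the dictionary layout below is our own illustration, with Keras identifier strings assumed for the categorical values.

```python
# Categorical lists are indexed 1..len(list), as described above.
OPTIMIZERS = ["Adagrad", "Nadam", "Adam", "Adamax", "RMSprop", "SGD"]
LAYER_TYPES = ["Dropout", "Dense"]
INITIALIZERS = ["zero", "normal", "lecun_uniform", "uniform",
                "glorot_uniform", "glorot_normal", "he_uniform", "he_normal"]
ACTIVATIONS = ["linear", "softmax", "relu", "sigmoid",
               "tanh", "hard_sigmoid", "softsign", "softplus"]

# (kind, low, high[, step]) per hyperparameter, mirroring Table 5.
DOMAINS = {
    "learning_rate": ("continuous", 0.01, 0.9),
    "momentum":      ("continuous", 0.1, 0.9),
    "decay":         ("continuous", 0.001, 0.01),
    "dropout_rate":  ("continuous", 0.1, 0.9),
    "hidden_layers": ("discrete", 1, 10, 1),
    "neurons":       ("discrete", 1, 100, 1),
    "epochs":        ("discrete", 5, 20, 5),
    "batch_size":    ("discrete", 100, 1000, 50),
    "optimizer":     ("discrete", 1, len(OPTIMIZERS), 1),    # index into list
    "init_function": ("discrete", 1, len(INITIALIZERS), 1),
    "layer_type":    ("discrete", 1, len(LAYER_TYPES), 1),
    "activation":    ("discrete", 1, len(ACTIVATIONS), 1),
}
```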

The optimization and results extraction stages are performed once for each user in the particular data configuration; that is, they are repeated for each user U_i, i = 1, 2, ..., M, where M is the number of users in the particular data configuration D. The optimization stage starts by splitting the data of the user U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user, respectively. The splitting process follows the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in [0, 1]. After that, we supply these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector H_i for the i-th user. In addition to that, we save a copy of the H_i values in a database in order to save time and use them again in the RNN-experiment of that particular data configuration D, as will be presented in Section 5.1.2. The results extraction stage takes place when constructing the DNN that is tuned by H_i, training the DNN on T_i, and testing the DNN on Z_i. The values of the classification outcomes True Positive (TP_i), False Positive (FP_i), True Negative (TN_i), and False Negative (FN_i) for the i-th user in the particular data configuration D are extracted and saved for further processing later.

Then, the next user is observed, and the same procedure of the optimization and results extraction stages is performed until the last user in the particular data configuration D is reached. Finally, when all users in the particular data configuration are completed, the last stage (finishing stage) is executed. The finishing stage computes the summation of all obtained TPs of all users in the particular data configuration D, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively:

$$TP = \sum_{i=1}^{M} TP_{i} \quad (3)$$

$$FP = \sum_{i=1}^{M} FP_{i} \quad (4)$$

$$TN = \sum_{i=1}^{M} TN_{i} \quad (5)$$

$$FN = \sum_{i=1}^{M} FN_{i} \quad (6)$$

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration D. The former outcomes are used to compute twelve well-known evaluation metrics to assess the performance of the DNN on the particular data configuration D, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.
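Schematically, the per-user loop and the aggregation in (3)-(6) look like the sketch below; load_user_blocks, pso_select, and build_dnn are hypothetical helpers standing in for the stages described above, not a published API.

```python
# Per-user experiment loop with outcome aggregation over a data configuration.
def run_configuration(users):
    TP = FP = TN = FN = 0
    for user in users:
        T_x, T_y, Z_x, Z_y = load_user_blocks(user)   # split per Section 3
        H = pso_select(T_x, T_y, Z_x, Z_y)            # optimization stage
        model = build_dnn(H)                          # DNN tuned by H_i
        model.fit(T_x, T_y, verbose=0)
        pred = (model.predict(Z_x) > 0.5).astype(int).ravel()
        TP += int(((pred == 1) & (Z_y == 1)).sum())
        FP += int(((pred == 1) & (Z_y == 0)).sum())
        TN += int(((pred == 0) & (Z_y == 0)).sum())
        FN += int(((pred == 0) & (Z_y == 1)).sum())
    return TP, FP, TN, FN    # equations (3)-(6)
```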

5.1.2. Recurrent Neural Networks. The Recurrent Neural Network (RNN) is a special type of the traditional feed-forward Artificial Neural Network. Unlike the traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to the other neurons of the same hidden layer. Therefore, the output of an RNN's hidden layer at any time step (t) is computed from the current inputs and the output of the hidden layer at the previous time step (t−1). In the RNN, these directed cycles allow information to circulate in the network and make the hidden layers the storage unit of the whole network [41]. The important characteristics of the RNN are the capability to have memory and to generate periodical sequences.

Despite that, the conventional RNN structure described above has a serious problem, especially when the RNN is trained using the back-propagation technique.


Figure 4: The flowchart of the DNN-experiments, covering the four stages: (1) initialization, (2) optimization, (3) results extraction, and (4) finishing.

Figure 5: The structure of an LSTM cell [6], with input x_t, input gate i_t, forget gate f_t, cell state c_t, output gate o_t, and output h_t.

The problem is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops. On the other hand, the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem of the conventional RNN limited its use to only short-term memory tasks. To solve this problem, a new architecture of RNN was proposed by Hochreiter and Schmidhuber [43], known as Long Short-Term Memory (LSTM). LSTM uses a new structure called a memory cell that is composed of four parts, which are an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter into the memory cell or block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell to alter the current state of the memory cell or prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Besides overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].

In this study, we used the LSTM-RNN model to perform the static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them will be used in a separate experiment. So we will have six separate LSTM-RNN-experiments; each experiment will be on one of the data configurations. The methodology of all of these experiments is the same and as follows: for the given data configuration D, we first prepared all of the given data configuration's files by converting all blocks from text to numerical values and then normalizing them in [0, 1]. Next to that, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. After that, we retrieved the stored optimized hyperparameters vector of the i-th user (H_i) from the database, which was created in the previous DNN-experiments. Then, we constructed the RNN model that is tuned by H_i. In order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced with an LSTM memory cell. The constructed LSTM-RNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the i-th user in D. Then, we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.
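A minimal Keras sketch of such an LSTM-RNN — the tuned network with its hidden neurons replaced by LSTM memory cells — is shown below; the layer sizes and the shaping of a 100-command block as a length-100 sequence are illustrative assumptions, since the real sizes come from the stored vector H_i.

```python
# An LSTM-RNN for block classification: hidden layers of LSTM cells
# followed by a sigmoid output (normal user vs. masquerader).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(100, 1)),  # block as sequence
    LSTM(64),                                               # last recurrent layer
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```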

5.2. Dynamic Classification Approach. In contrast to the static classification approach, the dynamic classification approach does not need a ready-to-use dataset with static features [30]. It deals directly with raw data sources, such as text, image, video, sound, and signal files, and extracts features from them dynamically. The models that use this approach try to learn and represent features in an unsupervised manner. Then, these models train themselves using the extracted features to be able to classify unseen data. Deep learning models fit very well with this approach because the main strengths of deep learning models are the strong ability of automatic feature extraction and self-learning. Besides overcoming the problem of the lack of datasets, dynamic classification models can perform more efficiently than static classification models. Despite these advantages, the dynamic classification approach also has drawbacks. Dynamic classification models are slower and take a long time to train compared with static classification models, due to the complex deep structure of these models as well as the huge amount of computations that are required to execute them.


Figure 6: The flowchart of the LSTM-RNN-experiments.

Furthermore, dynamic classification models require a very large number of input samples to gain high accuracy values.

In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection on these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of the two classes, which will be either a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Furthermore, sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. According to the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among the other machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of research has been accomplished in the literature in the field of text classification using deep learning models. It was started by LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN), known as the LeNet family, and used it in text classification efficiently [44]. Then, various studies have been published to introduce text classification algorithms as well as the factors that impact the performance [45–47]. In the study [48], the CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchal approach called Hierarchal Deep Learning for Text classification (HDLTex) is proposed, and three deep structures, which are DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced in [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification within a multitask learning framework [53]. The study [54] proposed a new model called the hierarchal attention network for document classification, and it is tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the most used deep learning model for text classification tasks. So we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model which is biologically inspired by the animal visual cortex. The CNN can be considered a special type of the traditional feed-forward Artificial Neural Network. The major difference between the ANN and the CNN is that, instead of the fully connected architecture of the ANN, the individual neurons in the CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data is entered into the CNN. The first convolutional layer in the CNN consists of individual neurons that are each connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters; each filter is applied to the specified subset of the preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input. The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of the preceding layer's output.


Figure 7: The architecture of the used CNN model: a user's command text file is quantized at the input layer, passed through six convolution/max-pooling pairs (C1/P1 ... C6/P6), then through two fully connected dropout layers of 2048 sigmoid neurons each, and finally to an output dense layer of two softmax neurons (0: normal, 1: masquerader).

There are many pooling nonlinear functions that can be used, but among them max-pooling is the most used; it selects the maximum value in the given pooling window. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked pairs of convolutional and max-pooling layers to extract features from the entire input and then map these features to the next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to the hidden layers in the DNN. This means that the neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of the CNN that have been proposed in the literature, but the LeNet structure, which was proposed by LeCun et al. [44], is the most common approach used in many applications of computer vision and text classification.

Regarding its stability and high efficiency in text classification, we selected the CNN model which is proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place, where the used model encodes all letters in the input text file using a one-hot representation from a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the dropout layer type, with dropout probability equal to 0.5. In addition to that, the two fully connected layers in the used CNN model have a Sigmoid nonlinear activation function, and they have the same size of 2048 neurons each. The output layer in the used CNN model is of the dense layer type; it has a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate=0.01, epochs=30, and batch size=64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].
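A hedged Keras approximation of this character-level CNN is sketched below; the filter counts, kernel sizes, and the input length of 1014 characters are assumptions loosely following Zhang et al.'s setup, while the six convolution/max-pooling pairs, the two 2048-neuron sigmoid dropout layers, the two-neuron softmax output, and the SGD settings follow the description above.

```python
# Character-level CNN: one-hot 70-character alphabet over a fixed-length text.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Flatten,
                                     Dense, Dropout)
from tensorflow.keras.optimizers import SGD

model = Sequential()
# First convolution/max-pooling pair (wider kernel on raw characters).
model.add(Conv1D(256, 7, padding="same", activation="relu",
                 input_shape=(1014, 70)))
model.add(MaxPooling1D(2))
# Five more convolution/max-pooling pairs.
for _ in range(5):
    model.add(Conv1D(256, 3, padding="same", activation="relu"))
    model.add(MaxPooling1D(2))
model.add(Flatten())
# Two fully connected dropout layers of 2048 sigmoid neurons each.
model.add(Dense(2048, activation="sigmoid"))
model.add(Dropout(0.5))
model.add(Dense(2048, activation="sigmoid"))
model.add(Dropout(0.5))
# Output dense layer: two softmax neurons (normal vs. masquerader).
model.add(Dense(2, activation="softmax"))
model.compile(optimizer=SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
```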

In our work, we used the CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them will be used in a separate experiment. So we will have six separate CNN-experiments, and each experiment will be on one of the data configurations. The methodology of all of these experiments is the same and as follows: for the given data configuration D, we first prepared all of the given data configuration's text files such that each file represents the training and test sets of a user in D. Next to that, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we also moved each block in the training and test sets of the user U_i to a separate text file. This means that each of the training and test sets of the user U_i consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the i-th user in D. Then, we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.

6. Results and Discussion

We carried out three major empirical experiments, which are the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.


Figure 8: The flowchart of the CNN-experiments.

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class     Predicted: Normal User    Predicted: Masquerader
Normal User      TN                        FP
Masquerader      FN                        TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with TensorFlow 1.6 [61, 62], which backends over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition to that, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes for that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The definitions of the used evaluation metrics and their corresponding equations are as follows.

(i) Accuracy shows the rate of true detection over all test sets:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

(ii) Precision shows the rate of correctly classified masqueraders from all blocks in the test set that are classified as masqueraders:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (8)$$

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

$$\text{Recall} = \frac{TP}{TP + FN} \quad (9)$$

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

$$F1\text{-}\text{Score} = \frac{2}{1/P + 1/R} \quad (10)$$

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

$$\text{Hit Rate} = \frac{TP}{TP + FN} \quad (11)$$

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 − Hit); i.e., it shows the rate of masquerade blocks that are misclassified as a normal user from all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

$$\text{Miss Rate} = \frac{FN}{FN + TP} \quad (12)$$


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

$$\text{False Alarm Rate} = \frac{FP}{FP + TN} \quad (13)$$

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

$$\text{Cost} = MR + 6 \times FAR \quad (14)$$

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (base rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

$$\text{Bayesian Detection Rate} = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^{*}) \times P(A \mid I^{*})} \quad (15)$$

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also basedon Base-Rate Fallacy and shows the rate of trulyclassified normal blocks over all test set in which thepredicted normal behavior indicates really a normaluser [65] Let I and Ilowast denote a masquerade and anormal behavior respectively Moreover let A andAlowast denote the predicated masquerade and normalbehavior respectively Then BTNR can be computedas the probability P(Ilowast | Alowast) according to (16) [65]

$\text{Bayesian True Negative Rate} = P(I^{*} \mid A^{*}) = \frac{P(I^{*}) \times P(A^{*} \mid I^{*})}{P(I^{*}) \times P(A^{*} \mid I^{*}) + P(I) \times P(A^{*} \mid I)}$ (16)

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 − FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

$g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}}$ (17)

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of −1 to 1, where −1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the Confusion Matrix into consideration in its formula, which can be computed according to (18) [69]:

$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}}$ (18)
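For readers who want to reproduce the evaluation, the following minimal Python sketch (ours, not the paper's code; it assumes nonzero denominators) computes the twelve metrics of (7)-(18) directly from the confusion-matrix counts of one classifier:

```python
import math

def masquerade_metrics(tp, fp, tn, fn):
    """Compute metrics (7)-(18) from raw confusion-matrix counts.
    Values are returned as fractions; multiply by 100 for percentages."""
    total = tp + tn + fp + fn
    accuracy  = (tp + tn) / total                          # (7)
    precision = tp / (tp + fp)                             # (8)
    recall    = tp / (tp + fn)                             # (9)
    f1        = 2 / (1 / precision + 1 / recall)           # (10)
    hit       = recall                                     # (11) Hit Rate
    miss      = fn / (fn + tp)                             # (12) Miss Rate
    far       = fp / (fp + tn)                             # (13) False Alarm Rate
    cost      = miss + 6 * far                             # (14)
    p_i     = (tp + fn) / total    # base rate of masquerader blocks, P(I)
    p_istar = 1 - p_i              # rate of normal blocks, P(I*)
    bdr  = p_i * hit / (p_i * hit + p_istar * far)         # (15)
    tnr  = 1 - far                 # True Negative Rate, P(A* | I*)
    btnr = p_istar * tnr / (p_istar * tnr + p_i * miss)    # (16)
    gmean = math.sqrt(tp * tn / ((tp + fn) * (tn + fp)))   # (17)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))     # (18)
    return dict(Accuracy=accuracy, Precision=precision, Recall=recall,
                F1=f1, Hit=hit, Miss=miss, FAR=far, Cost=cost,
                BDR=bdr, BTNR=btnr, g_mean=gmean, MCC=mcc)
```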

In the following two subsections, we will present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model to detect masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1 as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN, LSTM-RNN, and CNN experiments. The rows labeled DNN and LSTM-RNN in Table 7 show results of the static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show results of the dynamic masquerade detection using the CNN model. Furthermore, in the original table the bold rows represent the best results within the same data configuration, whereas the underlined values are the best across all data configurations.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of DNN hyperparameters that maximize the accuracy, which means that the sum of the TP and TN outcomes will be increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN will definitely lead to an increase of the value of Hit as well as a decrease of the value of FAR.


Table 7: The results of our experiments (all evaluation metrics in %).

| Dataset | Data Configuration | Model | Accuracy | Precision | Recall | F1-Score | Hit | Miss | FAR | Cost | BDR | BTNR | g-mean | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SEA | SEA | DNN | 98.08 | 76.26 | 84.85 | 80.33 | 84.85 | 15.15 | 1.28 | 22.83 | 76.25 | 99.26 | 91.52 | 79.45 |
| SEA | SEA | LSTM-RNN | 98.52 | 82.30 | 86.58 | 84.39 | 86.58 | 13.42 | 0.90 | 18.83 | 82.33 | 99.34 | 92.63 | 83.64 |
| SEA | SEA | CNN | 98.84 | 87.77 | 87.01 | 87.39 | 87.01 | 12.99 | 0.59 | 16.51 | 87.72 | 99.37 | 93 | 86.78 |
| SEA | SEA 1v49 | DNN | 96.54 | 99.98 | 96.43 | 98.17 | 96.43 | 3.57 | 0.48 | 6.47 | 99.98 | 52.04 | 97.96 | 70.64 |
| SEA | SEA 1v49 | LSTM-RNN | 97.86 | 99.98 | 97.79 | 98.87 | 97.79 | 2.21 | 0.38 | 4.48 | 99.98 | 63.70 | 98.7 | 78.74 |
| SEA | SEA 1v49 | CNN | 98.78 | 99.99 | 98.74 | 99.36 | 98.74 | 1.26 | 0.19 | 2.40 | 99.99 | 75.51 | 99.27 | 86.22 |
| Greenberg | Greenberg Truncated | DNN | 93.97 | 92.23 | 80.67 | 86.06 | 80.67 | 19.33 | 2.04 | 31.57 | 92.22 | 94.41 | 88.89 | 82.53 |
| Greenberg | Greenberg Truncated | LSTM-RNN | 94.72 | 94.88 | 81.53 | 87.70 | 81.53 | 18.47 | 1.32 | 26.39 | 94.87 | 94.68 | 89.7 | 84.76 |
| Greenberg | Greenberg Truncated | CNN | 95.43 | 96.16 | 83.53 | 89.40 | 83.53 | 16.47 | 1.0 | 22.47 | 96.16 | 95.24 | 90.94 | 86.86 |
| Greenberg | Greenberg Enriched | DNN | 97.57 | 96.92 | 92.40 | 94.61 | 92.40 | 7.60 | 0.88 | 12.88 | 96.92 | 97.75 | 95.7 | 93.08 |
| Greenberg | Greenberg Enriched | LSTM-RNN | 97.98 | 97.57 | 93.60 | 95.54 | 93.60 | 6.40 | 0.70 | 10.60 | 97.56 | 98.10 | 96.41 | 94.28 |
| Greenberg | Greenberg Enriched | CNN | 98.60 | 98.55 | 95.33 | 96.92 | 95.33 | 4.67 | 0.42 | 7.19 | 98.55 | 98.61 | 97.43 | 96.03 |
| PU | PU Truncated | DNN | 81.0 | 99.59 | 78.61 | 87.86 | 78.61 | 21.39 | 2.25 | 34.89 | 99.59 | 39.49 | 87.66 | 54.63 |
| PU | PU Truncated | LSTM-RNN | 82.19 | 99.69 | 79.89 | 88.70 | 79.89 | 20.11 | 1.75 | 30.61 | 99.68 | 41.10 | 88.6 | 56.46 |
| PU | PU Truncated | CNN | 83.75 | 99.74 | 81.64 | 89.79 | 81.64 | 18.36 | 1.50 | 27.36 | 99.73 | 43.38 | 89.68 | 58.79 |
| PU | PU Enriched | DNN | 90.44 | 99.84 | 89.21 | 94.23 | 89.21 | 10.79 | 1.0 | 16.79 | 99.84 | 56.72 | 93.98 | 70.64 |
| PU | PU Enriched | LSTM-RNN | 91.31 | 99.88 | 90.18 | 94.78 | 90.18 | 9.82 | 0.75 | 14.32 | 99.88 | 59.08 | 94.61 | 72.61 |
| PU | PU Enriched | CNN | 93.75 | 99.92 | 92.93 | 96.30 | 92.93 | 7.07 | 0.50 | 10.07 | 99.92 | 66.78 | 96.16 | 78.52 |

Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than the corresponding values of the SEA data configuration, Hit values are dramatically increased in SEA 1v49 for all models, by 10-14% over those in the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49, compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models in terms of all used evaluation metrics over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened with the PU dataset, where its PU Enriched data configuration has better results for all models than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, for all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Due to that, enriched data configurations help models to build the user's behavior profile more accurately than truncated data configurations do. For all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. Overall, the data configurations can be sorted for all used models from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset as well as across all datasets.

Taking a closer look at Figures 9 and 10, we can notice the stability of the deep learning models, in such a way that they enhance masquerade detection from one data configuration to another in a consistent pattern.

Figure 9: Evaluation metrics comparison between models on data configurations. (a) Accuracy; (b) Hit Rate; (c) Miss Rate; (d) False Alarm Rate; (e) Cost; (f) Bayesian Detection Rate; (g) F1-Score; (h) Matthews Correlation Coefficient.

To explain that, we will discuss the obtained results from the perspective of the static and dynamic masquerade detection techniques. We used the DNN and LSTM-RNN models to perform a static masquerade detection task on data configurations with static numeric features. The DNN, as well as the LSTM-RNN, is supported with a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Owing to this, our DNN and LSTM-RNN models output the best masquerade detection outcomes they can reach for every user in the particular data configuration, and, as a result, their performance is enhanced significantly on that data configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. Overall, LSTM-RNN performed better than DNN in terms of all used evaluation metrics on all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small, between 1% and 3% for Hit and Accuracy and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Here CNN is used in a text classification task where the input is command text files for each user in the particular data configuration. The obtained results show clearly that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then presented to its fully connected layers, which train themselves to build the user's normal profile, used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, with Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, the dynamic masquerade detection technique is better than the static one. This gives the impression that dynamic masquerade detection is the best choice for UNIX command line-based datasets, due to the fact that these datasets are originally textual, and converting them to static numeric datasets may cost them a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR will increase as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR will increase as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller.
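To make the base-rate effect concrete, here is a short worked example of ours, using the SEA test-set structure (231 masquerader blocks and 4,769 normal blocks, 5,000 in total) together with the CNN row for SEA in Table 7 (Hit = 87.01%, FAR = 0.59%):

$P(I) = \frac{231}{5000} = 0.0462, \quad P(I^{*}) = 0.9538$

$BDR = \frac{0.0462 \times 0.8701}{0.0462 \times 0.8701 + 0.9538 \times 0.0059} \approx 0.8772$

which reproduces the 87.72% BDR reported for CNN on SEA in Table 7 and shows how BDR depends jointly on the base rate of masqueraders, Hit, and FAR.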

Figure 10: Evaluation metrics comparison for the average performance of the models on datasets. (a) Accuracy; (b) Hit Rate; (c) False Alarm Rate; (d) Bayesian Detection Rate; (e) F1-Score; (f) Matthews Correlation Coefficient.


Table 8: The results of the statistical tests. FS and FC are the Friedman test statistic and critical value; each (W, P-value) pair belongs to the Wilcoxon test for the paired groups p1, p2, and p3.

| Measurement | FS | FC | W (p1) | P-value (p1) | W (p2) | P-value (p2) | W (p3) | P-value (p3) |
|---|---|---|---|---|---|---|---|---|
| TP | 12 | 7 | 0 | 0.0025 | 0 | 0.0025 | 0 | 0.0025 |
| FP | 12 | 7 | 0 | 0.0025 | 0 | 0.0025 | 0 | 0.0025 |
| TN | 12 | 7 | 0 | 0.0025 | 0 | 0.0025 | 0 | 0.0025 |
| FN | 12 | 7 | 0 | 0.0025 | 0 | 0.0025 | 0 | 0.0025 |

Although all the used data configurations are imbalanced, all the used deep learning models got high g-mean percentages for all data configurations. The same holds for the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume the data comes from a particular distribution. In our case, we have three repeated treatments (k = 3), one for each of the used deep learning models, and six subjects (N = 6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k = 2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N = 6) in every treatment and three paired groups, namely, p1 = (DNN, LSTM-RNN), p2 = (DNN, CNN), and p3 = (LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is that the median difference is zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α = 0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
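As a hedged illustration of how the two tests can be run in practice, the following SciPy sketch uses made-up per-configuration scores (the paper's actual TP/FP/TN/FN counts per configuration are not listed here), with one list per model and one entry per data configuration:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-configuration scores of the three models
# (N = 6 data configurations, k = 3 treatments), for illustration only.
dnn      = [120, 98, 75, 88, 60, 52]
lstm_rnn = [125, 101, 78, 91, 63, 55]
cnn      = [130, 104, 82, 94, 66, 58]

# Friedman test across the three repeated treatments
stat, p = friedmanchisquare(dnn, lstm_rnn, cnn)
print(f"Friedman statistic = {stat:.3f}, p-value = {p:.4f}")

# Pairwise Wilcoxon signed-rank tests, to be compared against alpha = 0.05
pairs = {"p1 (DNN, LSTM-RNN)": (dnn, lstm_rnn),
         "p2 (DNN, CNN)": (dnn, cnn),
         "p3 (LSTM-RNN, CNN)": (lstm_rnn, cnn)}
for name, (a, b) in pairs.items():
    w, p = wilcoxon(a, b)
    print(f"{name}: W = {w}, p-value = {p:.4f}")
```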

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS > FC. This means that the scores of the used deep learning models for each measurement are different. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different.

Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations.

Finally, the reason for the identical results across all measurements is that the models, in the order CNN, LSTM-RNN, DNN, have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

Figures 12(a)-12(e) show a comparison between the performance of traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages for the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. In most cases, the used deep learning models considerably increased Hit percentages by 2-10% and decreased FAR percentages by 1-10% relative to the traditional machine learning models.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, which means that 50% of performance is achieved. The top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.

Figure 12: Models performance comparison for each data configuration. (a) SEA; (b) SEA 1v49; (c) Greenberg Truncated; (d) Greenberg Enriched; (e) PU Enriched.

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, have effective masquerade detection performance over all data configurations. However, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure to compare quantitatively between various ROC curves [73]. The AUC value of a ROC curve should be between 0 and 1; the ideal classifier will have an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three deep learning models, which are plotted in Figure 13.

We can clearly notice that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
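For completeness, a ROC curve and its AUC can be computed from per-block labels and classifier scores; the sketch below uses scikit-learn, which is not part of the paper's stated toolchain, and the eight labels/scores are invented placeholders:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical example: y_true holds block labels (1 = masquerader),
# y_score holds the classifier's predicted probabilities per block.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FAR vs. Hit pairs
roc_auc = auc(fpr, tpr)                            # area under the ROC curve
print(f"AUC = {roc_auc:.4f}")
```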

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field.

Table 9: AUC values of the ROC curves of the used models.

| Model | AUC |
|---|---|
| DNN | 0.9246 |
| LSTM-RNN | 0.9385 |
| CNN | 0.9617 |

Figure 13: ROC curves of the average performance of the used models over all data configurations.

Even though various research studies have focused on masquerade detection for more than a decade, deep studies of that field utilizing deep learning models are seldom. In this paper, we presented an extensive empirical study for masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most used in the literature, and implemented six different data configurations from these datasets. Masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, whereas the dynamic approach is performed using the CNN model, which extracts features from a user's command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models and analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models achieved strong masquerade detection performance on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static detection. The results analyses also proved the effectiveness of all used models in masquerade detection, in such a way that they increased Accuracy and Hit as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon a request to its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.
[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.
[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.
[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.
[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.
[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.
[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.
[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.
[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219-228, USA, June 2002.
[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.
[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.
[12] K. H. Yung, "Using self-consistent naive-Bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.
[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.
[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625-628, China, October 2006.
[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.
[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.
[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.
[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676-3679, China, September 2007.
[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.
[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.
[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-Bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.
[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based acceleration of deep neural networks using high level method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824-827, Poland, November 2015.
[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for hyper-parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.
[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.
[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951-2959, USA, December 2012.
[26] O. Ahmed Abdalla, A. Osman Elfaki, and Y. Mohammed AlMurtadha, "Optimizing the multilayer feed-forward artificial neural networks architecture and training parameters using genetic algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.
[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep multi-task learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141-146, Belgium, April 2016.
[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: a new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69-74, China, August 2016.
[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.
[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving deep neural networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659-1666, Spain, June 2017.
[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481-488, New York, NY, USA, July 2017.
[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864-1871, New York, NY, USA, July 2017.
[33] J. Nalepa and P. R. Lorenzo, "Convergence analysis of PSO for hyper-parameter selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.
[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.
[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.
[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.
[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.
[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945-1950, 1999.
[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.
[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.
[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.
[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.
[45] X. Zhang and Y. LeCun, "Text understanding from scratch," https://arxiv.org/abs/1502.01710v5.
[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.
[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.
[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.
[49] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.
[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.
[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: hierarchical deep learning for text classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.
[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent convolutional neural networks for text classification," AAAI, vol. 333, pp. 2267-2273, 2015.
[53] P. Liu, X. Qiu, and X. Huang, "Recurrent neural network for text classification with multi-task learning," https://arxiv.org/abs/1605.05101v1.
[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.
[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.
[56] S. Albelwi and A. Mahmood, "A framework for designing the architectures of deep convolutional neural networks," Entropy, vol. 19, no. 6, p. 242, 2017.
[57] "Python," https://www.python.org.
[58] "NumPy," http://www.numpy.org.
[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.
[60] "Keras," https://keras.io.
[61] M. Abadi, A. Agarwal, P. Barham, et al., "TensorFlow: large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.
[62] TensorFlow, https://www.tensorflow.org.
[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.
[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.
[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.
[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.
[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.
[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.
[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.
[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.
[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.
[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.
[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.



3.1. SEA Dataset. Recently published papers that focus on the masquerade detection area have used this dataset. SEA (Schonlau Et Al.) is a free UNIX command line-based dataset [7]. They used the UNIX acct audit tool to collect commands from 50 different users for several months. The SEA dataset contains a set of 15,000 commands for every user, and these commands contain only the command names issued by that user. For each user, the set of 15,000 commands is divided into 150 blocks, each with 100 commands. The first 50 blocks for each user are considered genuine and used as a training set. The remaining 100 blocks of each user are considered as a test set. Some of the test blocks are contaminated randomly with data of other users; i.e., each user has a varying number of masquerader blocks in his test set, from 0 to 24 blocks. Two associated data configurations have been used with this dataset in the literature: SEA and SEA 1v49.

3.1.1. SEA. This data configuration was proposed in the study [7]. A separate classifier is built for each of the 50 users. We trained each classifier to build two profiles: one profile for self-behavior, using the first 50 blocks of the particular user, and another profile for non-self-behavior, using the (49 × 50) training blocks of the other 49 users. The test set of each user will be the same as described in Section 3.1.

3.1.2. SEA 1v49. In this configuration, we followed the same methodology proposed in research [9]. A classifier is built for each user and trained only with the first 50 training blocks of its data. On the other hand, the test set for each user consists of the first 50 training blocks of each of the other 49 users, resulting in 2,450 masquerade blocks, in addition to its original normal blocks, which vary between 76 and 100 blocks.
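As an illustration of the block structure just described, a minimal sketch of ours (the names are not from the paper's code) could split one SEA user's command stream as follows:

```python
def split_into_blocks(commands, block_size=100):
    """Split one SEA user's 15,000-command stream into 150 blocks of
    100 commands each; the first 50 blocks are the genuine training
    data and the remaining 100 blocks form the test data."""
    blocks = [commands[i:i + block_size]
              for i in range(0, len(commands), block_size)]
    training, test = blocks[:50], blocks[50:150]
    return training, test
```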

3.2. Greenberg Dataset. This dataset was proposed in [16] and has been widely used in previous works. It contains commands collected from 168 UNIX users that used the csh shell. Users of this dataset are considered to be members of one of the following four groups: novice programmers, experienced programmers, computer scientists, and nonprogrammers. This dataset is enriched; i.e., it has sessions for each user, including information about the start and end time of the session, working directory, command names, command parameters, command aliases, and an error flag. Two associated data configurations have been used with this dataset in the literature: Greenberg Truncated and Greenberg Enriched.

3.2.1. Greenberg Truncated. In this configuration, we followed the same methodology conducted by [17]. First, we extracted the truncated command lines from the Greenberg dataset, which contain only the command names. Next, from the 168 users available in the Greenberg dataset, we randomly selected 50 users who have between 2,000 and 5,000 commands to act as normal users. Then we divided the commands of each of the 50 users into blocks, each with 10 commands. The first 100 blocks of each user form his training set, whereas the next 100 blocks are used as a validation of self-behavior in his test set. After that, we randomly selected an additional 25 users from the remaining 118 users to act as masqueraders. Then, for each of the 50 normal users, we randomly selected 30 blocks from the masqueraders' data and inserted them at random positions in his test set, which results in a total of 130 blocks for testing.

3.2.2. Greenberg Enriched. This follows the same methodology explained for Greenberg Truncated, with only one difference: for this data configuration, we extracted only the enriched command lines from the Greenberg dataset. An enriched command line means a concatenation of the command name and the command parameters entered by the user, together with any alias employed. As for the Greenberg Truncated data configuration described above, the Greenberg Enriched data configuration has, for each of the 50 normal users, 100 blocks for training and 130 blocks for testing.

3.3. PU Dataset. The Purdue University (PU) dataset was proposed in [19]. It contains sanitized commands collected from 8 different users at Purdue University over the course of up to 2 years. This dataset is enriched, which means that it contains, in addition to command names, command parameters, flags, and shell metacharacters. Furthermore, this dataset has sessions for each of the 8 users. In addition, the data of each user is processed into a token stream; a token here means either a command name or a command parameter. Two associated data configurations have been used with this dataset in the literature: PU Truncated and PU Enriched.

3.3.1. PU Truncated. For this configuration, we followed the same methodology used in [19]. First, we extracted only the truncated tokens from the PU dataset, i.e., the tokens that contain only command names. Next, for each of the 8 users available in the PU dataset, we divided his data into blocks, each of 10 tokens. The first 150 blocks of each user are considered as his training set. After that, the next 50 blocks for each user are used as a validation of self-behavior in his test set. To simulate masquerade activities, we added for each user the other seven users' testing data (7 × 50), which results in a total of 400 blocks of testing for each of the 8 users.

3.3.2. PU Enriched. This follows the same methodology explained for PU Truncated, with only one difference: for the PU Enriched data configuration, we extracted only the enriched tokens, i.e., all tokens from the PU dataset. As for the PU Truncated data configuration described in Section 3.3.1, the PU Enriched data configuration has, for each of the 8 users, 150 blocks for training and 400 blocks for testing. Table 3 summarizes all details about the data configurations.

4. DNN Hyperparameters Selection

In this section, we will present a Particle Swarm Optimization-based algorithm to select the hyperparameters of Deep Neural Networks (DNN). This algorithm will help us to proceed in our experiments to construct the DNN for masquerade detection, as will be explained in Section 5.1. A DNN is a multilayer Artificial Neural Network with many hidden layers. The weights of a DNN are fully connected; i.e., every neuron at any particular layer is connected to all neurons of the higher-order layer that is located adjacent

Table 3: The structure of the used data configurations.

| Characteristics | SEA | SEA 1v49 | Greenberg Truncated | Greenberg Enriched | PU Truncated | PU Enriched |
|---|---|---|---|---|---|---|
| Number of users | 50 | 50 | 50 | 50 | 8 | 8 |
| Block size | 100 | 100 | 10 | 10 | 10 | 10 |
| Blocks per user: training set | 2500 | 50 | 100 | 100 | 150 | 150 |
| Blocks per user: test set | 100 | 2526~2550 | 130 | 130 | 400 | 400 |
| Blocks per user: total | 2600 | 2576~2600 | 230 | 230 | 550 | 550 |
| Blocks for all users: training set | 125000 | 2500 | 5000 | 5000 | 1200 | 1200 |
| Blocks for all users: test set | 5000 | 127269 | 6500 | 6500 | 3200 | 3200 |
| Blocks for all users: total | 130000 | 129769 | 11500 | 11500 | 4400 | 4400 |
| Training set: normal | 2500 | 2500 | 5000 | 5000 | 1200 | 1200 |
| Training set: masquerader | 122500 | 0 | 0 | 0 | 0 | 0 |
| Training set: total | 125000 | 2500 | 5000 | 5000 | 1200 | 1200 |
| Test set: normal | 4769 | 4769 | 5000 | 5000 | 400 | 400 |
| Test set: masquerader | 231 | 122500 | 1500 | 1500 | 2800 | 2800 |
| Test set: total | 5000 | 127269 | 6500 | 6500 | 3200 | 3200 |

Figure 1: The basic structure of a typical DNN.

to that particular layer [4]. The information in a DNN is propagated in a feed-forward manner, that is, from inputs to outputs via hidden layers. Figure 1 depicts the basic structure of a typical DNN.
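Keras, which is cited in the references ([59, 60]), is one natural way to express such a network. The sketch below is purely illustrative: the layer sizes, activations, and optimizer are placeholders of ours, not the tuned values selected by the paper's PSO-based algorithm:

```python
from keras.models import Sequential
from keras.layers import Dense

# A minimal fully connected feed-forward DNN as in Figure 1: an input
# layer, two hidden layers, and a single-neuron output layer for the
# binary normal-vs-masquerader decision.
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=100))  # first hidden layer
model.add(Dense(64, activation='relu'))                 # second hidden layer
model.add(Dense(1, activation='sigmoid'))               # output layer
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```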

DNNs are widely used in various machine learning tasks. In addition, they have proved their ability to surpass most of the machine learning techniques in terms of performance [22]. However, the performance of any DNN relies on the selection of the values of its hyperparameters. DNN hyperparameters are defined as a set of critical parameters that control the architecture, behavior, and performance of that DNN in the underlying machine learning task. Indeed, there are two kinds of such hyperparameters: global parameters and layer-based parameters. The global parameters are those that define the general behavior of the DNN, such as the learning rate, the number of epochs, the batch size, the number of layers, and the used optimizer. On the other hand, layer-based parameter values depend on each layer in the DNN. Examples of layer-based parameters are, but are not limited to, the type of layer, the weight initialization method, the activation function, and the number of neurons.

The problem is that these hyperparameters vary from task to task, and they must be set before the training process. One familiar solution to overcome this problem is to find an expert who is conversant with the underlying machine learning task to tune the DNN hyperparameters precisely. Unfortunately, such an expert is not available in all cases. Another possible solution is to adjust these hyperparameters manually in a trial-and-error manner. This can be handled by searching the space of hyperparameters by executing either grid search or random search [23, 24]. A grid search is performed upon defined ranges of hyperparameters, where those ranges are identified previously depending on prior knowledge of the underlying task. After that, the user picks up hyperparameter values from the predefined ranges consecutively and tests the performance of the DNN on the training set. When every possible combination of hyperparameter values has been tested, the best combination is selected to configure the DNN and test it on the test set. Random search is similar to grid search, but instead of picking up hyperparameter values in a methodical manner, the user selects hyperparameter values from those predefined ranges randomly. In 2012, Snoek et al. proposed a hyperparameters selection method based on Bayesian optimization [25]. In this method, the user improves his knowledge of selecting hyperparameters by using the information gained from any given experiment to decide how to adjust the hyperparameters for the next experiment. Despite the good results that have been obtained in some cases by the grid, random, and Bayesian optimization searches, in general the complexity and large search space of DNN hyperparameter values make such manual algorithms infeasible and an exhausting searching process.

Evolutionary Algorithms (EAs) are metaheuristic algorithms which perform excellently for finding the global optima of a nonlinear function, especially when there are multiple local minima or maxima. EAs are considered very promising algorithms for solving the problem of DNN parameterization automatically. In the literature, a lot of studies have been proposed recently aiming at using EAs to optimize DNN hyperparameters in order to gain as high an accuracy value as possible. The Genetic Algorithm (GA), which is one of the most famous EAs, has been used to optimize the network parameters, and the Taguchi method is applied between the crossover and mutation operators, including initial weights definition [26]. GAs are also used in the pretraining step prior to the supervised step based on a multiclass classification task [27]. Another approach using GA to reduce the training time has been presented in [28]. The GA is used to enhance Deep Neural Networks by evolving a neural network's weights [29]. An automated GA-based approach has been proposed in [30] that optimizes DNN hyperparameters for malware classification tasks. Moreover, Particle Swarm Optimization is also one of the most well-known and popular EAs. Lorenzo et al. used PSO and proposed two approaches, the first sequential and the second parallel, to optimize the hyperparameters of any DNN [31, 32]. Then Nalepa and Lorenzo proved formally the convergence abilities of the former two approaches and tested them separately on a single workstation and on a cluster for the sequential and parallel approaches, respectively [33]. Finally, F. Ye proposed in 2017 an automatic PSO-based algorithm to select DNN hyperparameters in large-scale and high-dimensional data [34]. Thus, we decided to use PSO to enable us to select hyperparameters for DNN automatically. In Section 5.1, we will explain how to adapt this algorithm for the static classification experiments used in a masquerade detection scenario. Section 4.1 introduces a necessary and brief preface reviewing how standard PSO works. Then the rest of this section presents our proposed PSO-based algorithm to optimize DNN hyperparameters.

4.1. Particle Swarm Optimization. Particle Swarm Optimization (PSO) is a metaheuristic algorithm for optimizing nonlinear functions in a continuous search space. It was proposed by Eberhart and Kennedy in 1995 [35]. PSO tries to mimic the social behavior of animals. The swarm concept is a set of many members, which are called particles. The number of particles in the swarm is an integer value denoted by S and called the swarm size. Every particle in the particular swarm has two vectors of length N, where N is the number of the problem's defined variables (dimensions). The first vector is called the position vector, denoted by P, which identifies the current position of that particle in the search space of the problem. Each position vector can be considered a candidate solution of the problem. The second vector is called the velocity vector, denoted by V, which determines both the speed and direction of that particle in the search space of the problem at the next iteration. During the execution of PSO, another two vectors should be stored at every iteration. The first is called the personal best vector, denoted by $P_i^{best}$, which indicates the best position of the i-th particle in the swarm that has been explored so far. Each particle in the swarm has its own personal best vector, independent of the other particles, and it is updated at each iteration. The second vector is the global best vector, denoted by $G^{best}$, which indicates the best position that has been found over the swarm so far. There is a single global best vector for all particles in the swarm, and it is updated at every iteration. The personal best vector can be seen as the cognitive knowledge of the particle, whereas the global best vector represents the social knowledge of the swarm. Mathematically, for each particle i in the swarm S, at each iteration t, the velocity V and position P vectors are updated to the next iteration t+1 according to (1) and (2), respectively.

first is called personal best vector denoted by 119875119894119887119890119904119905 whichindicates the best position of the 119894th particle in the swarmthat has been explored so far Each particle in the swarm hasits independent personal best vector from the other particlesand it is updated at each iteration The second vector is theglobal best vector denoted by Gbest which indicates the bestposition that has been found over the swarm so far There isa single global best vector for all particles in the swarm andit is updated at every iteration It can be looked to personalbest vector as the cognitive knowledge of the particle whereasthe global best vector represents the social knowledge ofthe swarm Mathematically for each particle 119894 in the swarm119878 at each iteration 119905 the velocity 119881 and position 119875 vectorsare updated to next iteration t+1 according to (1) and (2)respectively

$V_i^{t+1} = W V_i^{t} + C_1 r_1(t) \left(P_i^{best} - P_i^{t}\right) + C_2 r_2(t) \left(G^{best} - P_i^{t}\right)$ (1)

$P_i^{t+1} = P_i^{t} + V_i^{t+1}$ (2)

W is the inertia weight constant, which controls the impact of the velocity of the particle at the current iteration on the next iteration, so the speed and direction of the particle are adjusted in order not to let the particle get outside the search space of the problem. Meanwhile, $C_1$ and $C_2$ are constants known as acceleration coefficients; $r_1$ and $r_2$ are random values uniformly distributed in [0, 1]. At the beginning of every iteration, new values of $r_1$ and $r_2$ are computed randomly, and they are constant for all particles in the swarm at that iteration. The goal of using the $C_1$, $C_2$, $r_1$, and $r_2$ constants is to scale both the cognitive knowledge of the particle and the social knowledge of the swarm on the velocity changes, so that the new position vectors of all particles approach the optimal solution of the problem accordingly. Figure 2 depicts the flowchart of the standard PSO.

In brief, the standard PSO works as follows. First, the user enters some required inputs, like the swarm size (S), the dimensions of the particles (N), the acceleration constants ($C_1$, $C_2$), the inertia weight constant (W), the fitness function (F) to score particle performance in the problem domain, and the maximum number of iterations ($t_{max}$). Next, PSO randomly initializes the position and velocity vectors with the specified dimensions for all particles in the swarm. Then PSO initializes the personal best vector for each particle in the swarm with the specified dimensions and sets it to a very small value; furthermore, PSO initializes the global best vector of the swarm with the specified dimensions and sets it to a very small value. PSO computes the fitness score for each particle using the fitness function and updates the personal best vectors for all particles and the global best vector of the swarm. After that, PSO starts the first iteration by computing $r_1$ and $r_2$ randomly, then updates the velocity and position vectors for each particle according to (1) and (2), respectively. In addition, PSO computes again the fitness score for each particle according to the given fitness function.

Figure 2: The flowchart of the standard PSO.

score of the personal best vector of that particle (119865(119875119894119905 ) gt119865(119875119894119887119890119904119905)) Also PSO updates the global best vector of theswarm if any of the fitness score of the personal best vectorof the particles is bigger than the fitness score of the globalbest vector of the swarm (119865(119875119894119887119890119904119905) gt 119865(119866119887119890119904119905) i=1 to S)Then PSO checks the stop criterion and if one is satisfiedPSO will output the global best vector as the optimal solutionand terminate Else PSO will proceed to the next iterationand repeat the same procedure described in the first iterationabove until the stop criterion is reached

The stop criterion is satisfied when either the training error is smaller than a predefined value ($\epsilon$) or the maximum number of iterations is reached. Finally, PSO performs better than GA in terms of simplicity and generality [36]. PSO is simpler than GA because it contains only one operator and is easy to implement. The generality of PSO means that PSO does not need any modifications to be applied to any optimization problem; moreover, it converges faster to the optimal solution, which decreases the computations and saves resources.
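To make this procedure concrete, the following minimal Python sketch (our own illustration, not the authors' code) implements a standard PSO loop for a maximization problem, using (1) and (2) for the velocity and position updates and stopping either when the improvement of the global best score falls below a threshold, as in the proposed algorithm below, or when $t_{max}$ is reached; the negative sphere function stands in as a toy fitness function.

```python
import numpy as np

def standard_pso(fitness, N, S=20, C1=2.0, C2=2.0, W=0.9,
                 t_max=30, eps=1e-4, bounds=(-5.0, 5.0)):
    """Minimal standard PSO (maximization) with updates (1) and (2)."""
    low, up = bounds
    P = np.random.uniform(low, up, size=(S, N))    # position vectors
    V = np.random.uniform(-1.0, 1.0, size=(S, N))  # velocity vectors
    p_best = P.copy()                              # personal best positions
    p_score = np.array([fitness(p) for p in P])    # personal best scores
    g_best = p_best[np.argmax(p_score)].copy()     # global best position
    g_score = p_score.max()
    for t in range(t_max):
        # r1 and r2 are drawn once per iteration and shared by all particles.
        r1, r2 = np.random.rand(), np.random.rand()
        V = W * V + C1 * r1 * (p_best - P) + C2 * r2 * (g_best - P)  # (1)
        P = P + V                                                    # (2)
        scores = np.array([fitness(p) for p in P])
        improved = scores > p_score                # update personal bests
        p_best[improved] = P[improved]
        p_score[improved] = scores[improved]
        prev = g_score
        if p_score.max() > g_score:                # update global best
            g_best = p_best[np.argmax(p_score)].copy()
            g_score = p_score.max()
        if g_score - prev < eps:                   # evolution-threshold stop
            break
    return g_best, g_score

# Toy usage: maximize the negative sphere function (optimum at the origin).
best_position, best_score = standard_pso(lambda x: -np.sum(x ** 2), N=3)
```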

4.2. DNN Hyperparameters Selection Using PSO. The selection of the hyperparameters of a DNN can be interpreted as an optimization task; hence, the main objective is to minimize the loss function $L(M, T)$, where $M$ is the DNN model and $T$ is the training set. To achieve this goal, we selected PSO as our optimization algorithm, which outputs the vector of optimized hyperparameters $H$ that minimized the loss function $L$ after constructing the DNN model $M$, which is tuned by the hyperparameters $H$ and trained on the training set $T$. The fitness function of our PSO-based algorithm is a function $F^{*}: R^N \rightarrow R$ that maps a real-valued vector of hyperparameters of length $N$ to the real-valued accuracy of the trained DNN that is tuned by that hyperparameters vector and tested on the test set $Z$. In other words, our PSO-based algorithm finds, among all possible combinations of hyperparameters, the optimal hyperparameters vector that maximizes the accuracy of the trained DNN on the test set. Furthermore, to ensure the generality of our PSO-based algorithm, meaning that it is independent of the DNN being optimized and can be adapted easily to any classification task using a DNN, we allow the user to select which hyperparameters to use. Therefore, the user of our algorithm is responsible for defining the number of hyperparameters as well as the type and domain of each parameter. The domain of a parameter is the set of all possible values of that parameter. After that, our PSO-based algorithm uses a special built-in generator that depends on the number and domains of the defined parameters to initialize all the particles (hyperparameters vectors) in the swarm.

During the execution of the proposed algorithm, at each iteration, a validation process is involved to validate that the updated position and velocity vectors fit the predefined ranges of the parameters. Finally, in order to reduce computations and converge faster, two different stop conditions are checked simultaneously at the end of each iteration. The first occurs when the fitness score of the global best vector increases by less than a threshold $\epsilon$, which is specified by the user; the aim of this condition is to stop once the global best vector cannot be improved further, even if the maximum number of iterations has not been reached yet. The second condition occurs when the maximum number of iterations is carried out. When either the first or the second condition is satisfied, the proposed algorithm outputs the global best vector as the optimal solution $H$ and terminates the search process. Figure 3 shows the flowchart of our PSO-based DNN hyperparameters selection algorithm.
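As a rough sketch of what $F^{*}$ might look like in Keras (our illustration; the function decodes only a few of the twelve hyperparameters defined later in Table 5, and in the actual algorithm the particle is a numeric vector that would first be decoded into such values):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def fitness(h, x_train, y_train, x_test, y_test):
    """F*: build a DNN tuned by the hyperparameters h, train it on T,
    test it on Z, and return the accuracy as the fitness score."""
    model = Sequential()
    model.add(Dense(int(h["neurons"]), activation=h["activation"],
                    input_dim=x_train.shape[1]))
    for _ in range(int(h["hidden_layers"]) - 1):
        model.add(Dense(int(h["neurons"]), activation=h["activation"]))
        model.add(Dropout(h["dropout_rate"]))
    model.add(Dense(1, activation="sigmoid"))  # normal user vs. masquerader
    model.compile(optimizer=h["optimizer"], loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=int(h["epochs"]),
              batch_size=int(h["batch_size"]), verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return accuracy
```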

4.3. Algorithm Steps

Inputs: Number of hyperparameters (N), swarm size (S), acceleration constants (C_1, C_2), inertia constant (W), maximum value of velocity (V_max), minimum value of velocity (V_min), maximum number of iterations (t_max), evolution threshold (ε), training set (T), and test set (Z).
Output: The optimal solution H.
Procedure:

Step 1. For k ← 1 to N:
    Let h_k be the kth hyperparameter
    If the domain of h_k is continuous then
        let B_k^low be the lower bound of h_k and B_k^up be the upper bound of h_k
        let the user enter the lower and upper bounds of the hyperparameter h_k
    End of if

    Else
        Let Y_k be the set of all possible values of h_k
        Let the user enter all elements of the set Y_k
    End of else
End of for

Figure 3: The flowchart of the proposed algorithm.

Step 2. Let F* be the fitness function, which constructs a DNN tuned with the given hyperparameters, then trains the DNN on T and tests it on Z. Finally, F* computes the accuracy of the DNN as output.

Step 3. Let G^best be the global best vector of the swarm, of length N.
    Let GS be the best fitness score of the swarm
    GS ← −∞

Step 4. For i ← 1 to S:
    Let P_i be the position vector of the ith particle, of length N
    Let V_i be the velocity vector of the ith particle, of length N
    Let P_i^best be the personal best vector of the ith particle, of length N
    Let PS_i be the fitness score of the personal best vector of the ith particle
    For j ← 1 to N:
        If the domain of h_j is continuous then
            select h_j uniformly distributed: P_i[j] ← U(B_j^low, B_j^up)
        End of if
        Else
            select h_j randomly: P_i[j] ← RAND(Y_j)
        End of else
        V_i[j] ← U(V_min, V_max)
    End of for
    P_i^best ← P_i
    Let FS_i be the fitness score of the ith particle
    FS_i ← F*(P_i)
    PS_i ← FS_i
    If FS_i > GS then
        G^best ← P_i
        GS ← FS_i
    End of if
End of for

Step 5. Let GS_prv be the previous best fitness score of the swarm
    GS_prv ← GS
    Let r_1 and r_2 be the random values in PSO
    Let t be the current iteration
    For t ← 1 to t_max:
        r_1 ← U(0, 1)
        r_2 ← U(0, 1)
        For i ← 1 to S:
            Update V_i according to (1)
            Update P_i according to (2)
            FS_i ← F*(P_i)
            If FS_i > PS_i then
                P_i^best ← P_i
                PS_i ← FS_i
            End of if
            If PS_i > GS then
                G^best ← P_i^best
                GS ← PS_i
            End of if
        End of for
        If GS − GS_prv < ε then
            go to Step 6
        End of if

        GS_prv ← GS
    End of for

Step 6. Let H be the optimal hyperparameters vector
    H ← G^best
    Return H and terminate.

4.4. PSO Parameters. The selection of the values of the PSO parameters ($S$, $V_{max}$, $V_{min}$, $C_1$, $C_2$, $W$, $t_{max}$, $\epsilon$) is a very complex process. Fortunately, many empirical and theoretical studies have been published to solve this problem [37–40], and they introduced recommended values for the PSO parameters. Table 4 shows every PSO parameter and the corresponding recommended value or range. Thus, for those parameters which have recommended ranges, we can select a value for each parameter from its range randomly and fix it as a constant during the execution of PSO.

Table 4: PSO parameters, recommended values or ranges.

Parameter   Value/Range
S           [5, 20]
V_min       0
V_max       1
C_1         2
C_2         2
W           [0.4, 0.9]
t_max       [30, 50]
ε           0.0001
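In code, fixing the PSO parameters from the recommended ranges of Table 4 can be as simple as the following illustrative assignment; the specific choices shown here match the values used later in our DNN-experiments.

```python
# Illustrative PSO settings drawn from the recommended ranges in Table 4.
pso_params = {
    "S": 20,       # swarm size, from [5, 20]
    "V_min": 0.0,
    "V_max": 1.0,
    "C1": 2.0,
    "C2": 2.0,
    "W": 0.9,      # inertia weight, from [0.4, 0.9]
    "t_max": 30,   # maximum number of iterations, from [30, 50]
    "eps": 1e-4,   # evolution threshold
}
```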

5. Experimental Setup and Models

This section explains the methodology of our empirical experiments as well as the deep learning models which we used to detect masquerades. As mentioned in Section 3, we selected three UNIX command line-based datasets (SEA, Greenberg, PU). Each of these datasets is a collection of text files in which each text file represents a user. The text file of each user in the particular dataset contains a set of UNIX commands that were issued by that user. This reflects the fact that these datasets do not contain any real masqueraders. However, to simulate masqueraders and to use these datasets in masquerade detection, special data configurations must be implemented prior to proceeding with our experiments. According to Section 3 and its subsections, each dataset has two different types of data configurations. Therefore, we obtained six data configurations, each of which is observed separately, which yields six independent experiments for each model. Finally, masquerade detection can be applied to these data configurations by following two different main approaches, namely, static classification and dynamic classification. The two subsequent subsections present the difference between them as well as which deep learning models are exploited for each one.

5.1. Static Classification Approach. In the static classification approach, the classification task is carried out using a dataset of samples which are represented by a set of static features [30]. These static features are defined according to the nature of the task where the classification will be applied. In addition, the dataset samples, also called observations, are collected manually by experts working in the field of that classification task. After that, these samples are split into two independent sets, known as the training and test sets, used to train and test the selected model, respectively. The static classification approach has pros and cons as well. Although it provides a faster and easier solution, it requires a ready-to-use dataset with static features. Such a dataset might not be available for some complex classification tasks; hence, the attempt to create a dataset with static features would be a hard mission. In our work, we decided to utilize the three famous UNIX command line-based datasets to implement six different data configurations. Each user in the particular data configuration has a specific number of blocks, which are represented by a set of static features. Indeed, these features are the user's UNIX commands, in charge of describing the behavior of that user and later helping the classifier to detect masquerades. We decided to use two well-known deep learning models, namely, Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN), to accomplish the static masquerade detection task on the implemented six data configurations.

5.1.1. Deep Neural Networks. In Section 4, we explained in detail the DNN structure and the problem of the selection of its hyperparameters. We also proposed a PSO-based algorithm to obtain the optimal hyperparameters vector that maximizes the accuracy of the DNN on the given training and test sets. In this subsection, we describe how we utilized the proposed PSO-based algorithm and the DNN in the static masquerade detection task on the six data configurations, which are SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched. Every data configuration has its own structure and a specific number of users, as described in Section 3. So we have six separate DNN-experiments, and each experiment is performed on one of the data configurations.

The methodology of our DNN-experiments consists of four consecutive stages, which are the initialization, optimization, results extraction, and finishing stages. The first stage is to initialize all required operating parameters as well as to prepare the particular data configuration's files, in which each file represents a user in that data configuration. The user file consists of the training set followed by the test set of that user. We set the PSO parameters for all DNN-experiments as follows: $S = 20$, $V_{min} = 0$, $V_{max} = 1$, $C_1 = C_2 = 2$, $W = 0.9$, $t_{max} = 30$, and $\epsilon = 10^{-4}$. The last step in the initialization stage is to define the hyperparameters of the DNN and their domains. We used twelve different DNN hyperparameters ($N = 12$). Table 5 shows each DNN hyperparameter and its corresponding defined domain. All the used hyperparameters are numerical, except that the Optimizer, Layer type, Initialization function, and Activation function hyperparameters are categorical. In this case, a list of all possible values is indexed to a numbered sequence from 1 to the length of that list.

Table 5: The used DNN hyperparameters and their domains.

Hyperparameter                        Domain         Description
Learning rate                         [0.01, 0.9]    Continuous
Momentum                              [0.1, 0.9]     Continuous
Decay                                 [0.001, 0.01]  Continuous
Dropout rate                          [0.1, 0.9]     Continuous
Number of hidden layers               [1, 10]        Discrete with step=1
Numbers of neurons of hidden layers   [1, 100]       Discrete with step=1
Number of epochs                      [5, 20]        Discrete with step=5
Batch size                            [100, 1000]    Discrete with step=50
Optimizer                             [1, 6]         Discrete with step=1
Initialization function               [1, 8]         Discrete with step=1
Layer type                            [1, 2]         Discrete with step=1
Activation function                   [1, 8]         Discrete with step=1

The Optimizer list includes the elements Adagrad, Nadam, Adam, Adamax, RMSprop, and SGD. The Layer type list contains two elements, which are Dropout and Dense. The Initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the Activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
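To make the domain definitions concrete, Table 5 can be encoded, for instance, as the following Python structure, with each categorical list indexed from 1 as described above (the lowercase Keras identifiers are our illustrative mapping of the listed names):

```python
# Categorical hyperparameter lists, indexed from 1 in the PSO encoding.
optimizers   = ["adagrad", "nadam", "adam", "adamax", "rmsprop", "sgd"]
layer_types  = ["dropout", "dense"]
initializers = ["zeros", "normal", "lecun_uniform", "uniform",
                "glorot_uniform", "glorot_normal", "he_uniform", "he_normal"]
activations  = ["linear", "softmax", "relu", "sigmoid",
                "tanh", "hard_sigmoid", "softsign", "softplus"]

# Domains of the twelve DNN hyperparameters from Table 5:
# ("continuous", low, high) or ("discrete", low, high, step).
domains = {
    "learning_rate": ("continuous", 0.01, 0.9),
    "momentum":      ("continuous", 0.1, 0.9),
    "decay":         ("continuous", 0.001, 0.01),
    "dropout_rate":  ("continuous", 0.1, 0.9),
    "hidden_layers": ("discrete", 1, 10, 1),
    "neurons":       ("discrete", 1, 100, 1),
    "epochs":        ("discrete", 5, 20, 5),
    "batch_size":    ("discrete", 100, 1000, 50),
    "optimizer":     ("discrete", 1, len(optimizers), 1),
    "initializer":   ("discrete", 1, len(initializers), 1),
    "layer_type":    ("discrete", 1, len(layer_types), 1),
    "activation":    ("discrete", 1, len(activations), 1),
}
```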

The optimization and results extraction stages will be performed once for each user in the particular data configuration; that is, they will be repeated for each user $U_i$, $i = 1, 2, \ldots, M$, where $M$ is the number of users in the particular data configuration $D$. The optimization stage starts by splitting the data of the user $U_i$ into two independent sets, $T_i$ and $Z_i$, which are the training and test sets of the $i$th user, respectively. The splitting process follows the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in $[0, 1]$. After that, we supply these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector $H_i$ for the $i$th user. In addition, we save a copy of the $H_i$ values in a database in order to save time and use them again in the RNN-experiment of that particular data configuration $D$, as will be presented in Section 5.1.2. The results extraction stage takes place by constructing the DNN that is tuned by $H_i$, training the DNN on $T_i$, and testing the DNN on $Z_i$. The values of the classification outcomes, True Positive ($TP_i$), False Positive ($FP_i$), True Negative ($TN_i$), and False Negative ($FN_i$), of the $i$th user in the particular data configuration $D$ are extracted and saved for further processing later.

Then the next user is observed, and the same procedure of the optimization and results extraction stages is performed, until the last user in the particular data configuration $D$ is reached. Finally, when all users in the particular data configuration are completed, the last stage (the finishing stage) is executed. The finishing stage computes the summation of all obtained TPs of all users in the particular data configuration $D$, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively:

$$TP = \sum_{i=1}^{M} TP_i \quad (3)$$

$$FP = \sum_{i=1}^{M} FP_i \quad (4)$$

$$TN = \sum_{i=1}^{M} TN_i \quad (5)$$

$$FN = \sum_{i=1}^{M} FN_i \quad (6)$$

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration $D$. The former outcomes will be used to compute twelve well-known evaluation metrics to assess the performance of the DNN on the particular data configuration $D$, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.

Figure 4: The flowchart of the DNN-experiments.
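As a minimal illustration of the finishing stage, the per-user outcomes saved during the results extraction stage are aggregated exactly as in (3)–(6); the two rows below are placeholder values, not real experimental outcomes.

```python
import numpy as np

# Row i holds the saved outcomes (TP_i, FP_i, TN_i, FN_i) of user U_i in D.
per_user_outcomes = np.array([
    [41, 3, 187, 9],    # placeholder values
    [38, 1, 192, 6],
])
TP, FP, TN, FN = per_user_outcomes.sum(axis=0)  # equations (3)-(6)
```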

5.1.2. Recurrent Neural Networks. The Recurrent Neural Network (RNN) is a special type of the traditional feed-forward Artificial Neural Network. Unlike in a traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to the other neurons of the same hidden layer. Therefore, the output of the RNN's hidden layer at any time step ($t$) is a function of the current inputs and of the output of the hidden layer at the previous time step ($t-1$). In the RNN, these directed cycles allow information to circulate in the network and make the hidden layers act as the storage unit of the whole network [41]. The important characteristics of the RNN are its capability to have memory and to generate periodical sequences.

Figure 5: The structure of an LSTM cell [6].

Despite these capabilities, the conventional RNN structure described above has a serious problem, especially when the RNN is trained using the back-propagation technique. The problem is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops; the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem of the conventional RNN limited its use to short-term memory tasks only. To solve this problem, a new architecture of the RNN was proposed by Hochreiter and Schmidhuber [43], known as Long Short-Term Memory (LSTM). LSTM uses a new structure called a memory cell that is composed of four parts, which are an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter into the memory cell or to block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell to alter the current state of the memory cell or to prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Beyond overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].
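For reference, one common formulation of the LSTM memory cell of Figure 5, with input gate $i_t$, forget gate $f_t$, output gate $o_t$, cell state $c_t$, and hidden output $h_t$ ($\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication; this is the widely used standard formulation rather than notation taken from [43]):

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$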

In this study, we used the LSTM-RNN model to perform a static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate LSTM-RNN-experiments, and each experiment is performed on one of the data configurations. The methodology of all of these experiments is the same and is as follows. For the given data configuration $D$, we first prepared all the given data configuration's files by converting all blocks from text to numerical values and then normalizing them in $[0, 1]$. Next, for each user $U_i$ in $D$, where $i = 1, 2, \ldots, M$ and $M$ is the number of users in $D$, we did the following steps. We split the data of $U_i$ into two independent sets, $T_i$ and $Z_i$, which are the training and test sets of the $i$th user in $D$, respectively; the splitting process followed the structure of the particular data configuration, which is described in Section 3. After that, we retrieved the stored optimized hyperparameters vector of the $i$th user ($H_i$) from the database that was created in the previous DNN-experiments. Then we constructed the RNN model that is tuned by $H_i$. In order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced with an LSTM memory cell. The constructed LSTM-RNN model is trained on $T_i$ and then tested on $Z_i$. After the test process finished, we extracted and saved the outcomes $TP_i$, $FP_i$, $TN_i$, and $FN_i$ of the $i$th user in $D$. Then we proceed to the next user in $D$ and do the same previous steps until the last user in $D$ is reached. After all users in $D$ are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration $D$ by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.
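As an illustration (our own sketch, with placeholder values standing in for the retrieved $H_i$), constructing such an LSTM-RNN in Keras might look as follows, where each input block is fed as a sequence of normalized command values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Placeholder values standing in for the retrieved hyperparameter vector H_i.
block_len, neurons, dropout_rate = 10, 64, 0.3

model = Sequential()
# Each input block is a sequence of block_len normalized command values.
model.add(LSTM(neurons, return_sequences=True, input_shape=(block_len, 1)))
model.add(Dropout(dropout_rate))
model.add(LSTM(neurons))                   # LSTM cells replace hidden neurons
model.add(Dense(1, activation="sigmoid"))  # normal user vs. masquerader
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(...) on T_i and model.evaluate(...) on Z_i then follow Figure 6.
```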

Figure 6: The flowchart of the LSTM-RNN-experiments.

5.2. Dynamic Classification Approach. In contrast to the static classification approach, the dynamic classification approach does not need a ready-to-use dataset with static features [30]. It deals directly with raw data sources, such as text, image, video, sound, and signal files, and extracts features from them dynamically. The models that use this approach try to learn and represent features in an unsupervised manner; then these models train themselves using the extracted features to be able to classify unseen data. Deep learning models fit very well with this approach because the main objectives of deep learning models are a strong ability for automatic feature extraction and self-learning. Besides overcoming the problem of the lack of datasets, dynamic classification models also perform more efficiently than static classification models. Despite these advantages, the dynamic classification approach also has drawbacks. Dynamic classification models are slower and take a long time to train compared with static classification models, due to the complex deep structure of these models as well as the huge amount of computations that are required. Furthermore, dynamic classification models require a very large number of input samples to gain high accuracy values.

In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection to these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of two classes, either a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Furthermore, sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. According to the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of research has been accomplished in the literature in the field of text classification using deep learning models. It was started by LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN) known as the LeNet family and used it in text classification efficiently [44]. Then various studies were published to introduce text classification algorithms as well as the factors that impact performance [45–47]. In the study [48], the CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchal approach called Hierarchal Deep Learning for Text classification (HDLTex) is proposed, and three deep structures, which are DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced in [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification with a multitask learning framework [53]. The study [54] proposed a new model called the hierarchal attention network for document classification, which was tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the most used deep learning model for text classification tasks. So we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

Figure 7: The architecture of the used CNN model.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model which is biologically inspired by the animal visual cortex. The CNN can be considered a special type of the traditional feed-forward Artificial Neural Network. The major difference between the ANN and the CNN is that, instead of the fully connected architecture of the ANN, the individual neurons in the CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data is entered into the CNN. The first convolutional layer in the CNN consists of individual neurons that are each connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters, and each filter is applied to the specified subset of the preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input. The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of the preceding layer's output. There are many pooling nonlinear functions that can be used, but among them max-pooling is the most used, which selects the maximum value in the given pooling window. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked pairs of convolutional and max-pooling layers to extract features from the entire input and then map these features to the next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to the hidden layers in the DNN. This means that the neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of the CNN proposed in the literature, but the LeNet structure, proposed by LeCun et al. [44], is the most common approach used in many applications of computer vision and text classification.

Regarding its stability and high efficiency in text classification, we selected the CNN model proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place, where the used model encodes all letters in the input text file using a one-hot representation from a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the dropout-layer type, with a dropout probability equal to 0.5. In addition, the two fully connected layers have a Sigmoid nonlinear activation function and the same size of 2048 neurons each. The output layer in the used CNN model is of the dense-layer type, has a softmax activation function, and has a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate = 0.01, epochs = 30, and batch size = 64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].
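A compressed Keras sketch in the spirit of the used character-level CNN follows; the number of filters, the kernel widths, and the maximum input length are illustrative assumptions, and only two of the six convolution/max-pooling pairs are written out.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import SGD

alphabet_size, max_len = 70, 1014  # one-hot quantization over a 70-character
                                   # alphabet; max_len is an assumed input length

model = Sequential()
model.add(Conv1D(256, 7, activation="relu", input_shape=(max_len, alphabet_size)))
model.add(MaxPooling1D(3))
# ...the used model stacks six convolution/max-pooling pairs in total...
model.add(Conv1D(256, 3, activation="relu"))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(2048, activation="sigmoid"))
model.add(Dropout(0.5))
model.add(Dense(2048, activation="sigmoid"))
model.add(Dropout(0.5))
model.add(Dense(2, activation="softmax"))  # 0 = normal user, 1 = masquerader
model.compile(optimizer=SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x, y, epochs=30, batch_size=64) as stated in the text.
```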

In our work, we used the CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate CNN-experiments, and each experiment is performed on one of the data configurations. The methodology of all of these experiments is the same and is as follows. For the given data configuration $D$, we first prepared all the given data configuration's text files such that each file represents the training and test sets of a user in $D$. Next, for each user $U_i$ in $D$, where $i = 1, 2, \ldots, M$ and $M$ is the number of users in $D$, we did the following steps. We split the data of $U_i$ into two independent sets, $T_i$ and $Z_i$, which are the training and test sets of the $i$th user in $D$, respectively; the splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we moved each block in the training and test sets of the user $U_i$ to a separate text file. This means that each of the training and test sets of the user $U_i$ consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on $T_i$ and then tested on $Z_i$. After the test process finished, we extracted and saved the outcomes $TP_i$, $FP_i$, $TN_i$, and $FN_i$ of the $i$th user in $D$. Then we proceed to the next user in $D$ and do the same previous steps until the last user in $D$ is reached. After all users in $D$ are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration $D$ by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.

6. Results and Discussion

We carried out three major empirical experiments, which are the DNN-experiments, the LSTM-RNN-experiments, and the CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.

Figure 8: The flowchart of the CNN-experiments.

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class     Predicted: Normal User    Predicted: Masquerader
Normal User      TN                        FP
Masquerader      FN                        TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with TensorFlow 1.6 [61, 62], which runs its backend over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes for that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, the Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The definitions of the used evaluation metrics and their corresponding equations are as follows:

(i) Accuracy shows the rate of true detection over the whole test set:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

(ii) Precision shows the rate of correctly classified masqueraders over all blocks in the test set that are classified as masqueraders:

$$Precision = \frac{TP}{TP + FP} \quad (8)$$

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

$$Recall = \frac{TP}{TP + FN} \quad (9)$$

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

$$F1\text{-}Score = \frac{2}{1/P + 1/R} \quad (10)$$

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

$$Hit\ Rate = \frac{TP}{TP + FN} \quad (11)$$

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 − Hit); that is, it shows the rate of masquerade blocks that are misclassified as a normal user over all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

$$Miss\ Rate = \frac{FN}{FN + TP} \quad (12)$$

(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

$$False\ Alarm\ Rate = \frac{FP}{FP + TN} \quad (13)$$

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

$$Cost = MR + 6 \times FAR \quad (14)$$

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (the base rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let $I$ and $I^{*}$ denote a masquerade and a normal behavior, respectively; moreover, let $A$ and $A^{*}$ denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability $P(I \mid A)$ according to (15) [65]:

$$Bayesian\ Detection\ Rate = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^{*}) \times P(A \mid I^{*})} \quad (15)$$

$P(I)$ is the rate of the masquerader blocks in the test set, $P(A \mid I)$ is the Hit Rate, $P(I^{*})$ is the rate of the normal blocks in the test set, and $P(A \mid I^{*})$ is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior really indicates a normal user [65]. With $I$, $I^{*}$, $A$, and $A^{*}$ as above, BTNR can be computed as the probability $P(I^{*} \mid A^{*})$ according to (16) [65]:

$$Bayesian\ True\ Negative\ Rate = P(I^{*} \mid A^{*}) = \frac{P(I^{*}) \times P(A^{*} \mid I^{*})}{P(I^{*}) \times P(A^{*} \mid I^{*}) + P(I) \times P(A^{*} \mid I)} \quad (16)$$

$P(I^{*})$ is the rate of the normal blocks in the test set, $P(A^{*} \mid I^{*})$ is the True Negative Rate, which is easily obtained by calculating $(1 - FAR)$, $P(I)$ is the rate of the masquerader blocks in the test set, and $P(A^{*} \mid I)$ is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold, where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

$$g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \quad (17)$$

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes (an imbalanced dataset) [68]. MCC has a range of −1 to 1, where −1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \quad (18)$$
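Given the four outcomes, the twelve metrics reduce to a few lines of code; the following sketch mirrors (7)–(18) directly (rates are returned as fractions rather than percentages, and nonzero denominators are assumed):

```python
from math import sqrt

def evaluation_metrics(TP, FP, TN, FN):
    """Compute the twelve evaluation metrics (7)-(18) from the outcomes."""
    P, N = TP + FN, TN + FP      # masquerader and normal blocks in the test set
    hit = TP / (TP + FN)         # (11), identical to Recall (9)
    miss = FN / (FN + TP)        # (12)
    far = FP / (FP + TN)         # (13)
    precision = TP / (TP + FP)   # (8)
    p_i, p_n = P / (P + N), N / (P + N)  # base rates P(I) and P(I*)
    return {
        "accuracy": (TP + TN) / (P + N),                          # (7)
        "precision": precision,
        "recall": hit,
        "f1_score": 2 / (1 / precision + 1 / hit),                # (10)
        "hit": hit,
        "miss": miss,
        "far": far,
        "cost": miss + 6 * far,                                   # (14)
        "bdr": p_i * hit / (p_i * hit + p_n * far),               # (15)
        "btnr": p_n * (1 - far) / (p_n * (1 - far) + p_i * miss), # (16)
        "g_mean": sqrt((TP * TN) / ((TP + FN) * (TN + FP))),      # (17)
        "mcc": (TP * TN - FP * FN) / sqrt((TP + FN) * (TP + FP)
               * (TN + FP) * (TN + FN)),                          # (18)
    }
```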

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curve analysis.

6.1. Performance Analysis. The effectiveness of any model in detecting masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1 as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. The rows labeled DNN and LSTM-RNN in Table 7 show the results of static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show the results of dynamic masquerade detection using the CNN model. Furthermore, within each data configuration, the best results among the three models are those of the CNN rows.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of the DNN hyperparameters so as to maximize the accuracy, which means that the sum of the TP and TN outcomes will be increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN leads to an increase in the value of Hit as well as to a decrease in the value of FAR.

Table 7: The results of our experiments (evaluation metrics in %).

Dataset   | Data Configuration  | Model    | Accuracy | Precision | Recall | F1-Score | Hit   | Miss  | FAR  | Cost  | BDR   | BTNR  | g-mean | MCC
SEA       | SEA                 | DNN      | 98.08    | 76.26     | 84.85  | 80.33    | 84.85 | 15.15 | 1.28 | 22.83 | 76.25 | 99.26 | 91.52  | 79.45
SEA       | SEA                 | LSTM-RNN | 98.52    | 82.30     | 86.58  | 84.39    | 86.58 | 13.42 | 0.90 | 18.83 | 82.33 | 99.34 | 92.63  | 83.64
SEA       | SEA                 | CNN      | 98.84    | 87.77     | 87.01  | 87.39    | 87.01 | 12.99 | 0.59 | 16.51 | 87.72 | 99.37 | 93     | 86.78
SEA       | SEA 1v49            | DNN      | 96.54    | 99.98     | 96.43  | 98.17    | 96.43 | 3.57  | 0.48 | 6.47  | 99.98 | 52.04 | 97.96  | 70.64
SEA       | SEA 1v49            | LSTM-RNN | 97.86    | 99.98     | 97.79  | 98.87    | 97.79 | 2.21  | 0.38 | 4.48  | 99.98 | 63.70 | 98.7   | 78.74
SEA       | SEA 1v49            | CNN      | 98.78    | 99.99     | 98.74  | 99.36    | 98.74 | 1.26  | 0.19 | 2.40  | 99.99 | 75.51 | 99.27  | 86.22
Greenberg | Greenberg Truncated | DNN      | 93.97    | 92.23     | 80.67  | 86.06    | 80.67 | 19.33 | 2.04 | 31.57 | 92.22 | 94.41 | 88.89  | 82.53
Greenberg | Greenberg Truncated | LSTM-RNN | 94.72    | 94.88     | 81.53  | 87.70    | 81.53 | 18.47 | 1.32 | 26.39 | 94.87 | 94.68 | 89.7   | 84.76
Greenberg | Greenberg Truncated | CNN      | 95.43    | 96.16     | 83.53  | 89.40    | 83.53 | 16.47 | 1.0  | 22.47 | 96.16 | 95.24 | 90.94  | 86.86
Greenberg | Greenberg Enriched  | DNN      | 97.57    | 96.92     | 92.40  | 94.61    | 92.40 | 7.60  | 0.88 | 12.88 | 96.92 | 97.75 | 95.7   | 93.08
Greenberg | Greenberg Enriched  | LSTM-RNN | 97.98    | 97.57     | 93.60  | 95.54    | 93.60 | 6.40  | 0.70 | 10.60 | 97.56 | 98.10 | 96.41  | 94.28
Greenberg | Greenberg Enriched  | CNN      | 98.60    | 98.55     | 95.33  | 96.92    | 95.33 | 4.67  | 0.42 | 7.19  | 98.55 | 98.61 | 97.43  | 96.03
PU        | PU Truncated        | DNN      | 81.0     | 99.59     | 78.61  | 87.86    | 78.61 | 21.39 | 2.25 | 34.89 | 99.59 | 39.49 | 87.66  | 54.63
PU        | PU Truncated        | LSTM-RNN | 82.19    | 99.69     | 79.89  | 88.70    | 79.89 | 20.11 | 1.75 | 30.61 | 99.68 | 41.10 | 88.6   | 56.46
PU        | PU Truncated        | CNN      | 83.75    | 99.74     | 81.64  | 89.79    | 81.64 | 18.36 | 1.50 | 27.36 | 99.73 | 43.38 | 89.68  | 58.79
PU        | PU Enriched         | DNN      | 90.44    | 99.84     | 89.21  | 94.23    | 89.21 | 10.79 | 1.0  | 16.79 | 99.84 | 56.72 | 93.98  | 70.64
PU        | PU Enriched         | LSTM-RNN | 91.31    | 99.88     | 90.18  | 94.78    | 90.18 | 9.82  | 0.75 | 14.32 | 99.88 | 59.08 | 94.61  | 72.61
PU        | PU Enriched         | CNN      | 93.75    | 99.92     | 92.93  | 96.30    | 92.93 | 7.07  | 0.50 | 10.07 | 99.92 | 66.78 | 96.16  | 78.52

Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than the corresponding values of the SEA data configuration, the Hit values are dramatically increased in SEA 1v49 for all models, by 10–14%, over those of the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49, compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models, in terms of all used evaluation metrics, over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including the command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened with the PU dataset, where its PU Enriched data configuration yielded better results, for all models, than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, regarding all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Because of that, enriched data configurations help the models build a user's behavior profile more accurately than truncated data configurations do. Regarding all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted, for all used models, from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)–9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models on each data configuration, respectively. Figures 10(a)–10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models on each data configuration and dataset, as well as over all datasets.

By taking an inspective look at Figures 9 and 10, we can notice the stability of the deep learning models, in the sense that they enhance masquerade detection from one data configuration to another in a consistent pattern.

Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy; (b) Hit Rate; (c) Miss Rate; (d) False Alarm Rate; (e) Cost; (f) Bayesian Detection Rate; (g) F1-Score; (h) Matthews Correlation Coefficient.

To explain that, we will discuss the obtained results from the perspective of the static and dynamic masquerade detection techniques. We used the DNN and LSTM-RNN models to perform a static masquerade detection task on the data configurations, which have static numeric features. The DNN, as well as the LSTM-RNN, is supported by a PSO-based algorithm that optimizes their hyperparameters to maximize the accuracy on the given training and test sets of a user. Given this fact, our DNN and LSTM-RNN models output masquerade detection outcomes that are as good as they can reach for every user in the particular data configuration; accordingly, their performance is enhanced significantly on that data configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. Anyway, LSTM-RNN performed better than DNN in terms of all used evaluation metrics, on all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between the memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small, between 1% and 3% for Hit and Accuracy and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is the command text files of each user in the particular data configuration. The obtained results clearly show that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then presented to its fully connected layers to train the model to build the user's normal profile, which is used later to detect masquerade attacks efficiently. This dynamic process and the self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, with Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, the dynamic masquerade detection technique is better than the static one. This gives the impression that dynamic masquerade detection is the best choice for masquerade detection regarding UNIX command line-based datasets, due to the fact that these datasets are originally textual datasets, and converting them to static numeric datasets may lose a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller.

Figure 10: Evaluation metrics comparison for the average performance of the models on datasets: (a) Accuracy; (b) Hit Rate; (c) False Alarm Rate; (d) Bayesian Detection Rate; (e) F1-Score; (f) Matthews Correlation Coefficient.

Table 8: The results of the statistical tests.

Measurement | Friedman FS | Friedman FC | p1: W | p1: P value | p2: W | p2: P value | p3: W | p3: P value
TP          | 12          | 7           | 0     | 0.0025      | 0     | 0.0025      | 0     | 0.0025
FP          | 12          | 7           | 0     | 0.0025      | 0     | 0.0025      | 0     | 0.0025
TN          | 12          | 7           | 0     | 0.0025      | 0     | 0.0025      | 0     | 0.0025
FN          | 12          | 7           | 0     | 0.0025      | 0     | 0.0025      | 0     | 0.0025

Although all the used data configurations are imbalanced, all the used deep learning models got high g-mean percentages on all data configurations. The same thing happened with the MCC metric, where all the used deep learning models recorded high percentages on all data configurations except PU Truncated.

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume that the data come from a particular distribution. In our case, we have three repeated treatments (k=3), one for each of the used deep learning models, and six subjects (N=6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k=2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N=6) in every treatment and three paired groups, namely, p1=(DNN, LSTM-RNN), p2=(DNN, CNN), and p3=(LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α=0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
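To make the procedure concrete, the following is a minimal sketch of how such tests can be run with SciPy; the score arrays below are hypothetical placeholders, not the measurements behind Table 8.

```python
# A minimal sketch of the Friedman and pairwise Wilcoxon tests with SciPy.
from scipy.stats import friedmanchisquare, wilcoxon

# One array per model: a score (e.g., a TP count) for each of the six
# data configurations (N=6 subjects, k=3 treatments); values are hypothetical.
dnn_scores  = [4312, 4587, 4955, 5123, 1102, 1140]
lstm_scores = [4388, 4650, 5010, 5200, 1120, 1155]
cnn_scores  = [4460, 4720, 5080, 5290, 1150, 1170]

# Friedman test over the three repeated treatments.
fs, p_friedman = friedmanchisquare(dnn_scores, lstm_scores, cnn_scores)
print(f"Friedman statistic={fs:.4f}, p-value={p_friedman:.4f}")

# Pairwise Wilcoxon signed-rank tests for the three paired groups.
pairs = {"p1 (DNN, LSTM-RNN)": (dnn_scores, lstm_scores),
         "p2 (DNN, CNN)": (dnn_scores, cnn_scores),
         "p3 (LSTM-RNN, CNN)": (lstm_scores, cnn_scores)}
for name, (a, b) in pairs.items():
    w, p = wilcoxon(a, b)
    print(f"{name}: W={w}, p-value={p:.4f}")
```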

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS>FC. This means that the scores of the used deep learning models for each measurement are different. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, we obtained a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025<0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different.

Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations.

Finally, the reason why all measurements give the same results is that the models, in the order CNN, LSTM-RNN, and DNN, have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

Figures 12(a), 12(b), 12(c), 12(d), and 12(e) show a comparison between the performance of traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of traditional machine learning and that of the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. In most cases, the used deep learning models increased Hit percentages considerably, by 2-10%, and decreased FAR percentages by 1-10%, compared with traditional machine learning models.

6.2 ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, which means that 50% of the performance is achieved. The top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.


Figure 12: Models performance comparison for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched.

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, achieve effective masquerade detection performance over all data configurations. Moreover, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing quantitatively between various ROC curves [73]. The AUC value of a ROC curve lies between 0 and 1; the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13.

We can clearly notice that all these models have very high AUC values, almost reaching 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
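For illustration, the following is a minimal scikit-learn sketch of plotting a ROC curve and computing its AUC; the labels and scores are synthetic placeholders rather than the outputs of the models above.

```python
# A minimal sketch of a ROC curve and its AUC with scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # hypothetical block labels
y_score = y_true * 0.6 + rng.random(1000) * 0.4   # hypothetical model scores

fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"model (AUC = {roc_auc:.4f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="reference line")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```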

7 Conclusions

Masquerade detection is one of the most important issues in the computer security field.


Table 9: AUC values of ROC curves of the used models.

| Model | AUC |
| DNN | 0.9246 |
| LSTM-RNN | 0.9385 |
| CNN | 0.9617 |

Figure 13: ROC curves of the average performance of the used models over all data configurations.

Although various research studies have focused on masquerade detection for more than a decade, in-depth studies in this field utilizing deep learning models are seldom found. In this paper, we presented an extensive empirical study for masquerade detection using DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most commonly used in the literature. In addition, we implemented six different data configurations from these datasets. Masquerade detection on these data configurations was carried out using two approaches: the first is static and the second is dynamic. The static approach was performed using the DNN and LSTM-RNN models, which were applied to data configurations with static numeric features, whereas the dynamic approach was performed using the CNN model, which extracted features from the user's command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and was used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models performed well in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static detection. The results analyses also proved the effectiveness of all used models in masquerade detection, in such a way that they increased Accuracy and Hit as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html, the Greenberg dataset upon request from its owner at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme, and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A study on masquerade detection, 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219-228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625-628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.

[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676-3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824-827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951-2959, USA, December 2012.

[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141-146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69-74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659-1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481-488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864-1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945-1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.

[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267-2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org.

[58] "NumPy," http://www.numpy.org.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io.

[61] M. Abadi, A. Agarwal, P. Barham et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] "TensorFlow," https://www.tensorflow.org.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.

[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.



Table 3: The structure of the used data configurations.

| Characteristics | | SEA | SEA 1v49 | Greenberg Truncated | Greenberg Enriched | PU Truncated | PU Enriched |
| Number of users | | 50 | 50 | 50 | 50 | 8 | 8 |
| Block size | | 100 | 100 | 10 | 10 | 10 | 10 |
| Number of blocks for every user | Training set | 2500 | 50 | 100 | 100 | 150 | 150 |
| | Test set | 100 | 2526~2550 | 130 | 130 | 400 | 400 |
| | Total | 2600 | 2576~2600 | 230 | 230 | 550 | 550 |
| Number of blocks for all users | Training set | 125000 | 2500 | 5000 | 5000 | 1200 | 1200 |
| | Test set | 5000 | 127269 | 6500 | 6500 | 3200 | 3200 |
| | Total | 130000 | 129769 | 11500 | 11500 | 4400 | 4400 |
| Distribution of the training set | Normal | 2500 | 2500 | 5000 | 5000 | 1200 | 1200 |
| | Masquerader | 122500 | 0 | 0 | 0 | 0 | 0 |
| | Total | 125000 | 2500 | 5000 | 5000 | 1200 | 1200 |
| Distribution of the test set | Normal | 4769 | 4769 | 5000 | 5000 | 400 | 400 |
| | Masquerader | 231 | 122500 | 1500 | 1500 | 2800 | 2800 |
| | Total | 5000 | 127269 | 6500 | 6500 | 3200 | 3200 |

Figure 1: The basic structure of a typical DNN.

to that particular layer [4]. The information in a DNN is propagated in a feed-forward manner, that is, from inputs to outputs via the hidden layers. Figure 1 depicts the basic structure of a typical DNN.

DNNs are widely used in various machine learning tasks. In addition, they have proved their ability to surpass most machine learning techniques in terms of performance [22]. However, the performance of any DNN relies on the selection of the values of its hyperparameters. DNN hyperparameters are defined as a set of critical parameters that control the architecture, behavior, and performance of that DNN in the underlying machine learning task. Indeed, there are two kinds of such hyperparameters: global parameters and layer-based parameters. The global parameters are those that define the general behavior of the DNN, such as the learning rate, number of epochs, batch size, number of layers, and the used optimizer. On the other hand, the values of the layer-based parameters depend on each layer in the DNN. Examples of layer-based parameters are, but not limited to, the type of layer, weight initialization method, activation function, and number of neurons.

The problem is that these hyperparameters vary from task to task, and they must be set before the training process. One familiar solution to overcome this problem is to find an expert who is conversant with the underlying machine learning task to tune the DNN hyperparameters precisely. Unfortunately, such an expert is not available in all cases. Another possible solution is to adjust these hyperparameters manually in a trial-and-error manner. This can be handled by searching the space of hyperparameters by executing either a grid search or a random search [23, 24]. A grid search is performed upon defined ranges of hyperparameters, where those ranges are identified previously depending on prior knowledge of the underlying task. After that, the user picks values of the hyperparameters from the predefined ranges consecutively and tests the performance of the DNN on the training set. When all possible combinations of hyperparameter values have been tested, the best combination is selected to configure the DNN and test it on the test set. A random search is similar to a grid search, but instead of picking hyperparameter values in a methodical manner, the user selects hyperparameter values from the predefined ranges randomly. In 2012, Snoek et al. proposed a hyperparameters selection method based on Bayesian optimization [25]. In this method, the user improves his knowledge of selecting hyperparameters by using the information gained from any given experiment to decide how to adjust the hyperparameters for the next experiment. Despite the good results obtained by grid, random, and Bayesian optimization searches in some cases, in general, the complexity and the large search space of the DNN hyperparameter values make such manual algorithms infeasible and the searching process too exhausting.
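As an illustration of the two manual strategies, the following minimal sketch enumerates a toy hyperparameter grid exhaustively and samples it randomly using scikit-learn helpers; the ranges shown are illustrative only, not the ones used later in this paper.

```python
# A minimal sketch contrasting grid search and random search over a toy
# hyperparameter space with scikit-learn's helpers.
from sklearn.model_selection import ParameterGrid, ParameterSampler

space = {"learning_rate": [0.01, 0.1, 0.5],
         "batch_size": [100, 500, 1000],
         "hidden_layers": [1, 5, 10]}

# Grid search enumerates every combination from the predefined ranges.
grid_candidates = list(ParameterGrid(space))            # 27 combinations

# Random search draws a fixed number of combinations at random.
random_candidates = list(ParameterSampler(space, n_iter=5, random_state=0))

for params in grid_candidates + random_candidates:
    pass  # train a DNN configured with `params` and record its score
```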

Evolutionary Algorithms (EAs) are metaheuristic algorithms which perform excellently in finding the global optima of a nonlinear function, especially when there are multiple local minima or maxima. EAs are considered very promising algorithms for solving the problem of DNN parameterization automatically. In the literature, many studies have been proposed recently aiming at using EAs to optimize DNN hyperparameters in order to gain as high an accuracy value as possible. The Genetic Algorithm (GA), which is one of the most famous EAs, has been used to optimize the network parameters, and the Taguchi method is applied between the crossover and mutation operators, including initial weights definition [26]. GAs are also used in the pretraining step prior to the supervised step based on a multiclass classification task [27]. Another approach using GA to reduce the training time has been presented in [28]. The GA is used to enhance Deep Neural Networks by evolving a neural network's weights [29]. An automated GA-based approach has been proposed in [30] that optimizes DNN hyperparameters for malware classification tasks. Moreover, Particle Swarm Optimization (PSO) is also one of the most well-known and popular EAs. Lorenzo et al. used PSO and proposed two approaches, the first sequential and the second parallel, to optimize the hyperparameters of any DNN [31, 32]. Then, Nalepa and Lorenzo formally proved the convergence abilities of the former two approaches and tested the sequential and parallel approaches separately on a single workstation and a cluster, respectively [33]. Finally, F. Ye proposed in 2017 an automatic PSO-based algorithm to select DNN hyperparameters in large-scale and high-dimensional data [34]. Thus, we decided to use PSO to enable us to select the hyperparameters of a DNN automatically. In Section 5.1, we will explain how to adapt this algorithm for the static classification experiments used in a masquerade detection scenario. Section 4.1 introduces a necessary and brief preface reviewing how the standard PSO works; the rest of this section presents our proposed PSO-based algorithm to optimize DNN hyperparameters.

4.1 Particle Swarm Optimization. Particle Swarm Optimization (PSO) is a metaheuristic algorithm for optimizing nonlinear functions in a continuous search space. It was proposed by Eberhart and Kennedy in 1995 [35]. PSO tries to mimic the social behavior of animals. The swarm concept is a set of many members, which are called particles. The number of particles in the swarm is an integer value denoted by S and called the swarm size. Every particle in the particular swarm has two vectors of length N, where N is the number of the problem's defined variables (dimensions). The first vector is called the position vector, denoted by P, which identifies the current position of that particle in the search space of the problem; each position vector can be considered a candidate solution of the problem. The second vector is called the velocity vector, denoted by V, which determines both the speed and direction of that particle in the search space of the problem at the next iteration. During the execution of PSO, another two vectors should be stored at every iteration.

The first is called the personal best vector, denoted by $P^i_{best}$, which indicates the best position of the $i$th particle in the swarm that has been explored so far. Each particle in the swarm has its own personal best vector, independent of the other particles, and it is updated at each iteration. The second vector is the global best vector, denoted by $G_{best}$, which indicates the best position that has been found over the swarm so far. There is a single global best vector for all particles in the swarm, and it is updated at every iteration. The personal best vector can be viewed as the cognitive knowledge of the particle, whereas the global best vector represents the social knowledge of the swarm. Mathematically, for each particle $i$ in the swarm $S$ at each iteration $t$, the velocity $V$ and position $P$ vectors are updated to the next iteration $t+1$ according to (1) and (2), respectively.

$$V^i_{t+1} = W V^i_t + C_1 r_1(t) \left(P^i_{best} - P^i_t\right) + C_2 r_2(t) \left(G_{best} - P^i_t\right) \quad (1)$$

$$P^i_{t+1} = P^i_t + V^i_{t+1} \quad (2)$$

$W$ is the inertia weight constant, which controls the impact of the velocity of the particle at the current iteration on the next iteration, so that the speed and direction of the particle are adjusted in order not to let the particle get outside the search space of the problem. Meanwhile, $C_1$ and $C_2$ are constants known as acceleration coefficients, and $r_1$ and $r_2$ are random values uniformly distributed in $[0, 1]$. At the beginning of every iteration, new values of $r_1$ and $r_2$ are computed randomly, and they are constants for all particles in the swarm at that iteration. The goal of using the constants $C_1$, $C_2$, $r_1$, and $r_2$ is to scale both the cognitive knowledge of the particle and the social knowledge of the swarm on the velocity changes, so that the new position vectors of all particles approach the optimal solution of the problem accordingly. Figure 2 depicts the flowchart of the standard PSO.

In brief, the standard PSO works as follows. First, the user enters some required inputs like the swarm size (S), the dimensions of the particles (N), the acceleration constants ($C_1$, $C_2$), the inertia weight constant (W), the fitness function (F) to score particle performance in the problem domain, and the maximum number of iterations ($t_{max}$). Next, PSO randomly initializes the position and velocity vectors with the specified dimensions for all particles in the swarm. Then, PSO initializes the personal best vector of each particle in the swarm with the specified dimensions and sets it to a very small value; furthermore, PSO initializes the global best vector of the swarm with the specified dimensions and sets it to a very small value. PSO computes the fitness score for each particle using the fitness function and updates the personal best vectors of all particles and the global best vector of the swarm. After that, PSO starts the first iteration by computing $r_1$ and $r_2$ randomly, and then it updates the velocity and position vectors of each particle according to (1) and (2), respectively. In addition, PSO computes again the fitness score of each particle according to the given fitness function and updates the personal best vector of each particle if the fitness score of that particle at this iteration is bigger than the fitness


Figure 2: The flowchart of the standard PSO.

score of the personal best vector of that particle ($F(P^i_t) > F(P^i_{best})$). Also, PSO updates the global best vector of the swarm if any of the fitness scores of the personal best vectors of the particles is bigger than the fitness score of the global best vector of the swarm ($F(P^i_{best}) > F(G_{best})$, $i=1$ to $S$). Then, PSO checks the stop criterion, and if it is satisfied, PSO outputs the global best vector as the optimal solution and terminates. Otherwise, PSO proceeds to the next iteration and repeats the same procedure described for the first iteration above until the stop criterion is reached.

The stop criterion is satisfied when either the training error is smaller than a predefined value (ε) or the maximum number of iterations is reached. Finally, PSO performs better than GA in terms of simplicity and generality [36]. PSO is simpler than GA because it contains only one operator and is easy to implement. The generality of PSO means that it does not need any modifications to be applied to any optimization problem; moreover, it converges faster to the optimal solution, which decreases the computations and saves resources.
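Before moving to the DNN-specific algorithm, the following is a minimal NumPy sketch of this standard PSO loop, built around update equations (1) and (2); the settings and the toy fitness function are illustrative, not the ones used later in this paper.

```python
# A minimal sketch of the standard PSO loop around equations (1) and (2).
import numpy as np

def pso(fitness, n_dims, swarm_size=20, w=0.9, c1=2.0, c2=2.0, t_max=30):
    rng = np.random.default_rng(0)
    P = rng.random((swarm_size, n_dims))          # position vectors
    V = rng.random((swarm_size, n_dims))          # velocity vectors
    P_best = P.copy()                             # personal best positions
    best_scores = np.array([fitness(p) for p in P])
    G_best = P_best[best_scores.argmax()].copy()  # global best position

    for t in range(t_max):
        r1, r2 = rng.random(), rng.random()       # shared per iteration
        # Equation (1): velocity update; equation (2): position update.
        V = w * V + c1 * r1 * (P_best - P) + c2 * r2 * (G_best - P)
        P = P + V
        scores = np.array([fitness(p) for p in P])
        improved = scores > best_scores
        P_best[improved] = P[improved]
        best_scores[improved] = scores[improved]
        G_best = P_best[best_scores.argmax()].copy()
    return G_best

# Example: maximize a simple concave function over three dimensions.
print(pso(lambda p: -np.sum((p - 0.5) ** 2), n_dims=3))
```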

4.2 DNN Hyperparameters Selection Using PSO. The selection of the hyperparameters of a DNN can be interpreted as an optimization task; hence, the main objective is to minimize the loss function $L(M, T)$, where $M$ is the DNN model and $T$ is the training set. To achieve this goal, we selected PSO as our optimization algorithm, which outputs the vector of optimized hyperparameters $H$ that minimizes the loss function $L$ after constructing the DNN model $M$, which is tuned by the hyperparameters $H$ and trained on the training set $T$. The fitness function of our PSO-based algorithm is a function $F^{*}: R^{N} \rightarrow R$ that maps a real-valued vector of hyperparameters of length $N$ to the real-valued accuracy of the trained DNN that is tuned by that hyperparameters vector and tested on the test set $Z$. In other words, our PSO-based algorithm finds, among all possible combinations of hyperparameters, the optimal hyperparameters vector that maximizes the accuracy of the trained DNN on the test set. Furthermore, to ensure the generality of our PSO-based algorithm, which means being independent of the DNN that will be optimized and easily adaptable to any classification task using a DNN, we allow the user to select which hyperparameters to use in his work. Therefore, the user is responsible for using our algorithm to define the number of hyperparameters

as well as the type and domain of each parameter. The domain of a parameter is the set of all possible values of that parameter. After that, our PSO-based algorithm uses a special built-in generator that depends on the number and domains of the defined parameters to initialize all the particles (hyperparameters vectors) in the swarm.

During the execution of the proposed algorithm, at each iteration, a validation process is involved to validate that the updated position and velocity vectors fit the predefined ranges of the parameters. Finally, in order to reduce computations and converge faster, two different stop conditions are checked simultaneously at the end of each iteration. The first occurs when the fitness score of the global best vector increases by less than a threshold ε, which is specified by the user; the aim of this condition is to guarantee that the global best vector cannot be improved further, even if the maximum number of iterations has not been reached yet. The second condition happens when the maximum number of iterations is carried out. When either the first or the second condition is satisfied, the proposed algorithm outputs the global best vector as the optimal solution $H$ and terminates the search process. Figure 3 shows the flowchart of our PSO-based DNN hyperparameters selection algorithm.
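A minimal Keras sketch of such a fitness function $F^{*}$ is given below; the way the vector h is decoded into concrete settings (layer count, neurons, dropout rate, epochs, batch size) is our own illustration and covers only a subset of the twelve hyperparameters used in this work.

```python
# A minimal sketch of the fitness function F*: it builds a DNN from a
# hyperparameters vector h, trains it on T, and returns accuracy on Z.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def fitness(h, x_train, y_train, x_test, y_test):
    # Illustrative decoding of the particle's position vector.
    n_layers, n_neurons = int(h[0]), int(h[1])
    dropout_rate, epochs, batch_size = h[2], int(h[3]), int(h[4])

    model = Sequential()
    model.add(Dense(n_neurons, activation="relu",
                    input_shape=(x_train.shape[1],)))
    for _ in range(n_layers - 1):
        model.add(Dense(n_neurons, activation="relu"))
        model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation="sigmoid"))   # normal vs masquerader

    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
              verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return accuracy
```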

4.3 Algorithm Steps

Inputs: number of hyperparameters (N), swarm size (S), acceleration constants (C1, C2), inertia constant (W), maximum value of velocity (Vmax), minimum value of velocity (Vmin), maximum number of iterations (tmax), evolution threshold (ε), training set (T), and test set (Z).

Output: the optimal solution H.

Procedure: Steps 1-6 are listed below, after Figure 3.


Figure 3: The flowchart of the proposed algorithm.

Step 1. For k ← 1 to N:
    Let h_k be the k-th hyperparameter.
    If the domain of h_k is continuous, then let B_k^low and B_k^up be the lower and upper bounds of h_k, and let the user enter the lower and upper bounds of the hyperparameter h_k. End of if.
    Else, let Y_k be the set of all possible values of h_k, and let the user enter all elements of the set Y_k. End of else.
End of for.

Step 2. Let F* be the fitness function, which constructs a DNN tuned with the given hyperparameters, then trains the DNN on T and tests it on Z. Finally, F* computes the accuracy of the DNN as output.

Step 3. Let G_best be the global best vector of the swarm, of length N. Let GS be the best fitness score of the swarm; GS ← −∞.

Step 4. For i ← 1 to S:
    Let P_i and V_i be the position and velocity vectors of the i-th particle, each of length N. Let P_best^i be the personal best vector of the i-th particle, of length N, and let PS_i be the fitness score of the personal best vector of the i-th particle.
    For j ← 1 to N:
        If the domain of h_j is continuous, then select h_j uniformly distributed: P_i[j] ← U(B_j^low, B_j^up). End of if.
        Else, select h_j randomly: P_i[j] ← RAND(Y_j). End of else.
        V_i[j] ← U(Vmin, Vmax).
    End of for.
    P_best^i ← P_i.
    Let FS_i be the fitness score of the i-th particle: FS_i ← F*(P_i); PS_i ← FS_i.
    If FS_i > GS, then G_best ← P_i; GS ← FS_i. End of if.
End of for.

Step 5. Let GS_prv be the previous best fitness score of the swarm; GS_prv ← GS. Let r_1 and r_2 be the random values of PSO, and let t be the current iteration.
For t ← 1 to tmax:
    r_1 ← U(0, 1); r_2 ← U(0, 1).
    For i ← 1 to S:
        Update V_i according to (1); update P_i according to (2).
        FS_i ← F*(P_i).
        If FS_i > PS_i, then P_best^i ← P_i; PS_i ← FS_i. End of if.
        If PS_i > GS, then G_best ← P_best^i; GS ← PS_i. End of if.
    End of for.
    If GS − GS_prv < ε, then go to Step 6. End of if.
    GS_prv ← GS.
End of for.

Step 6. Let H be the optimal hyperparameters vector: H ← G_best. Return H and terminate.


Table 4: PSO parameters recommended values or ranges.

| Parameter | Value/Range |
| S | [5, 20] |
| Vmin | 0 |
| Vmax | 1 |
| C1 | 2 |
| C2 | 2 |
| W | [0.4, 0.9] |
| tmax | [30, 50] |
| ε | 0.0001 |


4.4 PSO Parameters. Selection of the values of the PSO parameters (S, Vmax, Vmin, C1, C2, W, tmax, ε) is a very complex process. Fortunately, many empirical and theoretical studies have been published to solve this problem [37-40]. They introduced some recommended values of the PSO parameters, which can be adopted. Table 4 shows every PSO parameter and the corresponding recommended value or range. Thus, for those parameters which have recommended ranges, we can select a value for each parameter from its range randomly and fix it as a constant during the execution of PSO.

5 Experimental Setup and Models

This section explains the methodology of performing our empirical experiments as well as the description of the deep learning models which we used to detect masquerades. As mentioned in Section 3, we selected three UNIX command line-based datasets (SEA, Greenberg, PU). Each of these datasets is a collection of text files, in which each text file represents a user. The text file of each user in the particular dataset contains a set of UNIX commands issued by that user. This reflects the fact that these datasets do not contain any real masqueraders. However, to simulate masqueraders and to use these datasets in masquerade detection, special data configurations must be implemented prior to proceeding with our experiments. According to Section 3 and its subsections, each dataset has its own two different types of data configurations. Therefore, we obtained six data configurations, each of which will be observed separately, which yields in the end six independent experiments for each model. Finally, masquerade detection can be applied to these data configurations by following two different main approaches, namely, static classification and dynamic classification. The two subsequent subsections present the difference between them as well as which deep learning models are exploited for each one.

5.1 Static Classification Approach. In the static classification approach, the classification task is carried out using a dataset of samples which are represented by a set of static features [30]. These static features are defined according to the nature of the task where the classification will be applied. In addition, the dataset samples, also called observations, are collected manually by some experts working in the field of that classification task. After that, these samples are split into two independent sets, known as the training and test sets, to train and test the selected model, respectively. The static classification approach has pros and cons as well: although it provides a faster and easier solution, it requires a ready-to-use dataset with static features. Such a dataset might not be available for some complex classification tasks; hence, the attempt to create a dataset with static features will be a hard mission. In our work, we decided to utilize the existence of three famous UNIX command line-based datasets to implement six different data configurations. Each user in the particular data configuration has a specific number of blocks, which are represented by a set of static features. Indeed, these features are the user's UNIX commands, in charge of describing the behavior of that user and later helping the classifier to detect masquerades. We decided to use two well-known deep learning models, namely, Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN), to accomplish the static masquerade detection task on the implemented six data configurations.

5.1.1 Deep Neural Networks. In Section 4, we explained in detail the DNN structure and the problem of the selection of its hyperparameters. We also proposed a PSO-based algorithm to obtain the optimal hyperparameters vector that maximizes the accuracy of the DNN on the given training and test sets. In this subsection, we describe how we utilized the proposed PSO-based algorithm and the DNN in the static masquerade detection task using the six data configurations, which are SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched. Every data configuration has its own structure and a specific number of users, as described in Section 3. So we will have six separate DNN-experiments, and each experiment will be on one of the data configurations.

The methodology of our DNN-experiments consists of four consecutive stages, which are the initialization, optimization, results extraction, and finishing stages. The first stage is to initialize all required operating parameters as well as to prepare the particular data configuration's files, in which each file represents a user in that data configuration. The user file consists of the training set followed by the test set of that user. We set all PSO parameters for all DNN-experiments as follows: S=20, Vmin=0, Vmax=1, C1=C2=2, W=0.9, tmax=30, and ε=10^-4. Then, the last step in the initialization stage is to define the hyperparameters of the DNN and their domains. We used twelve different DNN hyperparameters (N=12); Table 5 shows each DNN hyperparameter and its corresponding defined domain. All the used hyperparameters are numerical, except that the Optimizer, Layer type, Initialization function, and Activation function hyperparameters are categorical. In this case, a list of all possible values is indexed to a sequence-numbered range from 1 to the length of that list. The Optimizer list includes the elements Adagrad, Nadam, Adam, Adamax,


Table 5: The used DNN hyperparameters and their domains.

| Hyperparameter | Domain | Description |
| Learning rate | [0.01, 0.9] | Continuous |
| Momentum | [0.1, 0.9] | Continuous |
| Decay | [0.001, 0.01] | Continuous |
| Dropout rate | [0.1, 0.9] | Continuous |
| Number of hidden layers | [1, 10] | Discrete with step=1 |
| Numbers of neurons of hidden layers | [1, 100] | Discrete with step=1 |
| Number of epochs | [5, 20] | Discrete with step=5 |
| Batch size | [100, 1000] | Discrete with step=50 |
| Optimizer | [1, 6] | Discrete with step=1 |
| Initialization function | [1, 8] | Discrete with step=1 |
| Layer type | [1, 2] | Discrete with step=1 |
| Activation function | [1, 8] | Discrete with step=1 |

RMSprop, and SGD. The Layer type list contains two elements, which are Dropout and Dense. The Initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the Activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
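The following sketch shows one possible way to encode the domains of Table 5 and to draw a random particle from them; the dictionary keys and the sampling helper are our own illustrative naming, not part of the proposed algorithm's implementation.

```python
# A minimal sketch of the Table 5 search space and of the built-in
# generator's role: drawing one random hyperparameters vector.
import random

DOMAINS = {
    "learning_rate": ("continuous", 0.01, 0.9),
    "momentum":      ("continuous", 0.1, 0.9),
    "decay":         ("continuous", 0.001, 0.01),
    "dropout_rate":  ("continuous", 0.1, 0.9),
    "hidden_layers": ("discrete", 1, 10, 1),
    "neurons":       ("discrete", 1, 100, 1),
    "epochs":        ("discrete", 5, 20, 5),
    "batch_size":    ("discrete", 100, 1000, 50),
    "optimizer":     ("categorical", ["Adagrad", "Nadam", "Adam",
                                      "Adamax", "RMSprop", "SGD"]),
    "init_function": ("categorical", ["zero", "normal", "lecun_uniform",
                                      "uniform", "glorot_uniform",
                                      "glorot_normal", "he_uniform",
                                      "he_normal"]),
    "layer_type":    ("categorical", ["Dropout", "Dense"]),
    "activation":    ("categorical", ["linear", "softmax", "relu",
                                      "sigmoid", "tanh", "hard_sigmoid",
                                      "softsign", "softplus"]),
}

def sample_particle():
    """Draw one random hyperparameters vector from the defined domains."""
    particle = {}
    for name, spec in DOMAINS.items():
        if spec[0] == "continuous":
            particle[name] = random.uniform(spec[1], spec[2])
        elif spec[0] == "discrete":
            particle[name] = random.randrange(spec[1], spec[2] + 1, spec[3])
        else:  # categorical values are indexed 1..len(list)
            particle[name] = random.randint(1, len(spec[1]))
    return particle
```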

The optimization and results extraction stages will be performed once for each user in the particular data configuration; that is, they will be repeated for each user U_i, i=1, 2, ..., M, where M is the number of users in the particular data configuration D. The optimization stage starts by splitting the data of the user U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user, respectively. The splitting process follows the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in [0, 1]. After that, we supply these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector H_i of the i-th user. In addition, we save a copy of the H_i values in a database in order to save time and use them again in the RNN-experiment of that particular data configuration D, as will be presented in Section 5.1.2. The results extraction stage takes place by constructing the DNN that is tuned by H_i, training the DNN on T_i, and testing the DNN on Z_i. The values of the classification outcomes True Positive (TP_i), False Positive (FP_i), True Negative (TN_i), and False Negative (FN_i) for the i-th user in the particular data configuration D are extracted and saved for further processing later.

Then, the next user is observed, and the same procedure of the optimization and results extraction stages is performed, until the last user in the particular data configuration D is reached. Finally, when all users in the particular data configuration are completed, the last stage (the finishing stage) is executed. The finishing stage computes the summation of all obtained TPs of all users in the particular data configuration D, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively.

$$TP = \sum_{i=1}^{M} TP_i \quad (3)$$

$$FP = \sum_{i=1}^{M} FP_i \quad (4)$$

$$TN = \sum_{i=1}^{M} TN_i \quad (5)$$

$$FN = \sum_{i=1}^{M} FN_i \quad (6)$$

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration D. The former outcomes will be used to compute ten well-known evaluation metrics to assess the performance of the DNN on the particular data configuration D, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.
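A minimal sketch of these two stages is given below: per-user outcomes are obtained from a confusion matrix and then accumulated into the overall TP, FP, TN, and FN of (3)-(6); the `users` list of per-user predictions is a hypothetical placeholder.

```python
# A minimal sketch of the results extraction and finishing stages.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
# Hypothetical per-user (y_true, y_pred) pairs standing in for real results.
users = [(rng.integers(0, 2, 100), rng.integers(0, 2, 100)) for _ in range(3)]

TP = FP = TN = FN = 0
for y_true, y_pred in users:
    tn_i, fp_i, fn_i, tp_i = confusion_matrix(y_true, y_pred,
                                              labels=[0, 1]).ravel()
    TP, FP, TN, FN = TP + tp_i, FP + fp_i, TN + tn_i, FN + fn_i

print(TP, FP, TN, FN)  # overall outcomes as in equations (3)-(6)
```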

5.1.2 Recurrent Neural Networks. The Recurrent Neural Network is a special type of the traditional feed-forward Artificial Neural Network. Unlike the traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to other neurons of the same hidden layer. Therefore, the output of an RNN's hidden layer at any time step (t) is computed from the current inputs and the output of the hidden layer at the previous time step (t-1). In the RNN, these directed cycles allow information to circulate in the network and make the hidden layers the storage unit of the whole network [41]. The important characteristics of the RNN are its capability to have memory and to generate periodical sequences. Despite that, the conventional RNN structure described above has a serious problem.

Despite that the conventional RNN structure which isdescribed above has a serious problem especially when the


Figure 4: The flowchart of the DNN-experiments.

Figure 5: The structure of an LSTM cell [6].

The problem appears especially when the RNN is trained using the back-propagation technique and is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops; the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem of the conventional RNN limited its use to short-term memory tasks only. To solve this problem, a new architecture of the RNN was proposed by Hochreiter and Schmidhuber [43], known as Long Short-Term Memory (LSTM). LSTM uses a new structure called a memory cell, which is composed of four parts: an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using the three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter the memory cell or block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell to alter the current state of the memory cell or prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Besides overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].

In this study, we used the LSTM-RNN model to perform a static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them will be used in a separate experiment. So we will have six separate LSTM-RNN-experiments; each experiment will be on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepare all the given data configuration's files by converting all blocks from text to numerical values and then normalizing them in [0, 1]. Next, for each user U_i in D, where i=1, 2, ..., M and M is the number of users in D, we do the following steps. We split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user in D, respectively; the splitting process follows the structure of the particular data configuration, which is described in Section 3. After that, we retrieve the stored optimized hyperparameters vector of the i-th user (H_i) from the database created in the previous DNN-experiments. Then, we construct the RNN model that is tuned by H_i; in order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced with an LSTM memory cell. The constructed LSTM-RNN model is trained on T_i and then tested on Z_i. After the test process finishes, we extract and save the outcomes TP_i, FP_i, TN_i, and FN_i of the i-th user in D. Then we proceed to the next user in D and do the same previous steps, until the last user in D is reached. After all users in D are completed, we compute the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.
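A minimal Keras sketch of this replacement is shown below; the layer sizes and settings are illustrative defaults rather than a user's optimized vector H_i.

```python
# A minimal sketch of building an LSTM-RNN in Keras: hidden neurons are
# replaced by LSTM memory cells, followed by a binary output layer.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_lstm_rnn(timesteps, n_features, n_cells=64, n_layers=2):
    model = Sequential()
    model.add(LSTM(n_cells, return_sequences=(n_layers > 1),
                   input_shape=(timesteps, n_features)))
    for k in range(1, n_layers):
        # Only the last recurrent layer collapses the sequence.
        model.add(LSTM(n_cells, return_sequences=(k < n_layers - 1)))
    model.add(Dense(1, activation="sigmoid"))  # normal vs masquerader
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: blocks of 100 commands, each encoded as one numeric feature.
model = build_lstm_rnn(timesteps=100, n_features=1)
```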

[Figure 6: The flowchart of the LSTM-RNN-experiments: input data configuration D and M; prepare the files of D; for i = 1..M: split the data of U_i into T_i and Z_i, retrieve H_i from the database, construct the LSTM-RNN model tuned by H_i, train it on T_i, test it on Z_i, and obtain and save TP_i, FP_i, TN_i, and FN_i; finally, compute and output TP, FP, TN, and FN for D.]

5.2. Dynamic Classification Approach. In contrast to the static classification approach, the dynamic classification approach does not need a ready-to-use dataset with static features [30]. It deals directly with raw data sources, such as text, image, video, sound, and signal files, and extracts features from them dynamically. The models that use this approach try to learn and represent features in an unsupervised manner. Then these models train themselves using the extracted features so as to be able to classify unseen data. Deep learning models fit this approach very well because their main strengths are automatic feature extraction and self-learning. Besides that, dynamic classification models overcome the problem of the lack of datasets and perform more efficiently than static classification models. Despite these advantages, the dynamic classification approach also has drawbacks. Dynamic classification models are slower and take a long time to train compared with static classification models, due to their complex deep structure as well as the huge amount of computations required to execute them. Furthermore, dynamic classification models require a very large number of input samples to gain high accuracy values.

In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection on these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of two classes, either a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Furthermore, sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. According to the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of research has been accomplished in the literature in the field of text classification using deep learning models. It was started by LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN) known as the LeNet family and used it in text classification efficiently [44]. Various studies were then published to introduce text classification algorithms as well as the factors that impact performance [45-47]. In the study [48], the CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchical approach called Hierarchical Deep Learning for Text classification (HDLTex) is proposed, and three deep structures, which are DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced in [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification within a multitask learning framework [53]. The study [54] proposed a new model called the hierarchical attention network for document classification, which is tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the most used deep learning model for text classification tasks, so we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model that is biologically inspired by the animal visual cortex. The CNN can be considered a special type of the traditional feed-forward Artificial Neural Network. The major difference between ANN and CNN is that, instead of the fully connected architecture of ANN, the individual neurons in CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data is entered into the CNN. The first convolutional layer in the CNN consists of individual neurons that are each connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters; each filter is applied to the specified subset of the preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input. The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of the preceding layer's output. There are many pooling nonlinear functions that can be used, but among them max-pooling is the most used; it selects the maximum value in the given pooling window. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked convolutional layer and max-pooling layer pairs to extract features from the entire input and then map these features to the next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to hidden layers in the DNN. This means that neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of CNN proposed in the literature, but the LeNet structure, which was proposed by LeCun et al. [44], is the most common approach used in many applications of computer vision and text classification.

Regarding its stability and high efficiency in text classification, we selected the CNN model proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place: the used model encodes all letters in the input text file using a one-hot representation over a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers are of the dropout type, with a dropout probability equal to 0.5; in addition, they have a Sigmoid nonlinear activation function and the same size of 2048 neurons each. The output layer is a dense layer with a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate = 0.01, epochs = 30, and batch size = 64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].

[Figure 7: The architecture of the used CNN model: the user's command text files are quantized in the input layer, passed through six convolution/max-pooling pairs (C1/P1 to C6/P6), then through two fully connected dropout layers of 2048 sigmoid neurons each, and finally through a two-neuron softmax output dense layer that emits 0 (normal) or 1 (masquerader).]
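A Keras sketch of this topology is shown below. The overall structure, the activations, the dropout probability, and the training parameters follow the description above; the sequence length, the filter count, the kernel sizes, and the pool size are illustrative assumptions, since the text does not specify them.

```python
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras.optimizers import SGD

ALPHABET_SIZE = 70   # one-hot alphabet used in the quantization step
SEQ_LEN = 1014       # assumed input length (not stated in the text)

model = Sequential()
# Six convolution/max-pooling pairs; all convolutions use ReLU.
model.add(Conv1D(256, 7, padding='same', activation='relu',
                 input_shape=(SEQ_LEN, ALPHABET_SIZE)))
model.add(MaxPooling1D(2))
for _ in range(5):
    model.add(Conv1D(256, 3, padding='same', activation='relu'))
    model.add(MaxPooling1D(2))
model.add(Flatten())
# Two fully connected dropout layers of 2048 sigmoid neurons each.
model.add(Dense(2048, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2048, activation='sigmoid'))
model.add(Dropout(0.5))
# Output dense layer: two softmax neurons (normal vs. masquerader).
model.add(Dense(2, activation='softmax'))
model.compile(optimizer=SGD(lr=0.01), loss='categorical_crossentropy',
              metrics=['accuracy'])
# Training as described in the text:
# model.fit(X_train, y_train, epochs=30, batch_size=64)
```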

In our work, we used a CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate CNN-experiments, each on one of the data configurations. The methodology of all of these experiments is the same and as follows: for the given data configuration D, we first prepared all of its text files such that each file represents the training and test sets of a user in D. Next, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps. We split the data of U_i into two independent sets T_i and Z_i, which are the training and test sets of the ith user in D, respectively; the splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we also moved each block in the training and test sets of the user U_i to a separate text file. This means that each of the training and test sets of the user U_i consists of a specified number of text files, where each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then we proceed to the next user in D and repeat the same steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments; a sketch of the input quantization step is given below.
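The quantization step that turns one block file into the CNN input can be sketched as follows; the 70-character alphabet is the one used by Zhang et al. [50] (the dash occurs twice there, so the lookup holds 69 distinct keys), and `seq_len` is the same assumed length as in the model sketch above.

```python
import numpy as np

# 70-character alphabet from Zhang et al. [50]: 26 letters, 10 digits,
# punctuation/special characters, and the newline character.
ALPHABET = ("abcdefghijklmnopqrstuvwxyz0123456789"
            "-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n")
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, seq_len=1014):
    """One-hot encode a block of UNIX commands: unknown characters map to
    all-zero rows; the text is truncated or zero-padded to seq_len."""
    x = np.zeros((seq_len, len(ALPHABET)), dtype=np.float32)
    for i, c in enumerate(text.lower()[:seq_len]):
        j = CHAR_INDEX.get(c)
        if j is not None:
            x[i, j] = 1.0
    return x
```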

6. Results and Discussion

We carried out three major empirical experiments, which are DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.

[Figure 8: The flowchart of the CNN-experiments: input data configuration D and M; prepare the text files of D; for i = 1..M: split the data of U_i into T_i and Z_i text sets, move each block in T_i and Z_i to a separate text file, construct the used CNN model, train it on T_i, test it on Z_i, and obtain and save TP_i, FP_i, TN_i, and FN_i; finally, compute and output TP, FP, TN, and FN for D.]

Table 6: The confusion matrix of the masquerade detection outcomes.

| Actual Class | Predicted: Normal User | Predicted: Masquerader |
|---|---|---|
| Normal User | TN | FP |
| Masquerader | FN | TP |

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with a TensorFlow 1.6 [61, 62] backend over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, the Masquerade Detection Measures are metrics usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The definitions of the used evaluation metrics and their corresponding equations are as follows.

(i) Accuracy shows the rate of true detection over the whole test set:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

(ii) Precision shows the rate of correctly classified masqueraders among all blocks in the test set that are classified as masqueraders:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (8)$$

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

$$\text{Recall} = \frac{TP}{TP + FN} \quad (9)$$

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

$$F1\text{-}Score = \frac{2}{\frac{1}{P} + \frac{1}{R}} \quad (10)$$

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

$$\text{Hit Rate} = \frac{TP}{TP + FN} \quad (11)$$

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 - Hit); i.e., it shows the rate of masquerader blocks that are misclassified as a normal user among all masquerader blocks in the test set. It is also called Misses or False Negative Rate:

$$\text{Miss Rate} = \frac{FN}{FN + TP} \quad (12)$$


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

$$\text{False Alarm Rate} = \frac{FP}{FP + TN} \quad (13)$$

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

$$\text{Cost} = MR + 6 \times FAR \quad (14)$$

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (the base rate) into account when solving probability problems. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

$$\text{BDR} = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^{*}) \times P(A \mid I^{*})} \quad (15)$$

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.
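To illustrate the base-rate effect with purely hypothetical numbers: if masqueraders form 1% of the test blocks (P(I) = 0.01) and a detector achieves a Hit Rate of 0.90 with a FAR of 0.05, then

$$\text{BDR} = \frac{0.01 \times 0.90}{0.01 \times 0.90 + 0.99 \times 0.05} \approx 0.154,$$

i.e., fewer than one in six alarms would correspond to a real masquerader despite the seemingly strong detector.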

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior really indicates a normal user [65]. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

$$\text{BTNR} = P(I^{*} \mid A^{*}) = \frac{P(I^{*}) \times P(A^{*} \mid I^{*})}{P(I^{*}) \times P(A^{*} \mid I^{*}) + P(I) \times P(A^{*} \mid I)} \quad (16)$$

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 - FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold, where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

$$g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \quad (17)$$

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of -1 to 1, where -1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

$$\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \quad (18)$$
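All twelve metrics can be computed directly from the four overall outcomes; the following is a compact sketch of (7)-(18) (no guards against zero denominators are included):

```python
import math

def evaluation_metrics(TP, FP, TN, FN):
    """Compute the twelve metrics of Eqs. (7)-(18) from the outcomes."""
    hit  = TP / (TP + FN)                     # Recall / Hit Rate, Eqs. (9), (11)
    miss = FN / (FN + TP)                     # Miss Rate, Eq. (12)
    far  = FP / (FP + TN)                     # False Alarm Rate, Eq. (13)
    p_i  = (TP + FN) / (TP + FP + TN + FN)    # base rate of masquerader blocks
    return {
        'Accuracy':  (TP + TN) / (TP + TN + FP + FN),              # Eq. (7)
        'Precision': TP / (TP + FP),                               # Eq. (8)
        'Recall':    hit,
        'F1-Score':  2 * TP / (2 * TP + FP + FN),                  # Eq. (10)
        'Hit':       hit,
        'Miss':      miss,
        'FAR':       far,
        'Cost':      miss + 6 * far,                               # Eq. (14)
        'BDR':  p_i * hit / (p_i * hit + (1 - p_i) * far),         # Eq. (15)
        'BTNR': ((1 - p_i) * (1 - far)
                 / ((1 - p_i) * (1 - far) + p_i * miss)),          # Eq. (16)
        'g-mean': math.sqrt(TP * TN / ((TP + FN) * (TN + FP))),    # Eq. (17)
        'MCC': ((TP * TN - FP * FN)
                / math.sqrt((TP + FN) * (TP + FP)
                            * (TN + FP) * (TN + FN))),             # Eq. (18)
    }
```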

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model in detecting masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, together with lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1, as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. The rows labeled DNN and LSTM-RNN in Table 7 show the results of static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show the results of dynamic masquerade detection using the CNN model. Furthermore, bold rows represent the best results within the same data configuration, whereas underlined values are the best across all data configurations.

Table 7: The results of our experiments (all values in %).

| Dataset | Data Configuration | Model | Accuracy | Precision | Recall | F1-Score | Hit | Miss | FAR | Cost | BDR | BTNR | g-mean | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SEA | SEA | DNN | 98.08 | 76.26 | 84.85 | 80.33 | 84.85 | 15.15 | 1.28 | 22.83 | 76.25 | 99.26 | 91.52 | 79.45 |
| SEA | SEA | LSTM-RNN | 98.52 | 82.30 | 86.58 | 84.39 | 86.58 | 13.42 | 0.90 | 18.83 | 82.33 | 99.34 | 92.63 | 83.64 |
| SEA | SEA | **CNN** | 98.84 | 87.77 | 87.01 | 87.39 | 87.01 | 12.99 | 0.59 | 16.51 | 87.72 | 99.37 | 93 | 86.78 |
| SEA | SEA 1v49 | DNN | 96.54 | 99.98 | 96.43 | 98.17 | 96.43 | 3.57 | 0.48 | 6.47 | 99.98 | 52.04 | 97.96 | 70.64 |
| SEA | SEA 1v49 | LSTM-RNN | 97.86 | 99.98 | 97.79 | 98.87 | 97.79 | 2.21 | 0.38 | 4.48 | 99.98 | 63.70 | 98.7 | 78.74 |
| SEA | SEA 1v49 | **CNN** | 98.78 | 99.99 | 98.74 | 99.36 | 98.74 | 1.26 | 0.19 | 2.40 | 99.99 | 75.51 | 99.27 | 86.22 |
| Greenberg | Greenberg Truncated | DNN | 93.97 | 92.23 | 80.67 | 86.06 | 80.67 | 19.33 | 2.04 | 31.57 | 92.22 | 94.41 | 88.89 | 82.53 |
| Greenberg | Greenberg Truncated | LSTM-RNN | 94.72 | 94.88 | 81.53 | 87.70 | 81.53 | 18.47 | 1.32 | 26.39 | 94.87 | 94.68 | 89.7 | 84.76 |
| Greenberg | Greenberg Truncated | **CNN** | 95.43 | 96.16 | 83.53 | 89.40 | 83.53 | 16.47 | 1.0 | 22.47 | 96.16 | 95.24 | 90.94 | 86.86 |
| Greenberg | Greenberg Enriched | DNN | 97.57 | 96.92 | 92.40 | 94.61 | 92.40 | 7.60 | 0.88 | 12.88 | 96.92 | 97.75 | 95.7 | 93.08 |
| Greenberg | Greenberg Enriched | LSTM-RNN | 97.98 | 97.57 | 93.60 | 95.54 | 93.60 | 6.40 | 0.70 | 10.60 | 97.56 | 98.10 | 96.41 | 94.28 |
| Greenberg | Greenberg Enriched | **CNN** | 98.60 | 98.55 | 95.33 | 96.92 | 95.33 | 4.67 | 0.42 | 7.19 | 98.55 | 98.61 | 97.43 | 96.03 |
| PU | PU Truncated | DNN | 81.0 | 99.59 | 78.61 | 87.86 | 78.61 | 21.39 | 2.25 | 34.89 | 99.59 | 39.49 | 87.66 | 54.63 |
| PU | PU Truncated | LSTM-RNN | 82.19 | 99.69 | 79.89 | 88.70 | 79.89 | 20.11 | 1.75 | 30.61 | 99.68 | 41.10 | 88.6 | 56.46 |
| PU | PU Truncated | **CNN** | 83.75 | 99.74 | 81.64 | 89.79 | 81.64 | 18.36 | 1.50 | 27.36 | 99.73 | 43.38 | 89.68 | 58.79 |
| PU | PU Enriched | DNN | 90.44 | 99.84 | 89.21 | 94.23 | 89.21 | 10.79 | 1.0 | 16.79 | 99.84 | 56.72 | 93.98 | 70.64 |
| PU | PU Enriched | LSTM-RNN | 91.31 | 99.88 | 90.18 | 94.78 | 90.18 | 9.82 | 0.75 | 14.32 | 99.88 | 59.08 | 94.61 | 72.61 |
| PU | PU Enriched | **CNN** | 93.75 | 99.92 | 92.93 | 96.30 | 92.93 | 7.07 | 0.50 | 10.07 | 99.92 | 66.78 | 96.16 | 78.52 |

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of DNN hyperparameters so as to maximize accuracy, which means that the sum of the TP and TN outcomes is increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN definitely increases the value of Hit and decreases the value of FAR. Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than the corresponding values of the SEA data configuration, the Hit values are dramatically increased in SEA 1v49 for all models, by 10-14% over those in the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49, compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models, in terms of all used evaluation metrics, over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including the command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened in the PU dataset, where its PU Enriched data configuration has better results, for all models, than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, for all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Because of that, enriched data configurations help models build a user's behavior profile more accurately than truncated data configurations. For all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason also explains why few previous works used the PU dataset in masquerade detection. Overall, the data configurations can be sorted, for all used models, from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitations, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset, as well as across all datasets.

By taking a close look at Figures 9 and 10, we can notice the stability of the deep learning models, in the sense that they enhance masquerade detection from one data configuration to another in a consistent pattern.

[Figure 9: Evaluation metrics comparison between the models on the data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient. Each panel compares DNN, LSTM-RNN, and CNN across SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.]

To explain this, we discuss the obtained results from the perspective of the static and dynamic masquerade detection techniques. We used the DNN and LSTM-RNN models to perform a static masquerade detection task on data configurations with static numeric features. Both the DNN and the LSTM-RNN are supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Because of this, our DNN and LSTM-RNN models output masquerade detection outcomes that are as good as they can reach for every user in the particular data configuration, and accordingly their performance is enhanced significantly on that configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. In any case, LSTM-RNN performed better than DNN in terms of all used evaluation metrics on all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics, which do not exist in the DNN, enable the LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy, and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is the command text files of each user in the particular data configuration. The obtained results show clearly that the CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then passed to its fully connected layers, which train themselves to build the user's normal profile; this profile is used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations: Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, the dynamic masquerade detection technique is better than the static one. This gives the impression that dynamic masquerade detection is the best choice for UNIX command line-based datasets, due to the fact that these datasets are originally textual, and converting them to static numeric datasets may lose a lot of useful information. Despite that, the DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models achieved high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller.

[Figure 10: Evaluation metrics comparison for the average performance of the models on the datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient. Each panel compares DNN, LSTM-RNN, and CNN averaged over the SEA, Greenberg, and PU datasets and over all datasets.]

Table 8: The results of the statistical tests.

| Measurement | Friedman FS | Friedman FC | Wilcoxon p1: W | p1: P-value | p2: W | p2: P-value | p3: W | p3: P-value |
|---|---|---|---|---|---|---|---|---|
| TP | 12 | 7 | 0 | 0.0025 | 0 | 0.0025 | 0 | 0.0025 |
| FP | 12 | 7 | 0 | 0.0025 | 0 | 0.0025 | 0 | 0.0025 |
| TN | 12 | 7 | 0 | 0.0025 | 0 | 0.0025 | 0 | 0.0025 |
| FN | 12 | 7 | 0 | 0.0025 | 0 | 0.0025 | 0 | 0.0025 |

Although all the used data configurations are imbalanced, all the used deep learning models achieved high g-mean percentages for all data configurations. The same happened with the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

In order to inspect the results in Table 7 further, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume that the data comes from a particular distribution. In our case, we have three repeated treatments (k = 3), one for each of the used deep learning models, and six subjects (N = 6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k = 2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N = 6) in every treatment and three paired groups, namely, p1 = (DNN, LSTM-RNN), p2 = (DNN, CNN), and p3 = (LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α = 0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements; a SciPy sketch of both tests is given below.
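Both tests are available in SciPy; a minimal sketch with placeholder scores (not the paper's measurements) is:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Per-data-configuration scores of one measurement (six placeholder values
# per model; SciPy reports the chi-square form of the Friedman statistic).
dnn  = [1, 2, 3, 4, 5, 6]
lstm = [2, 3, 4, 5, 6, 7]
cnn  = [3, 4, 5, 6, 7, 8]

fs, p  = friedmanchisquare(dnn, lstm, cnn)  # k = 3 treatments, N = 6 subjects
w1, p1 = wilcoxon(dnn, lstm)                # paired group p1 = (DNN, LSTM-RNN)
w2, p2 = wilcoxon(dnn, cnn)                 # paired group p2 = (DNN, CNN)
w3, p3 = wilcoxon(lstm, cnn)                # paired group p3 = (LSTM-RNN, CNN)
```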

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases, because FS > FC. This means that the scores of the used deep learning models differ for each measurement. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, the Critical Difference (CD) value equals 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test, because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we have statistically significant evidence that the medians of every paired group are different.

[Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations; CNN ranks first, LSTM-RNN second, and DNN third.]

Finally, the reason the four measurements give the same test results is that the models, in the order (CNN, LSTM-RNN, DNN), have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.
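For such a diagram, the critical difference is commonly obtained from the Nemenyi post-hoc test [72]; assuming the usual critical value q ≈ 2.343 for k = 3 at α = 0.05, this reproduces the reported CD:

$$CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}} = 2.343\sqrt{\frac{3 \times 4}{6 \times 6}} \approx 1.3533.$$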

Figures 12(a)-12(e) show a comparison between the performance of traditional machine learning models and the used deep learning models, in terms of Hit and FAR percentages, for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN, as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased Hit percentages by 2-10% and decreased FAR percentages by 1-10% compared with the traditional machine learning models in most cases.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, meaning that 50% of performance is achieved. The top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.

[Figure 12: Models performance comparison (Hit and FAR percentages) for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched. Each panel compares the best traditional machine learning models from the literature (Naive Bayes, Conditional Naive Bayes, SVM, HMM, or a tree-based method, depending on the configuration) with DNN, LSTM-RNN, and CNN.]

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, have effective masquerade detection performance over all data configurations. In any case, all three deep learning models achieve a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing quantitatively between various ROC curves [73]. The AUC value of a ROC curve should be between 0 and 1; the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13.

We can clearly notice that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
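For reference, the per-model ROC curve and its AUC can be computed with scikit-learn from per-block masquerader scores; the labels and scores below are placeholders:

```python
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1]                 # 0 = normal, 1 = masquerader
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]    # predicted masquerader scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FAR vs. Hit per threshold
print(auc(fpr, tpr))                               # area under the ROC curve
```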

Table 9: AUC values of the ROC curves of the used models.

| Model | AUC |
|---|---|
| DNN | 0.9246 |
| LSTM-RNN | 0.9385 |
| CNN | 0.9617 |

[Figure 13: ROC curves (True Positive Rate versus False Positive Rate) of the average performance of the used models over all data configurations, for CNN, LSTM-RNN, and DNN.]

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Even though various research studies have focused on masquerade detection for more than a decade, deep studies in that field utilizing deep learning models are seldom found. In this paper, we presented an extensive empirical study of masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most used in the literature. In addition, we implemented six different data configurations from these datasets. The masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, while the dynamic approach is performed using the CNN model, which extracts features from a user's command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models and analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models achieved high performance

in masquerade detection on the used datasets and outperformed all traditional machine learning methods in the literature in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than the static approach. The results analyses proved the effectiveness of all used models in masquerade detection, in that they increased Accuracy and Hit percentages and decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon request from its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.
[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.
[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion detection: detecting masquerade attacks using UNIX command lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.
[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.
[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, China, November 2016.
[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long short term memory recurrent neural network classifier for intrusion detection," in Proceedings of the 3rd International Conference on Platform Technology and Service (PlatCon 2016), Republic of Korea, February 2016.
[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.
[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN 2002), pp. 219-228, USA, June 2002.
[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.
[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.
[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.
[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.
[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security (ICCIAS 2006), pp. 625-628, China, October 2006.
[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.
[16] S. Greenberg, "Using Unix: collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.
[17] R. A. Maxion, "Masquerade detection using enriched command lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.
[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM 2007), pp. 3676-3679, China, September 2007.
[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.
[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.
[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.
[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based acceleration of deep neural networks using high level method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), pp. 824-827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for hyper-parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.
[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.
[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS 2012), pp. 2951-2959, USA, December 2012.
[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the multilayer feed-forward artificial neural networks architecture and training parameters using genetic algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.
[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep multi-task learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2016), pp. 141-146, Belgium, April 2016.
[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: a new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD 2016), pp. 69-74, China, August 2016.
[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference (GECCO 2014), pp. 1451-1452, Canada, July 2014.
[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving deep neural networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation (CEC 2017), pp. 1659-1666, Spain, June 2017.
[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference (GECCO 2017), pp. 481-488, New York, NY, USA, July 2017.
[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion (GECCO 2017), pp. 1864-1871, New York, NY, USA, July 2017.
[33] J. Nalepa and P. R. Lorenzo, "Convergence analysis of PSO for hyper-parameter selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.
[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.
[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.
[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.
[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 IEEE Congress on Evolutionary Computation (CEC 99), vol. 3, pp. 1945-1950, 1999.
[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.
[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.
[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.
[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.
[45] X. Zhang and Y. LeCun, "Text understanding from scratch," https://arxiv.org/abs/1502.01710v5.
[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.
[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.
[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.
[49] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.
[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.
[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: hierarchical deep learning for text classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.
[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent convolutional neural networks for text classification," AAAI, vol. 333, pp. 2267-2273, 2015.
[53] P. Liu, X. Qiu, and X. Huang, "Recurrent neural network for text classification with multi-task learning," https://arxiv.org/abs/1605.05101v1.
[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.
[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.
[56] S. Albelwi and A. Mahmood, "A framework for designing the architectures of deep convolutional neural networks," Entropy, vol. 19, no. 6, p. 242, 2017.
[57] "Python," https://www.python.org/.
[58] "NumPy," http://www.numpy.org/.
[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.
[60] "Keras," https://keras.io/.
[61] M. Abadi, A. Agarwal, P. Barham, et al., "TensorFlow: large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.
[62] "TensorFlow," https://www.tensorflow.org/.
[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.
[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.
[65] S. Axelsson, "The base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.
[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in Proceedings of the International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.
[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.
[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.
[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.
[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.
[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.
[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.
[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.


space of the DNN hyperparameters values make such manual algorithms infeasible and the searching process too exhausting.

Evolutionary Algorithms (EAs) are metaheuristic algorithms which perform excellently at finding the global optima of a nonlinear function, especially when there are multiple local minima or maxima. EAs are considered very promising algorithms for solving the problem of DNN parameterization automatically. In the literature, many studies have been proposed recently aiming at using EAs to optimize DNN hyperparameters in order to gain as high an accuracy value as possible. The Genetic Algorithm (GA), which is one of the most famous EAs, has been used to optimize the network parameters, and the Taguchi method is applied between the crossover and mutation operators, including initial weights definition [26]. GAs are also used in the pretraining step prior to the supervised step based on a multiclass classification task [27]. Another approach using GA to reduce the training time has been presented in [28]. The GA is used to enhance Deep Neural Networks by evolving a neural network's weights [29]. An automated GA-based approach has been proposed in [30] that optimized DNN hyperparameters for malware classification tasks. Moreover, Particle Swarm Optimization (PSO) is also one of the most well-known and popular EAs. Lorenzo et al. used PSO and proposed two approaches, the first sequential and the second parallel, to optimize the hyperparameters of any DNN [31, 32]. Then, Nalepa and Lorenzo formally proved the convergence abilities of the former two approaches and tested the sequential and parallel approaches separately on a single workstation and a cluster, respectively [33]. Finally, F. Ye proposed in 2017 an automatic PSO-based algorithm to select DNN hyperparameters for large-scale and high-dimensional data [34]. Thus, we decided to use PSO to enable us to select hyperparameters for the DNN automatically. Then, in Section 5.1, we will explain how to adapt this algorithm for the static classification experiments used in a masquerade detection scenario. Section 4.1 introduces a necessary and brief preface reviewing how standard PSO works. Then, the rest of this section presents our proposed PSO-based algorithm to optimize DNN hyperparameters.

4.1. Particle Swarm Optimization. Particle Swarm Optimization (PSO) is a metaheuristic algorithm for optimizing nonlinear functions in continuous search space. It was proposed by Eberhart and Kennedy in 1995 [35]. PSO tries to mimic the social behavior of animals. The swarm concept is a set of many members, which are called particles. The number of particles in the swarm is an integer value denoted by S and called the swarm size. Every particle in the particular swarm has two vectors of length N, where N is the number of the problem's defined variables (dimensions). The first vector is called the position vector, denoted by P, which identifies the current position of that particle in the search space of the problem. Each position vector can be considered as a candidate solution of the problem. The second vector is called the velocity vector, denoted by V, which determines both speed and direction of that particle in the search space of the problem at the next iteration. During the execution of PSO, another two vectors should be stored at every iteration.

The first is called the personal best vector, denoted by P_i^best, which indicates the best position of the i-th particle in the swarm that has been explored so far. Each particle in the swarm has its own personal best vector, independent of the other particles, and it is updated at each iteration. The second vector is the global best vector, denoted by G^best, which indicates the best position that has been found over the swarm so far. There is a single global best vector for all particles in the swarm, and it is updated at every iteration. The personal best vector can be viewed as the cognitive knowledge of the particle, whereas the global best vector represents the social knowledge of the swarm. Mathematically, for each particle i in the swarm S at each iteration t, the velocity V and position P vectors are updated to the next iteration t+1 according to (1) and (2), respectively:

$$V_i^{t+1} = W V_i^t + C_1 r_1(t)\,(P_i^{best} - P_i^t) + C_2 r_2(t)\,(G^{best} - P_i^t) \quad (1)$$

$$P_i^{t+1} = P_i^t + V_i^{t+1} \quad (2)$$

W is the inertia weight constant, which controls the impact of the velocity of the particle at the current iteration on the next iteration, so the speed and direction of the particle are adjusted in order not to let the particle get outside the search space of the problem. Meanwhile, C_1 and C_2 are constants known as acceleration coefficients, and r_1 and r_2 are random values uniformly distributed in [0, 1]. At the beginning of every iteration, new values of r_1 and r_2 are computed randomly, and they are constants for all particles in the swarm at that iteration. The goal of using the C_1, C_2, r_1, and r_2 constants is to scale both the cognitive knowledge of the particle and the social knowledge of the swarm on the velocity changes, so that the new position vectors of all particles will approach the optimal solution of the problem accordingly. Figure 2 depicts the flowchart of the standard PSO.

In brief, the standard PSO works as follows. First, the user enters some required inputs like the swarm size (S), dimensions of the particles (N), acceleration constants (C_1, C_2), inertia weight constant (W), fitness function (F) to score particle performance in the problem domain, and the maximum number of iterations (t_max). Next, PSO randomly initializes the position and velocity vectors with the specified dimensions for all particles in the swarm. Then, PSO initializes the personal best vector for each particle in the swarm with the specified dimensions and sets it to a very small value; furthermore, PSO initializes the global best vector of the swarm with the specified dimensions and sets it to a very small value. PSO computes the fitness score for each particle using the fitness function and updates the personal best vectors for all particles and the global best vector of the swarm. After that, PSO starts the first iteration by computing r_1 and r_2 randomly and then updates the velocity and position vectors for each particle according to (1) and (2), respectively. In addition to that, PSO computes again the fitness score for each particle according to the given fitness function and updates the personal best vector for each particle if the fitness score of that particle at this iteration is bigger than the fitness score of the personal best vector of that particle (F(P_i^t) > F(P_i^best)). Also, PSO updates the global best vector of the swarm if any of the fitness scores of the personal best vectors of the particles is bigger than the fitness score of the global best vector of the swarm (F(P_i^best) > F(G^best), i = 1 to S). Then, PSO checks the stop criterion, and if one is satisfied, PSO will output the global best vector as the optimal solution and terminate. Else, PSO will proceed to the next iteration and repeat the same procedure described in the first iteration above until the stop criterion is reached.

[Figure 2: The flowchart of the standard PSO.]

The stop criterion is satisfied when either the training error is smaller than a predefined value (ε) or the maximum number of iterations is reached. Finally, PSO performs better than GA in terms of simplicity and generality [36]. PSO is simpler than GA because it contains only one operator and is easy to implement. Also, the generality of PSO means that PSO does not need any modifications to be applied to any optimization problem; as well, it converges faster to the optimal solution, which decreases the computations and saves resources.
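To make the loop above concrete, the following is a minimal NumPy sketch of the standard PSO described in this section; the toy fitness function and the parameter defaults are illustrative assumptions rather than the paper's actual settings:

```python
import numpy as np

def pso(fitness, n_dims, swarm_size=20, w=0.9, c1=2.0, c2=2.0,
        v_min=0.0, v_max=1.0, t_max=30, eps=1e-4):
    # Initialize position and velocity vectors for all S particles
    P = np.random.uniform(0.0, 1.0, (swarm_size, n_dims))
    V = np.random.uniform(v_min, v_max, (swarm_size, n_dims))
    # Personal bests start at the initial positions
    P_best = P.copy()
    PS = np.array([fitness(p) for p in P])       # personal best scores
    g = np.argmax(PS)
    G_best, GS = P_best[g].copy(), PS[g]         # global best and its score

    for t in range(t_max):
        GS_prev = GS
        r1, r2 = np.random.rand(), np.random.rand()  # shared by all particles at iteration t
        V = w * V + c1 * r1 * (P_best - P) + c2 * r2 * (G_best - P)  # equation (1)
        P = P + V                                                     # equation (2)
        for i in range(swarm_size):
            score = fitness(P[i])
            if score > PS[i]:                    # update personal best
                PS[i], P_best[i] = score, P[i].copy()
            if PS[i] > GS:                       # update global best
                GS, G_best = PS[i], P_best[i].copy()
        if GS - GS_prev < eps:                   # stop: global best barely improved
            break
    return G_best, GS

# Toy usage: maximize a negative sphere function in 3 dimensions
best, score = pso(lambda x: -np.sum((x - 0.5) ** 2), n_dims=3)
```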

4.2. DNN Hyperparameters Selection Using PSO. The selection of the hyperparameters of a DNN can be interpreted as an optimization task; hence, the main objective is to minimize the loss function L(M, T), where M is the DNN model and T is the training set. To achieve this goal, we selected PSO as our optimization algorithm, which outputs the vector of the optimized hyperparameters H that minimized the loss function L after constructing the DNN model M, which is tuned by the hyperparameters H and trained on the training set T. The fitness function of our PSO-based algorithm is a function F*: R^N → R that maps a real-valued vector of hyperparameters of length N to a real-valued accuracy value of the trained DNN that is tuned by that hyperparameters vector and tested on the test set Z. In other words, our PSO-based algorithm finds the optimal hyperparameters vector among all possible combinations of hyperparameters, which maximizes the accuracy of the trained DNN on the test set. Furthermore, to ensure the generality of our PSO-based algorithm, meaning that it is independent of the DNN that will be optimized and can be adapted easily to any classification task using a DNN, we allow the user to select which hyperparameters to use in his work. Therefore, the user is responsible, when using our algorithm, for defining the number of the hyperparameters as well as the type and domain of each parameter. The domain of a parameter is the set of all possible values of that parameter. After that, our PSO-based algorithm will use a special built-in generator that depends on the number and domains of the defined parameters to initialize all the particles (hyperparameters vectors) in the swarm.

During the execution of the proposed algorithm, at each iteration, a validation process is involved to validate that the updated position and velocity vectors are appropriate to the predefined ranges of the parameters. Finally, in order to reduce computations and converge faster, two different stop conditions are checked simultaneously at the end of each iteration. The first occurs when the fitness score of the global best vector has increased by less than a threshold ε, which is specified by the user. The aim of this condition is to guarantee that the global best vector cannot be improved further, even if the maximum number of iterations has not been reached yet. The second condition happens when the maximum number of iterations is carried out. When either the first or the second condition is satisfied, the proposed algorithm outputs the global best vector as the optimal solution H and terminates the search process. Figure 3 shows the flowchart of our PSO-based DNN hyperparameters selection algorithm.

[Figure 3: The flowchart of the proposed algorithm.]
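The paper leaves the exact repair rule of this validation step open; a minimal sketch, assuming the common choice of clipping each dimension back into its predefined range, could look as follows:

```python
import numpy as np

# Hypothetical validation step: clip updated position vectors back into the
# per-dimension hyperparameter domains, and velocities into [V_min, V_max].
def validate(position, velocity, lower, upper, v_min, v_max):
    position = np.clip(position, lower, upper)
    velocity = np.clip(velocity, v_min, v_max)
    return position, velocity
```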

4.3. Algorithm Steps

Inputs: number of hyperparameters (N), swarm size (S), acceleration constants (C_1, C_2), inertia constant (W), maximum value of velocity (V_max), minimum value of velocity (V_min), maximum number of iterations (t_max), evolution threshold (ε), training set (T), and test set (Z).
Output: the optimal solution H.
Procedure:

Step 1. For k ← 1 to N:
    Let h_k be the k-th hyperparameter.
    If the domain of h_k is continuous, then
        let B_k^low be the lower bound of h_k and B_k^up be the upper bound of h_k;
        let the user enter the lower and upper bounds of the hyperparameter h_k.
    End of if


    Else
        let Y_k be the set of all possible values of h_k;
        let the user enter all elements of the set Y_k.
    End of else
End of for

Step 2. Let F* be the fitness function, which constructs a DNN tuned with the given hyperparameters, then trains the DNN on T and tests it on Z. Finally, F* computes the accuracy of the DNN as output.

Step 3. Let G^best be the global best vector of the swarm, of length N.
Let GS be the best fitness score of the swarm.
GS ← −∞

Step 4. For i ← 1 to S:
    Let P_i be the position vector of the i-th particle, of length N.
    Let V_i be the velocity vector of the i-th particle, of length N.
    Let P_i^best be the personal best vector of the i-th particle, of length N.
    Let PS_i be the fitness score of the personal best vector of the i-th particle.
    For j ← 1 to N:
        If the domain of h_j is continuous, then
            select h_j uniformly distributed: P_i[j] ← U(B_j^low, B_j^up).
        End of if
        Else
            select h_j randomly: P_i[j] ← RAND(Y_j).
        End of else
        V_i[j] ← U(V_min, V_max)
    End of for
    P_i^best ← P_i
    Let FS_i be the fitness score of the i-th particle.
    FS_i ← F*(P_i)
    PS_i ← FS_i
    If FS_i > GS, then
        G^best ← P_i
        GS ← FS_i
    End of if
End of for

Step 5. Let GS_prv be the previous best fitness score of the swarm.
GS_prv ← GS
Let r_1 and r_2 be the random values in PSO.
Let t be the current iteration.
For t ← 1 to t_max:
    r_1 ← U(0, 1)
    r_2 ← U(0, 1)
    For i ← 1 to S:
        Update V_i according to (1).
        Update P_i according to (2).
        FS_i ← F*(P_i)
        If FS_i > PS_i, then
            P_i^best ← P_i
            PS_i ← FS_i
        End of if
        If PS_i > GS, then
            G^best ← P_i^best
            GS ← PS_i
        End of if
    End of for
    If GS − GS_prv < ε, then
        go to Step 6.
    End of if
    GS_prv ← GS
End of for


Table 4: PSO parameters recommended values or ranges.

Parameter | Value/Range
S         | [5, 20]
V_min     | 0
V_max     | 1
C_1       | 2
C_2       | 2
W         | [0.4, 0.9]
t_max     | [30, 50]
ε         | 0.0001

Step 6. Let H be the optimal hyperparameters vector.
H ← G^best
Return H and terminate.
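As an illustration of Step 2, the fitness function F* could be sketched with Keras as below. The decoding of a particle into a hyperparameters dictionary, the binary sigmoid output, and showing only the SGD branch of the optimizer choice are our simplifying assumptions; the actual implementation builds the DNN from all twelve hyperparameters of Table 5:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD

# Sketch of F*: build a DNN tuned by a decoded hyperparameters dictionary h,
# train it on (x_train, y_train) = T, and return its accuracy on (x_test, y_test) = Z.
def fitness(h, x_train, y_train, x_test, y_test):
    model = Sequential()
    for _ in range(int(h["n_hidden_layers"])):
        model.add(Dense(int(h["n_neurons"]), activation=h["activation"],
                        kernel_initializer=h["initialization"]))
        model.add(Dropout(h["dropout_rate"]))
    model.add(Dense(1, activation="sigmoid"))  # assumed binary output: normal vs. masquerader
    model.compile(optimizer=SGD(learning_rate=h["learning_rate"],
                                momentum=h["momentum"]),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=int(h["epochs"]),
              batch_size=int(h["batch_size"]), verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return accuracy
```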

4.4. PSO Parameters. The selection of the values of the PSO parameters (S, V_max, V_min, C_1, C_2, W, t_max, ε) is a very complex process. Fortunately, many empirical and theoretical previous studies have been published to solve this problem [37–40]. They introduced some recommended values of PSO parameters which can be adopted. Table 4 shows every PSO parameter and the corresponding recommended value or range. Thus, for those parameters which have recommended ranges, we can select a value for each parameter from its range randomly and fix it as a constant during the execution of PSO.

5. Experimental Setup and Models

This section explains the methodology of performing our empirical experiments as well as the description of the deep learning models which we used to detect masquerades. As mentioned in Section 3, we selected three UNIX command line-based datasets (SEA, Greenberg, PU). Each of these datasets is a collection of text files in which each text file represents a user. The text file of each user in the particular dataset contains a set of UNIX commands that are issued by that user. This reflects the fact that these datasets do not contain any real masqueraders. However, to simulate masqueraders and to use these datasets in masquerade detection, special data configurations must be implemented prior to proceeding with our experiments. According to Section 3 and its subsections, each dataset has its two different types of data configurations. Therefore, we obtained six data configurations, and each one will be observed separately, which yields six independent experiments for each model. Finally, masquerade detection can be applied to these data configurations by following two different main approaches, namely, static classification and dynamic classification. The two subsequent subsections present the difference between them as well as which deep learning models are exploited for each one.

5.1. Static Classification Approach. In the static classification approach, the classification task is carried out using a dataset of samples which are represented by a set of static features [30]. These static features are defined according to the nature of the task where the classification will be applied. In addition to that, the dataset samples, also called observations, are collected manually by some experts working in the field of that classification task. After that, these samples are split into two independent sets, known as the training and test sets, to train and test the selected model, respectively. The static classification approach has pros and cons as well. Although it provides a faster and easier solution, it requires a ready-to-use dataset with static features. Such a dataset might not be available for some complex classification tasks; hence, the attempt to create a dataset with static features would be a hard mission. In our work, we decided to utilize the existence of three famous UNIX command line-based datasets to implement six different data configurations. Each user in the particular data configuration has a specific number of blocks which are represented by a set of static features. Indeed, these features are the user's UNIX commands, in charge of describing the behavior of that user and later helping the classifier to detect masquerades. We decided to use two well-known deep learning models, namely, Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN), to accomplish the static masquerade detection task on the implemented six data configurations.

5.1.1. Deep Neural Networks. In Section 4, we explained in detail the DNN structure and the problem of the selection of its hyperparameters. We also proposed a PSO-based algorithm to obtain the optimal hyperparameters vector that maximized the accuracy of the DNN on the given training and test sets. In this subsection, we describe how we utilized the proposed PSO-based algorithm and the DNN in the static masquerade detection task using the six data configurations, which are SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched. Every data configuration has its own structure and a specific number of users, as described in Section 3. So we will have six separate DNN-experiments, and each experiment will be on one of the data configurations.

The methodology of our DNN-experiments consists of four consecutive stages, which are the initialization, optimization, results extraction, and finishing stages. The first stage is to initialize all required operating parameters as well as to prepare the particular data configuration's files, in which each file represents a user in that data configuration. The user file consists of the training set followed by the test set of that user. We set all PSO parameters for all DNN-experiments as follows: S=20, V_min=0, V_max=1, C_1=C_2=2, W=0.9, t_max=30, and ε=10^-4. Then, the last step in the initialization stage is to define the hyperparameters of the DNN and their domains. We used twelve different DNN hyperparameters (N=12). Table 5 shows each DNN hyperparameter and its corresponding defined domain.


Table 5: The used DNN hyperparameters and their domains.

Hyperparameter                      | Domain        | Description
Learning rate                       | [0.01, 0.9]   | Continuous
Momentum                            | [0.1, 0.9]    | Continuous
Decay                               | [0.001, 0.01] | Continuous
Dropout rate                        | [0.1, 0.9]    | Continuous
Number of hidden layers             | [1, 10]       | Discrete with step=1
Numbers of neurons of hidden layers | [1, 100]      | Discrete with step=1
Number of epochs                    | [5, 20]       | Discrete with step=5
Batch size                          | [100, 1000]   | Discrete with step=50
Optimizer                           | [1, 6]        | Discrete with step=1
Initialization function             | [1, 8]        | Discrete with step=1
Layer type                          | [1, 2]        | Discrete with step=1
Activation function                 | [1, 8]        | Discrete with step=1

All the used hyperparameters are numerical, except that the Optimizer, Layer type, Initialization function, and Activation function hyperparameters are categorical. In this case, a list of all possible values is indexed to a sequence-numbered range from 1 to the length of that list. The Optimizer list includes the elements Adagrad, Nadam, Adam, Adamax, RMSprop, and SGD. The Layer type list contains two elements, which are Dropout and Dense. The Initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the Activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
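For illustration, the domains of Table 5 and the categorical index lists above could be encoded as follows; the dictionary layout is our assumption:

```python
# Categorical hyperparameters are indexed from 1 to the length of their list,
# as described in the text (names follow the Keras identifiers).
optimizers      = ["Adagrad", "Nadam", "Adam", "Adamax", "RMSprop", "SGD"]
layer_types     = ["Dropout", "Dense"]
initializations = ["zeros", "normal", "lecun_uniform", "uniform",
                   "glorot_uniform", "glorot_normal", "he_uniform", "he_normal"]
activations     = ["linear", "softmax", "relu", "sigmoid",
                   "tanh", "hard_sigmoid", "softsign", "softplus"]

# Domains of the twelve DNN hyperparameters (Table 5).
domains = {
    "learning_rate":   ("continuous", 0.01, 0.9),
    "momentum":        ("continuous", 0.1, 0.9),
    "decay":           ("continuous", 0.001, 0.01),
    "dropout_rate":    ("continuous", 0.1, 0.9),
    "n_hidden_layers": ("discrete", 1, 10, 1),       # step = 1
    "n_neurons":       ("discrete", 1, 100, 1),      # step = 1
    "epochs":          ("discrete", 5, 20, 5),       # step = 5
    "batch_size":      ("discrete", 100, 1000, 50),  # step = 50
    "optimizer":       ("categorical", 1, len(optimizers)),
    "initialization":  ("categorical", 1, len(initializations)),
    "layer_type":      ("categorical", 1, len(layer_types)),
    "activation":      ("categorical", 1, len(activations)),
}
```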

The optimization and results extraction stages will be performed once for each user in the particular data configuration; that is, they will be repeated for each user U_i, i=1, 2, ..., M, where M is the number of users in the particular data configuration D. The optimization stage starts by splitting the data of the user U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in [0, 1]. After that, we supplied these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector H_i for the i-th user. In addition to that, we save a copy of the H_i values in a database in order to save time and use them again in the RNN-experiment of that particular data configuration D, as will be presented in Section 5.1.2. The results extraction stage takes place by constructing the DNN that is tuned by H_i, training the DNN on T_i, and testing the DNN on Z_i. The values of the classification outcomes True Positive (TP_i), False Positive (FP_i), True Negative (TN_i), and False Negative (FN_i) for the i-th user in the particular data configuration D are extracted and saved for further processing later.

Then, the next user is observed, and the same procedure of the optimization and results extraction stages is performed till the last user in the particular data configuration D is reached. Finally, when all users in the particular data configuration are completed, the last stage (the finishing stage) is executed. The finishing stage computes the summation of all obtained TPs of all users in the particular data configuration D, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively:

$$TP = \sum_{i=1}^{M} TP_i \quad (3)$$

$$FP = \sum_{i=1}^{M} FP_i \quad (4)$$

$$TN = \sum_{i=1}^{M} TN_i \quad (5)$$

$$FN = \sum_{i=1}^{M} FN_i \quad (6)$$

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration D. The former outcomes will be used to compute twelve well-known evaluation metrics to assess the performance of the DNN on the particular data configuration D, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.
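The finishing stage of (3)–(6) amounts to a simple summation over the per-user outcomes; a minimal sketch, assuming the per-user results were saved as a list of dictionaries:

```python
# Sum the per-user outcomes over all M users of a data configuration D,
# as in equations (3)-(6). `per_user` is assumed to look like
# [{"TP": 3, "FP": 1, "TN": 95, "FN": 2}, ...], one entry per user.
def finishing_stage(per_user):
    totals = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for outcome in per_user:
        for key in totals:
            totals[key] += outcome[key]
    return totals
```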

[Figure 4: The flowchart of the DNN-experiments.]

5.1.2. Recurrent Neural Networks. The Recurrent Neural Network is a special type of the traditional feed-forward Artificial Neural Network. Unlike traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to other neurons of the same hidden layer. Therefore, the output of the RNN's hidden layer at any time step (t) is a function of the current inputs and the output of the hidden layer at the previous time step (t-1). In RNN, these directed cycles allow information to circulate in the network and make the hidden layers the storage unit of the whole network [41]. The important characteristics of the RNN are the capability to have memory and to generate periodical sequences.

Despite that, the conventional RNN structure described above has a serious problem, especially when the RNN is trained using the back-propagation technique. The problem is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops. On the other hand, the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem of the conventional RNN limited the use of the RNN to only short-term memory tasks. To solve this problem, a new architecture of RNN was proposed by Hochreiter and Schmidhuber [43], known as Long Short-Term Memory (LSTM). LSTM uses a new structure called a memory cell that is composed of four parts, which are an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter into the memory cell or to block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell to alter the current state of the memory cell or to prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Besides overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].

[Figure 5: The structure of an LSTM memory cell [6].]

In this study, we used the LSTM-RNN model to perform a static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate LSTM-RNN-experiments, and each experiment is on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we firstly prepared all the given data configuration's files by converting all blocks from text to numerical values and then normalizing them in [0, 1]. Next to that, for each user U_i in D, where i=1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. After that, we retrieved the stored optimized hyperparameters vector of the i-th user (H_i) from the database which was created in the previous DNN-experiments. Then, we constructed the RNN model that is tuned by H_i. In order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced by an LSTM memory cell. The constructed LSTM-RNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the i-th user in D. Then, we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.
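A minimal Keras sketch of constructing such an LSTM-RNN is given below: the hidden neurons of the tuned RNN are replaced by LSTM memory cells. The layer sizes, input shape, and binary sigmoid output are illustrative assumptions; in the experiments these settings come from the stored vector H_i:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_lstm_rnn(n_hidden_layers=2, n_cells=64, timesteps=100, n_features=1):
    model = Sequential()
    # First hidden layer of LSTM memory cells
    model.add(LSTM(n_cells, input_shape=(timesteps, n_features),
                   return_sequences=(n_hidden_layers > 1)))
    # Remaining hidden layers; only the last one returns a single vector
    for layer in range(1, n_hidden_layers):
        model.add(LSTM(n_cells, return_sequences=(layer < n_hidden_layers - 1)))
    model.add(Dense(1, activation="sigmoid"))  # normal user vs. masquerader
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```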

[Figure 6: The flowchart of the LSTM-RNN-experiments.]

5.2. Dynamic Classification Approach. In contrast to the static classification approach, the dynamic classification approach does not need a ready-to-use dataset with static features [30]. It deals directly with raw data sources, such as text, image, video, sound, and signal files, and extracts features from them dynamically. The models that use this approach try to learn and represent features in an unsupervised manner. Then, these models train themselves using the extracted features to be able to classify unseen data. The deep learning models fit very well for this approach because the main strengths of deep learning models are the strong ability of automatic feature extraction and self-learning. Besides overcoming the problem of the lack of datasets, dynamic classification models perform more efficiently than static classification models. Despite these advantages, the dynamic classification approach also has drawbacks. Dynamic classification models are slower and take a long time to train compared with static classification models, due to the complex deep structure of these models as well as the huge amount of computations that are required to execute them. Furthermore, dynamic classification models require a very large number of input samples to gain high accuracy values.

In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection on these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of two classes, which will be either a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Furthermore, sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. According to the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among the other machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of research has been accomplished in the literature in the field of text classification using deep learning models. It was started by LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN), known as the LeNet family, and used it in text classification efficiently [44]. Then, various studies have been published to introduce text classification algorithms as well as the factors that impact performance [45–47]. In the study [48], the CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchal approach called Hierarchal Deep Learning for Text classification (HDLTex) is proposed, and three deep structures, which are DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification with a multitask learning framework [53]. The study [54] proposed a new model called the hierarchal attention network for document classification, which was tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the mostly used deep learning model for text classification tasks. So we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model which is biologically inspired by the animal visual cortex. The CNN can be considered as a special type of the traditional feed-forward Artificial Neural Network. The major difference between ANN and CNN is that, instead of the fully connected architecture of ANN, the individual neurons in CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data is entered into the CNN. The first convolutional layer in the CNN consists of individual neurons that are each connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters; each filter is applied to the specified subset of their preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input. The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of their preceding layer's output. There are many pooling nonlinear functions that can be used, but among them, max-pooling is the mostly used, which selects the maximum value in the given pooling window. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked convolutional layer and max-pooling layer pairs to extract features from the entire input and then map these features to the next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to hidden layers in the DNN. This means that neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of CNN proposed in the literature, but the LeNet structure, which was proposed by LeCun et al. [44], is the most common approach used in many applications of computer vision and text classification.

Regarding its stability and high efficiency in text classification, we selected the CNN model which is proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place, where the used model encodes all letters in the input text file using a one-hot representation from a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the dropout layer type, with dropout probability equal to 0.5. In addition to that, the two fully connected layers in the used CNN model have a Sigmoid nonlinear activation function, and both have the same size of 2048 neurons. The output layer in the used CNN model is of the dense layer type; it has a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate=0.01, epochs=30, and batch size=64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].

[Figure 7: The architecture of the used CNN model.]
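A simplified Keras sketch of this character-level CNN follows. The filter counts, kernel sizes, pooling sizes, and the assumed maximum input length are illustrative (the exact values come from [50]); what the sketch preserves from the description above is the overall shape: one-hot input over a 70-character alphabet, six convolution/max-pooling pairs with ReLU, two sigmoid fully connected dropout layers of 2048 units, and a two-neuron softmax output trained with SGD:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

ALPHABET_SIZE = 70   # one-hot encoding over a 70-character alphabet
MAX_LENGTH = 1014    # assumed maximum number of characters per block

model = Sequential()
model.add(Conv1D(256, 7, activation="relu",
                 input_shape=(MAX_LENGTH, ALPHABET_SIZE)))
model.add(MaxPooling1D(2))           # pooling size 2 assumed so six pairs fit
for _ in range(5):                   # five more convolution/max-pooling pairs
    model.add(Conv1D(256, 3, activation="relu"))
    model.add(MaxPooling1D(2))
model.add(Flatten())
model.add(Dense(2048, activation="sigmoid"))
model.add(Dropout(0.5))              # fully connected dropout layer 1
model.add(Dense(2048, activation="sigmoid"))
model.add(Dropout(0.5))              # fully connected dropout layer 2
model.add(Dense(2, activation="softmax"))  # 0: normal user, 1: masquerader
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
```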

In our work, we used a CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate CNN-experiments, and each experiment is on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we firstly prepared all the given data configuration's text files such that each file represents the training and test sets of a user in D. Next to that, for each user U_i in D, where i=1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we also moved each block in the training and test sets of the user U_i to a separate text file. This means that each of the training and test sets of the user U_i consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the i-th user in D. Then, we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.

6. Results and Discussion

We carried out three major empirical experiments, which are the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.


[Figure 8: The flowchart of the CNN-experiments.]

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class | Predicted: Normal User | Predicted: Masquerader
Normal User  | TN                     | FP
Masquerader  | FN                     | TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with TensorFlow 1.6 [61, 62] as the backend over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition to that, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB Cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user itself, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes for that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The used evaluation metrics' definitions and their corresponding equations are as follows (a consolidated implementation sketch is given after the list):

(i) Accuracy shows the rate of true detection over all test sets:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

(ii) Precision shows the rate of correctly classified masqueraders from all blocks in the test set that are classified as masqueraders:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (8)$$

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

$$\text{Recall} = \frac{TP}{TP + FN} \quad (9)$$

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

$$F1\text{-}Score = \frac{2}{1/P + 1/R} \quad (10)$$

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

$$\text{Hit Rate} = \frac{TP}{TP + FN} \quad (11)$$

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 − Hit); i.e., it shows the rate of masquerade blocks that are misclassified as a normal user from all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

$$\text{Miss Rate} = \frac{FN}{FN + TP} \quad (12)$$


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

$$\text{False Alarm Rate} = \frac{FP}{FP + TN} \quad (13)$$

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

$$\text{Cost} = MR + 6 \times FAR \quad (14)$$

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (the Base-Rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base-rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

$$\text{BDR} = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^*) \times P(A \mid I^*)} \quad (15)$$

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior indicates really a normal user [65]. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

$$\text{BTNR} = P(I^* \mid A^*) = \frac{P(I^*) \times P(A^* \mid I^*)}{P(I^*) \times P(A^* \mid I^*) + P(I) \times P(A^* \mid I)} \quad (16)$$

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 − FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and true positive rate at one specific threshold, where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

$$g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \quad (17)$$

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of −1 to 1, where −1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

$$\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \quad (18)$$
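The following sketch consolidates equations (7)–(18) into one routine operating on the four outcomes of a data configuration; the small epsilon guard against zero denominators is our addition:

```python
import math

def evaluate(tp, tn, fp, fn, eps=1e-12):
    p, n = tp + fn, tn + fp                    # masquerader / normal block counts
    total = p + n
    precision = tp / (tp + fp + eps)           # (8)
    hit = tp / (p + eps)                       # (9), (11): Recall = Hit Rate
    miss = fn / (p + eps)                      # (12)
    far = fp / (n + eps)                       # (13)
    base_i, base_n = p / total, n / total      # base rates P(I) and P(I*)
    return {
        "accuracy": (tp + tn) / total,                                    # (7)
        "precision": precision,
        "recall": hit,
        "f1": 2 / (1 / (precision + eps) + 1 / (hit + eps)),              # (10)
        "hit": hit, "miss": miss, "far": far,
        "cost": miss + 6 * far,                                           # (14)
        "bdr": base_i * hit / (base_i * hit + base_n * far + eps),        # (15)
        "btnr": base_n * (1 - far)
                / (base_n * (1 - far) + base_i * miss + eps),             # (16)
        "g_mean": math.sqrt(tp * tn / ((tp + fn) * (tn + fp) + eps)),     # (17)
        "mcc": (tp * tn - fp * fn) / math.sqrt((tp + fn) * (tp + fp)
                                               * (tn + fp) * (tn + fn) + eps),  # (18)
    }
```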

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model to detect masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1, as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Actually, the rows labeled by DNN and LSTM-RNN in Table 7 show the results of static masquerade detection by using the DNN and LSTM-RNN models, respectively, whereas the rows labeled by CNN in Table 7 show the results of dynamic masquerade detection by using the CNN model. Furthermore, the rows marked with an asterisk represent the best results among the same data configuration, whereas, in the original table, underlined values marked the best results over all data configurations.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of the DNN hyperparameters that maximized the accuracy, which means that the sum of the TP and TN outcomes will be increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN will definitely lead to an increase in the value of Hit as well as a decrease in the value of FAR. Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than


Table 7: The results of our experiments. All evaluation metric values are percentages; rows marked with * give the best results within their data configuration.

Dataset | Data Configuration | Model | Accuracy | Precision | Recall | F1-Score | Hit | Miss | FAR | Cost | BDR | BTNR | g-mean | MCC
SEA Dataset | SEA | DNN | 98.08 | 76.26 | 84.85 | 80.33 | 84.85 | 15.15 | 1.28 | 22.83 | 76.25 | 99.26 | 91.52 | 79.45
SEA Dataset | SEA | LSTM-RNN | 98.52 | 82.30 | 86.58 | 84.39 | 86.58 | 13.42 | 0.90 | 18.83 | 82.33 | 99.34 | 92.63 | 83.64
SEA Dataset | SEA | CNN* | 98.84 | 87.77 | 87.01 | 87.39 | 87.01 | 12.99 | 0.59 | 16.51 | 87.72 | 99.37 | 93.00 | 86.78
SEA Dataset | SEA 1v49 | DNN | 96.54 | 99.98 | 96.43 | 98.17 | 96.43 | 3.57 | 0.48 | 6.47 | 99.98 | 52.04 | 97.96 | 70.64
SEA Dataset | SEA 1v49 | LSTM-RNN | 97.86 | 99.98 | 97.79 | 98.87 | 97.79 | 2.21 | 0.38 | 4.48 | 99.98 | 63.70 | 98.70 | 78.74
SEA Dataset | SEA 1v49 | CNN* | 98.78 | 99.99 | 98.74 | 99.36 | 98.74 | 1.26 | 0.19 | 2.40 | 99.99 | 75.51 | 99.27 | 86.22
Greenberg Dataset | Greenberg Truncated | DNN | 93.97 | 92.23 | 80.67 | 86.06 | 80.67 | 19.33 | 2.04 | 31.57 | 92.22 | 94.41 | 88.89 | 82.53
Greenberg Dataset | Greenberg Truncated | LSTM-RNN | 94.72 | 94.88 | 81.53 | 87.70 | 81.53 | 18.47 | 1.32 | 26.39 | 94.87 | 94.68 | 89.70 | 84.76
Greenberg Dataset | Greenberg Truncated | CNN* | 95.43 | 96.16 | 83.53 | 89.40 | 83.53 | 16.47 | 1.00 | 22.47 | 96.16 | 95.24 | 90.94 | 86.86
Greenberg Dataset | Greenberg Enriched | DNN | 97.57 | 96.92 | 92.40 | 94.61 | 92.40 | 7.60 | 0.88 | 12.88 | 96.92 | 97.75 | 95.70 | 93.08
Greenberg Dataset | Greenberg Enriched | LSTM-RNN | 97.98 | 97.57 | 93.60 | 95.54 | 93.60 | 6.40 | 0.70 | 10.60 | 97.56 | 98.10 | 96.41 | 94.28
Greenberg Dataset | Greenberg Enriched | CNN* | 98.60 | 98.55 | 95.33 | 96.92 | 95.33 | 4.67 | 0.42 | 7.19 | 98.55 | 98.61 | 97.43 | 96.03
PU Dataset | PU Truncated | DNN | 81.00 | 99.59 | 78.61 | 87.86 | 78.61 | 21.39 | 2.25 | 34.89 | 99.59 | 39.49 | 87.66 | 54.63
PU Dataset | PU Truncated | LSTM-RNN | 82.19 | 99.69 | 79.89 | 88.70 | 79.89 | 20.11 | 1.75 | 30.61 | 99.68 | 41.10 | 88.60 | 56.46
PU Dataset | PU Truncated | CNN* | 83.75 | 99.74 | 81.64 | 89.79 | 81.64 | 18.36 | 1.50 | 27.36 | 99.73 | 43.38 | 89.68 | 58.79
PU Dataset | PU Enriched | DNN | 90.44 | 99.84 | 89.21 | 94.23 | 89.21 | 10.79 | 1.00 | 16.79 | 99.84 | 56.72 | 93.98 | 70.64
PU Dataset | PU Enriched | LSTM-RNN | 91.31 | 99.88 | 90.18 | 94.78 | 90.18 | 9.82 | 0.75 | 14.32 | 99.88 | 59.08 | 94.61 | 72.61
PU Dataset | PU Enriched | CNN* | 93.75 | 99.92 | 92.93 | 96.30 | 92.93 | 7.07 | 0.50 | 10.07 | 99.92 | 66.78 | 96.16 | 78.52

the corresponding values of the SEA data configuration, the Hit values are dramatically increased in SEA 1v49 for all models, by 10-14% over those in the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49 compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models in terms of all used evaluation metrics over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including the command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened in the PU dataset, where its PU Enriched data configuration obtained better results for all models than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, regarding all used models, SEA 1v49 recorded the best results among the other truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Due to that, enriched data configurations help models to build a user's behavior profile more accurately than truncated data configurations. Regarding all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted for all used models, from best to worst according to the obtained results, as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a), 9(b), 9(c), 9(d), 9(e), 9(f), 9(g), and 9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a), 10(b), 10(c), 10(d), 10(e), and 10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset, as well as over all datasets.

Taking a closer look at Figures 9 and 10, we can notice the stability of the deep learning models, in the sense that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain that, we will discuss the obtained results from the perspective of the static and dynamic masquerade detection techniques.

[Figure 9: Evaluation metrics comparison between models on data configurations. (a) Accuracy. (b) Hit Rate. (c) Miss Rate. (d) False Alarm Rate. (e) Cost. (f) Bayesian Detection Rate. (g) F1-Score. (h) Matthews Correlation Coefficient.]

We used the DNN and LSTM-RNN models to perform the static masquerade detection task on the data configurations with static numeric features. Both the DNN and the LSTM-RNN are supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Owing to this, our DNN and LSTM-RNN models output the best masquerade detection results they can reach for every user in the particular data configuration; accordingly, their performance is enhanced significantly on that data configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. In any case, LSTM-RNN performed better than DNN in terms of all used evaluation metrics, for all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy, and between 0.2% and 0.8% for FAR, in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is the command text file of each user in the particular data configuration. The obtained results clearly show that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then passed to its fully connected layers to train itself to build the user's normal profile, which is used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, such as Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, dynamic masquerade detection is better than the static masquerade detection technique. This gives the impression that the dynamic masquerade detection technique is the best choice for masquerade detection on UNIX command line-based datasets, due to the fact that these datasets are originally textual datasets, and converting them to static numeric datasets may cost them a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used deep learning models got high g-mean percentages for all data configurations. The same holds for the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

[Figure 10: Evaluation metrics comparison for the average performance of the models on datasets. (a) Accuracy. (b) Hit Rate. (c) False Alarm Rate. (d) Bayesian Detection Rate. (e) F1-Score. (f) Matthews Correlation Coefficient.]

Table 8: The results of statistical tests.

              Friedman Test       Wilcoxon Test
Measurement   FS     FC           p1: W, P value    p2: W, P value    p3: W, P value
TP            12     7            0, 0.0025         0, 0.0025         0, 0.0025
FP            12     7            0, 0.0025         0, 0.0025         0, 0.0025
TN            12     7            0, 0.0025         0, 0.0025         0, 0.0025
FN            12     7            0, 0.0025         0, 0.0025         0, 0.0025

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume your data comes from a particular distribution. In our case, we have three repeated treatments (k=3), one for each of the used deep learning models, and six subjects (N=6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k=2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N=6) in every treatment and three paired groups, namely, p1=(DNN, LSTM-RNN), p2=(DNN, CNN), and p3=(LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α=0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
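As an illustration, both tests can be run with SciPy; the per-configuration scores below are hypothetical placeholders, not the paper's actual measurements:

from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical TP scores over the six data configurations (N=6 subjects)
# for the three treatments (k=3 models); replace with real measurements.
tp_dnn      = [40, 91, 62, 105, 33, 70]
tp_lstm_rnn = [45, 100, 70, 110, 41, 80]
tp_cnn      = [60, 110, 80, 120, 50, 90]

# Friedman test across the three repeated treatments.
fs, p = friedmanchisquare(tp_dnn, tp_lstm_rnn, tp_cnn)
print("Friedman: FS=%.2f, p=%.4f" % (fs, p))

# Pairwise Wilcoxon signed-rank tests for p1, p2, and p3.
pairs = {"p1 (DNN, LSTM-RNN)": (tp_dnn, tp_lstm_rnn),
         "p2 (DNN, CNN)":      (tp_dnn, tp_cnn),
         "p3 (LSTM-RNN, CNN)": (tp_lstm_rnn, tp_cnn)}
for name, (a, b) in pairs.items():
    w, p = wilcoxon(a, b)
    print("%s: W=%.1f, p=%.4f" % (name, w, p))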

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases, because FS>FC. This means that the scores of the used deep learning models differ for each measurement. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test, because the P value is smaller than the alpha level (0.0025<0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different. Finally, the reason all measurements give the same results is that the models, in the order (CNN, LSTM-RNN, DNN), have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

[Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations.]
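The reported CD value is consistent with the Nemenyi critical difference formula (our computation, assuming the tabulated critical value q_{0.05} = 2.343 for three classifiers [72]); with k=3 models and N=6 data configurations:

CD = q_{0.05} \sqrt{\frac{k(k+1)}{6N}} = 2.343 \sqrt{\frac{3 \cdot 4}{6 \cdot 6}} \approx 1.3533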

Figures 12(a), 12(b), 12(c), 12(d), and 12(e) show a comparison between the performance of the traditional machine learning models and the used deep learning models, in terms of Hit and FAR percentages, for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN, as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased Hit percentages, by 2-10%, and decreased FAR percentages, by 1-10%, compared to the traditional machine learning models in most cases.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, which means that 50% of the performance is achieved. The top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.

[Figure 12: Models performance comparison for each data configuration. (a) SEA. (b) SEA 1v49. (c) Greenberg Truncated. (d) Greenberg Enriched. (e) PU Enriched.]

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, have the most effective masquerade detection performance over all data configurations. However, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure to compare various ROC curves quantitatively [73]. The AUC value of a ROC curve should be between 0 and 1; the ideal classifier will have an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the used three deep learning models, which are plotted in Figure 13. We can clearly notice that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
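As an aside, such a curve and its AUC can be computed from a model's block-level scores with scikit-learn; the labels and scores below are illustrative only:

from sklearn.metrics import roc_curve, auc

# Illustrative ground truth (1 = masquerader block) and predicted scores.
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.30, 0.70]

fpr, tpr, _ = roc_curve(y_true, y_score)   # FAR and Hit at every threshold
print("AUC = %.4f" % auc(fpr, tpr))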

Table 9: AUC values of the ROC curves of the used models.

Model       AUC
DNN         0.9246
LSTM-RNN    0.9385
CNN         0.9617

[Figure 13: ROC curves of the average performance of the used models over all data configurations.]

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Even though various research studies have focused on masquerade detection for more than a decade, deep studies of this field that utilize deep learning models are seldom found. In this paper, we presented an extensive empirical study of masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most widely used in the literature. In addition, we implemented six different data configurations from these datasets. Masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. While the static approach is performed using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, the dynamic approach is performed using the CNN model, which extracts features from a user's command text files dynamically. In order to solve the problem of hyperparameters selection, as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models achieved strong masquerade detection performance on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static. The results analyses also proved the effectiveness of all used models in masquerade detection, in that they increased Accuracy and Hit percentages, as well as decreased FAR percentages, by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools that can be used in the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html, the Greenberg dataset upon request from its owner at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme, and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219-228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625-628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.

[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676-3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824-827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951-2959, USA, December 2012.

[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141-146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69-74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659-1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481-488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864-1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945-1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.

[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267-2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org.

[58] "NumPy," http://www.numpy.org.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io.

[61] M. Abadi, A. Agarwal, P. Barham et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] TensorFlow, https://www.tensorflow.org.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, JSTOR, vol. 1, no. 6, pp. 80-83, 1945.

[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.


[Figure 2: The flowchart of the standard PSO.]

score of the personal best vector of that particle ($F(P_t^i) > F(P_{best}^i)$). Also, PSO updates the global best vector of the swarm if any of the fitness scores of the personal best vectors of the particles is bigger than the fitness score of the global best vector of the swarm ($F(P_{best}^i) > F(G_{best})$, i=1 to S). Then, PSO checks the stop criterion, and if one is satisfied, PSO will output the global best vector as the optimal solution and terminate. Otherwise, PSO will proceed to the next iteration and repeat the same procedure described for the first iteration above, until the stop criterion is reached.

The stop criterion is satisfied when either the training error is smaller than a predefined value (ε) or the maximum number of iterations is reached. Finally, PSO performs better than GA in terms of simplicity and generality [36]. PSO is simpler than GA because it contains only one operator and is easy to implement. Moreover, the generality of PSO means that PSO does not need any modifications to be applied to any optimization problem; it is also faster to converge to the optimal solution, which decreases computation and saves resources.

The stop criterion is satisfied when either the trainingerror is smaller than a predefined value () or the maximumnumber of iteration is reached Finally PSO performs betterthan GA in terms of simplicity and generality [36] PSO issimpler than GA because it contains only one operator andeasy to implement Also the generality of PSO means thatPSO does not need any modifications to be applied to anyoptimization problem as well as it is faster to converge to theoptimal solutionwhich decreases the computations and savesthe resources

4.2. DNN Hyperparameters Selection Using PSO. The selection of the hyperparameters of a DNN can be interpreted as an optimization task; hence, the main objective is to minimize the loss function L(M, T), where M is the DNN model and T is the training set. To achieve this goal, we selected PSO as our optimization algorithm, which outputs the vector of optimized hyperparameters H that minimizes the loss function L, after constructing the DNN model M, which is tuned by the hyperparameters H and trained on the training set T. The fitness function of our PSO-based algorithm is a function $F^{*}: R^{N} \rightarrow R$ that maps a real-valued vector of hyperparameters, of length N, to the real-valued accuracy of the trained DNN that is tuned by that hyperparameters vector and tested on the test set Z. In other words, our PSO-based algorithm finds the optimal hyperparameters vector, among all possible combinations of hyperparameters, that maximizes the accuracy of the trained DNN on the test set. Furthermore, to ensure the generality of our PSO-based algorithm, meaning that it is independent of the DNN that will be optimized and can be adapted easily to any classification task using a DNN, we let the user select which hyperparameters to use in his work. Therefore, the user is responsible, when using our algorithm, for defining the number of hyperparameters as well as the type and domain of each parameter. The domain of a parameter is the set of all possible values of that parameter. After that, our PSO-based algorithm uses a special built-in generator that depends on the number and domains of the defined parameters to initialize all the particles (hyperparameters vectors) in the swarm.

During the execution of the proposed algorithm, at each iteration, a validation process is involved to keep the updated position and velocity vectors within the predefined ranges of the parameters. Finally, in order to reduce computation and converge faster, two different stop conditions are checked simultaneously at the end of each iteration. The first occurs when the fitness score of the global best vector has increased by less than a threshold ε, which is specified by the user; the aim of this condition is to recognize that the global best vector cannot be improved further, even if the maximum number of iterations has not been reached yet. The second condition occurs when the maximum number of iterations has been carried out. If either the first or the second condition is satisfied, the proposed algorithm outputs the global best vector as the optimal solution H and terminates the search process. Figure 3 shows the flowchart of our PSO-based DNN hyperparameters selection algorithm.

4.3. Algorithm Steps

Inputs: Number of hyperparameters (N), swarm size (S), acceleration constants (C1, C2), inertia constant (W), maximum value of velocity (V_max), minimum value of velocity (V_min), maximum number of iterations (t_max), evolution threshold (ε), training set (T), and test set (Z).
Output: The optimal solution H.
Procedure:

Step 1. For k ← 1 to N:
    Let h_k be the kth hyperparameter.
    If the domain of h_k is continuous, then
        let B_k^low be the lower bound of h_k and B_k^up be the upper bound of h_k;
        let the user enter the lower and upper bounds of the hyperparameter h_k.
    End of if
    Else
        Let Y_k be the set of all possible values of h_k.
        Let the user enter all elements of the set Y_k.
    End of else
End of for

[Figure 3: The flowchart of the proposed algorithm (preprocessing, initialization, evolution, and finishing phases).]

Step 2. Let F* be the fitness function, which constructs the DNN tuned with the given hyperparameters, then trains the DNN on T and tests it on Z. Finally, F* computes the accuracy of the DNN as output.

Step 3. Let G_best be the global best vector of the swarm, of length N.
Let GS be the best fitness score of the swarm: GS ← −∞.

Step 4. For i ← 1 to S:
    Let P_i be the position vector of the ith particle, of length N.
    Let V_i be the velocity vector of the ith particle, of length N.
    Let P_best^i be the personal best vector of the ith particle, of length N.
    Let PS_i be the fitness score of the personal best vector of the ith particle.
    For j ← 1 to N:
        If the domain of h_j is continuous, then select h_j uniformly distributed: P_i[j] ← U(B_j^low, B_j^up).
        End of if
        Else select h_j randomly: P_i[j] ← RAND(Y_j).
        End of else
        V_i[j] ← U(V_min, V_max).
    End of for
    P_best^i ← P_i.
    Let FS_i be the fitness score of the ith particle: FS_i ← F*(P_i); PS_i ← FS_i.
    If FS_i > GS, then G_best ← P_i; GS ← FS_i.
    End of if
End of for

Step 5. Let GS_prv be the previous best fitness score of the swarm: GS_prv ← GS.
Let r1 and r2 be the random values of PSO, and let t be the current iteration.
For t ← 1 to t_max:
    r1 ← U(0, 1); r2 ← U(0, 1).
    For i ← 1 to S:
        Update V_i according to (1).
        Update P_i according to (2).
        FS_i ← F*(P_i).
        If FS_i > PS_i, then P_best^i ← P_i; PS_i ← FS_i.
        End of if
        If PS_i > GS, then G_best ← P_best^i; GS ← PS_i.
        End of if
    End of for
    If GS − GS_prv < ε, then go to Step 6.
    End of if
    GS_prv ← GS.
End of for

Step 6. Let H be the optimal hyperparameters vector: H ← G_best.
Return H and terminate.

Table 4: PSO parameters recommended values or ranges.

Parameter   Value/Range
S           [5, 20]
V_min       0
V_max       1
C1          2
C2          2
W           [0.4, 0.9]
t_max       [30, 50]
ε           0.0001

4.4. PSO Parameters. The selection of the values of the PSO parameters (S, V_max, V_min, C1, C2, W, t_max, ε) is a very complex process. Fortunately, many empirical and theoretical studies have been published to solve this problem [37-40]. They introduced recommended values of the PSO parameters, which can be adopted. Table 4 shows every PSO parameter and its corresponding recommended value or range. Thus, for those parameters which have recommended ranges, we can select a value for each parameter from its range randomly and fix it as a constant during the execution of PSO.
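A minimal sketch of the proposed PSO-based search is given below, assuming a fitness callable that builds, trains, and tests a DNN tuned by a candidate vector and returns its accuracy (the F* of Step 2); the helper names are ours, and the DNN-specific fitness is left as a placeholder:

import random

def pso_search(domains, fitness, S=20, C1=2.0, C2=2.0, W=0.9,
               v_min=0.0, v_max=1.0, t_max=30, eps=1e-4):
    names = list(domains)                  # hyperparameter names
    N = len(names)

    def sample(j):
        d = domains[names[j]]
        # (low, high) tuple = continuous domain; list = discrete domain.
        return random.uniform(*d) if isinstance(d, tuple) else random.choice(d)

    # Initialization (Steps 3-4): random positions, velocities, and bests.
    P = [[sample(j) for j in range(N)] for _ in range(S)]
    V = [[random.uniform(v_min, v_max) for _ in range(N)] for _ in range(S)]
    P_best = [p[:] for p in P]
    PS = [fitness(p) for p in P]
    g = max(range(S), key=lambda i: PS[i])
    G_best, GS = P_best[g][:], PS[g]

    # Evolution (Step 5): move particles and track personal/global bests.
    GS_prv = GS
    for _ in range(t_max):
        r1, r2 = random.random(), random.random()
        for i in range(S):
            for j in range(N):
                V[i][j] = (W * V[i][j] + C1 * r1 * (P_best[i][j] - P[i][j])
                                       + C2 * r2 * (G_best[j] - P[i][j]))
                V[i][j] = max(v_min, min(v_max, V[i][j]))  # validate velocity
                P[i][j] += V[i][j]
                d = domains[names[j]]                      # validate position
                if isinstance(d, tuple):
                    P[i][j] = max(d[0], min(d[1], P[i][j]))
                else:
                    P[i][j] = min(d, key=lambda v: abs(v - P[i][j]))
            FS = fitness(P[i])
            if FS > PS[i]:
                P_best[i], PS[i] = P[i][:], FS
            if PS[i] > GS:
                G_best, GS = P_best[i][:], PS[i]
        if GS - GS_prv < eps:   # first stop condition: negligible improvement
            break
        GS_prv = GS
    return dict(zip(names, G_best)), GS

For instance, pso_search({'learning_rate': (0.01, 0.9), 'hidden_layers': list(range(1, 11))}, my_fitness) would return the best found vector and its accuracy, where my_fitness is a user-supplied function.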

5. Experimental Setup and Models

This section explains the methodology of our empirical experiments, as well as the description of the deep learning models which we used to detect masquerades. As mentioned in Section 3, we selected three UNIX command line-based datasets (SEA, Greenberg, PU). Each of these datasets is a collection of text files in which each text file represents a user. The text file of each user in the particular dataset contains a set of UNIX commands that were issued by that user. This reflects the fact that these datasets do not contain any real masqueraders. However, to simulate masqueraders and to use these datasets in masquerade detection, special data configurations must be implemented prior to proceeding with our experiments. According to Section 3 and its subsections, each dataset has its own two different types of data configurations. Therefore, we obtained six data configurations, each of which is observed separately, which yields six independent experiments for each model. Finally, masquerade detection can be applied to these data configurations by following two different main approaches, namely, static classification and dynamic classification. The two subsequent subsections present the difference between them, as well as which deep learning models are exploited for each one.

5.1. Static Classification Approach. In the static classification approach, the classification task is carried out using a dataset of samples which are represented by a set of static features [30]. These static features are defined according to the nature of the task where the classification will be applied. In addition, the dataset samples, also called observations, are collected manually by experts working in the field of that classification task. After that, these samples are split into two independent sets, known as the training and test sets, to train and test the selected model, respectively. The static classification approach has pros and cons as well: although it provides a faster and easier solution, it requires a ready-to-use dataset with static features. Such a dataset might not be available for some complex classification tasks, and the attempt to create a dataset with static features would then be a hard mission. In our work, we decided to utilize the three famous UNIX command line-based datasets to implement six different data configurations. Each user in the particular data configuration has a specific number of blocks, which are represented by a set of static features. Indeed, these features are the user's UNIX commands, in charge of describing the behavior of that user and later helping the classifier to detect masquerades. We decided to use two well-known deep learning models, namely, Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN), to accomplish the static masquerade detection task on the implemented six data configurations.

5.1.1. Deep Neural Networks. In Section 4, we explained in detail the DNN structure and the problem of the selection of its hyperparameters. We also proposed a PSO-based algorithm to obtain the optimal hyperparameters vector that maximizes the accuracy of the DNN on the given training and test sets. In this subsection, we describe how we utilized the proposed PSO-based algorithm and the DNN in the static masquerade detection task using the six data configurations, which are SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched. Every data configuration has its own structure and a specific number of users, as described in Section 3. So we will have six separate DNN-experiments, and each experiment will be on one of the data configurations.

The methodology of our DNN-experiments consists of four consecutive stages, which are the initialization, optimization, results extraction, and finishing stages. The first stage is to initialize all required operating parameters, as well as to prepare the particular data configuration's files, in which each file represents a user in that data configuration. The user file consists of the training set followed by the test set of that user. We set all PSO parameters for all DNN-experiments as follows: S=20, V_min=0, V_max=1, C1=C2=2, W=0.9, t_max=30, and ε=10^−4. Then, the last step in the initialization stage is to define the hyperparameters of the DNN and their domains. We used twelve different DNN hyperparameters (N=12); Table 5 shows each DNN hyperparameter and its corresponding defined domain. All the used hyperparameters are numerical, except that the Optimizer, Layer type, Initialization function, and Activation function hyperparameters are categorical. In this case, a list of all possible values is indexed to a numbered range from 1 to the length of that list.

Table 5: The used DNN hyperparameters and their domains.

Hyperparameter                       Domain         Description
Learning rate                        [0.01, 0.9]    Continuous
Momentum                             [0.1, 0.9]     Continuous
Decay                                [0.001, 0.01]  Continuous
Dropout rate                         [0.1, 0.9]     Continuous
Number of hidden layers              [1, 10]        Discrete with step=1
Numbers of neurons of hidden layers  [1, 100]       Discrete with step=1
Number of epochs                     [5, 20]        Discrete with step=5
Batch size                           [100, 1000]    Discrete with step=50
Optimizer                            [1, 6]         Discrete with step=1
Initialization function              [1, 8]         Discrete with step=1
Layer type                           [1, 2]         Discrete with step=1
Activation function                  [1, 8]         Discrete with step=1

The Optimizer list includes the elements Adagrad, Nadam, Adam, Adamax, RMSprop, and SGD. The Layer type list contains two elements, which are Dropout and Dense. The Initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the Activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
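To make the indexing concrete, the following hypothetical sketch decodes such a hyperparameters vector H into a compiled Keras model; the dictionary keys and the exact index-to-string mappings are our assumptions, and wiring the learning rate, momentum, and decay into the chosen optimizer is omitted for brevity:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Assumed 1-based index-to-value mappings for the categorical lists above.
OPTIMIZERS  = ['adagrad', 'nadam', 'adam', 'adamax', 'rmsprop', 'sgd']
ACTIVATIONS = ['linear', 'softmax', 'relu', 'sigmoid', 'tanh',
               'hard_sigmoid', 'softsign', 'softplus']
INITS       = ['zeros', 'random_normal', 'lecun_uniform', 'random_uniform',
               'glorot_uniform', 'glorot_normal', 'he_uniform', 'he_normal']

def build_dnn(H, input_dim):
    model = Sequential()
    act  = ACTIVATIONS[H['activation'] - 1]
    init = INITS[H['init'] - 1]
    for k in range(H['hidden_layers']):
        kwargs = {'input_dim': input_dim} if k == 0 else {}
        model.add(Dense(H['neurons'], activation=act,
                        kernel_initializer=init, **kwargs))
        if H['layer_type'] == 1:            # assumed: 1 = Dropout, 2 = Dense
            model.add(Dropout(H['dropout']))
    model.add(Dense(1, activation='sigmoid'))  # normal vs. masquerader block
    model.compile(optimizer=OPTIMIZERS[H['optimizer'] - 1],
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model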

The optimization and results extraction stages are performed once for each user in the particular data configuration; that is, they are repeated for each user U_i, i=1, 2, ..., M, where M is the number of users in the particular data configuration D. The optimization stage starts by splitting the data of the user U_i into two independent sets, T_i and Z_i, which are the training and test sets of the ith user, respectively. The splitting process follows the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in [0, 1]. After that, we supply these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector H_i for the ith user. In addition, we save a copy of the H_i values in a database, in order to save time and use them again in the RNN-experiment of that particular data configuration D, as will be presented in Section 5.1.2. The results extraction stage takes place when constructing the DNN that is tuned by H_i, training the DNN on T_i, and testing the DNN on Z_i. The values of the classification outcomes, True Positive (TP_i), False Positive (FP_i), True Negative (TN_i), and False Negative (FN_i), for the ith user in the particular data configuration D are extracted and saved for further processing later.

Then, the next user is observed, and the same procedure of the optimization and results extraction stages is performed, until the last user in the particular data configuration D is reached. Finally, when all users in the particular data configuration are completed, the last stage (the finishing stage) is executed. The finishing stage computes the summation of all obtained TPs of all users in the particular data configuration D, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively.

TP = \sum_{i=1}^{M} TP_i   (3)

FP = \sum_{i=1}^{M} FP_i   (4)

TN = \sum_{i=1}^{M} TN_i   (5)

FN = \sum_{i=1}^{M} FN_i   (6)

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration D. These outcomes are later used to compute ten well-known evaluation metrics to assess the performance of the DNN on the particular data configuration D, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.
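For concreteness, a sketch of how such metrics follow from the four aggregated counts is shown below (standard definitions; the paper's Cost metric, which combines Miss and FAR, is omitted here):

import math

# Standard metric definitions from the aggregated TP, FP, TN, and FN counts.
def evaluate(TP, FP, TN, FN):
    hit  = TP / (TP + FN)                    # Hit (Recall, TPR)
    far  = FP / (FP + TN)                    # False Alarm Rate (FPR)
    prec = TP / (TP + FP)                    # Precision
    return {
        'Accuracy': (TP + TN) / (TP + FP + TN + FN),
        'Hit': hit,
        'Miss': 1.0 - hit,
        'FAR': far,
        'Precision': prec,
        'F1-Score': 2 * prec * hit / (prec + hit),
        'g-mean': math.sqrt(hit * (1.0 - far)),
        'MCC': (TP * TN - FP * FN) / math.sqrt(
            (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)),
    }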

5.1.2. Recurrent Neural Networks. The Recurrent Neural Network is a special type of the traditional feed-forward Artificial Neural Network. Unlike the traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to the other neurons of the same hidden layer. Therefore, the output of the RNN's hidden layer at any time step (t) depends on the current inputs and on the output of the hidden layer at the previous time step (t−1). In the RNN, these directed cycles allow information to circulate in the network and make the hidden layers the storage unit of the whole network [41]. The important characteristics of the RNN are the capability to have memory and to generate periodical sequences.

Despite that, the conventional RNN structure described above has a serious problem, especially when the RNN is trained using the back-propagation technique. The problem is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops; the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem of the conventional RNN limited the use of the RNN to short-term memory tasks only. To solve this problem, a new RNN architecture was proposed by Hochreiter and Schmidhuber [43], known as Long Short-Term Memory (LSTM). LSTM uses a new structure called a memory cell, which is composed of four parts: an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using the three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter the memory cell or block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell, altering the current state, or prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Besides overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].

[Figure 4: The flowchart of the DNN-experiments.]

[Figure 5: The structure of an LSTM cell [6].]
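For reference, the computations of the cell shown in Figure 5 can be written as follows (the standard LSTM formulation of [43], with \sigma the logistic sigmoid, \odot elementwise multiplication, x_t the input, h_t the output, and c_t the cell state):

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
h_t = o_t \odot \tanh(c_t)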

In this study, we used the LSTM-RNN model to perform a static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we will have six separate LSTM-RNN-experiments, each on one of the data configurations. The methodology of all of these experiments is the same and is as follows. For the given data configuration D, we first prepared all the given data configuration's files by converting all blocks from text to numerical values and then normalizing them in [0, 1]. Next, for each user U_i in D, where i=1, 2, ..., M and M is the number of users in D, we did the following steps. We split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the ith user in D, respectively; the splitting process follows the structure of the particular data configuration, described in Section 3. After that, we retrieved the stored optimized hyperparameters vector of the ith user (H_i) from the database created in the previous DNN-experiments. Then, we constructed the RNN model that is tuned by H_i; in order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced with an LSTM memory cell, as sketched below. The constructed LSTM-RNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then, we proceed to the next user in D and do the same previous steps, until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.
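A hypothetical sketch of this substitution in Keras is shown below, reusing the same optimized vector H (the key names follow the build_dnn sketch above, and the optimizer wiring is again simplified):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_lstm_rnn(H, timesteps, features):
    model = Sequential()
    n = H['hidden_layers']
    for k in range(n):
        kwargs = {'input_shape': (timesteps, features)} if k == 0 else {}
        # return_sequences=True keeps the time axis for stacked LSTM layers.
        model.add(LSTM(H['neurons'], return_sequences=(k < n - 1), **kwargs))
        if H['layer_type'] == 1:            # assumed: 1 = Dropout
            model.add(Dropout(H['dropout']))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model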

52 Dynamic Classification Approach In contrast of staticclassification approach dynamic classification approach doesnot need a ready-to-use dataset with static features [30] Itcovenants directly with raw data sources such as text imagevideo sound and signal files and extracts features from themdynamically The models that use this approach try to learnand represent features in unsupervised manner Then thesemodels train themselves using the extracted features to beable to classify unseen dataThe deep learningmodels fit verywell for this approach because the main objectives of deeplearning models are the strong ability of automatic featureextraction and self-learning Rather than that dynamicclassification models overcome the problem of the lake ofdatasets it performs more efficient than the static classifica-tionmodels Despite these advantages dynamic classificationapproach has also drawbacks Dynamic classification modelsare slower and take a long time to train if compared with

12 Security and Communication Networks

[Figure 6: The flowchart of the LSTM-RNN-experiments. Steps: (1) input data configuration D and M; (2) prepare files of D; (3) i ← 1; (4) split data of U_i into T_i and Z_i sets; (5) database of optimized hyperparameters H_i; (6) construct the LSTM-RNN model tuned by H_i; (7) train the LSTM-RNN model on T_i; (8) test the LSTM-RNN model on Z_i; (9) obtain and save TP_i, FP_i, TN_i, and FN_i for the user U_i; (10) i ← i+1; (11) is i > M?; (12) compute and save TP, FP, TN, and FN for D; (13) output TP, FP, TN, and FN.]


In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection on these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of two classes: a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. Given the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of research has been published in the literature on text classification using deep learning models. It started with LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN), known as the LeNet family, and used it efficiently in text classification [44]. Since then, various studies have been published to introduce text classification algorithms as well as the factors that impact their performance [45–47]. In the study [48], a CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchical approach called Hierarchical Deep Learning for Text classification (HDLTex) is proposed, in which three deep structures, namely, DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced in [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification within a multitask learning framework [53]. The study [54] proposed a new model called the hierarchical attention network for document classification, which is tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the most commonly used deep learning model for text classification tasks, so we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model that is biologically inspired by the animal visual cortex. The CNN can be considered a special type of the traditional feed-forward Artificial Neural Network. The major difference between the ANN and the CNN is that, instead of the fully connected architecture of the ANN, the individual neurons in the CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data is entered into the CNN. The first convolutional layer in the CNN consists of individual neurons, each of which is connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters, where each filter is applied to the specified subset of their preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input. The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of their preceding layer's output. There are many nonlinear pooling functions that can be used, but among them max-pooling, which selects the maximum value in the given pooling window, is the most commonly used. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked convolutional layer and max-pooling layer pairs to extract features from the entire input and then map these features to the next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to the hidden layers in the DNN; this means that neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of the CNN proposed in the literature, but the LeNet structure, proposed by LeCun et al. [44], is the most common approach, used in many applications of computer vision and text classification.


[Figure 7: The architecture of the used CNN model: user's command text files → quantization → input layer → convolutional and max-pooling pairs (C1/P1 feature maps through C6/P6) → two fully connected dropout layers of 2048 sigmoid neurons each → output dense layer with two softmax neurons → 0 (normal) / 1 (masquerader).]


Owing to its stability and high efficiency in text classification, we selected the CNN model proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place: the used model encodes all letters in the input text file using a one-hot representation over a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the dropout type with a dropout probability equal to 0.5; in addition, both have a Sigmoid nonlinear activation function and the same size of 2048 neurons each. The output layer in the used CNN model is a dense layer with a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate = 0.01, epochs = 30, and batch size = 64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model; it is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].
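As an illustration of this architecture only, the following Keras sketch builds a CNN of the described shape. The sequence length (1024), number of filters (256), kernel size (3), and pool size (2) are assumptions introduced for the sketch, since the paper does not report them.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (InputLayer, Conv1D, MaxPooling1D,
                                     Flatten, Dense, Dropout)
from tensorflow.keras.optimizers import SGD

ALPHABET_SIZE = 70  # one-hot character quantization, as in the used model
SEQ_LEN = 1024      # assumed maximum number of characters per input file

model = Sequential()
model.add(InputLayer(input_shape=(SEQ_LEN, ALPHABET_SIZE)))
for _ in range(6):  # six convolution and max-pooling pairs (C1/P1 ... C6/P6)
    model.add(Conv1D(filters=256, kernel_size=3, activation="relu"))
    model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
for _ in range(2):  # two fully connected dropout layers of 2048 sigmoid neurons
    model.add(Dense(2048, activation="sigmoid"))
    model.add(Dropout(0.5))
model.add(Dense(2, activation="softmax"))  # 0 = normal, 1 = masquerader
model.compile(optimizer=SGD(learning_rate=0.01),  # grid-searched value reported by the paper
              loss="categorical_crossentropy", metrics=["accuracy"])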

In our work, we used the CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment, so we have six separate CNN-experiments, each on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all of the data configuration's text files such that each of them represents the training and test sets of a user in D. Next, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps. We split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the i-th user in D, respectively; the splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we moved each block in the training and test sets of the user U_i to a separate text file; this means that each of the training and test sets of the user U_i consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the i-th user in D. Then, we proceeded to the next user in D and repeated the same steps until the last user in D was reached. After all users in D were completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.
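For instance, the quantization of one block file into the network's input format could look like the following sketch; the alphabet string and the maximum length are illustrative stand-ins for the 70-character alphabet of [50].

import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
SEQ_LEN = 1024  # assumed maximum length, must match the CNN input layer

def quantize(text):
    # Encode each character as a one-hot row; characters outside the
    # alphabet (and positions past the end of the file) stay all-zero.
    m = np.zeros((SEQ_LEN, len(ALPHABET)), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:SEQ_LEN]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            m[pos, idx] = 1.0
    return m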

6. Results and Discussion

We carried out three major empirical experiments, namely, DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.


[Figure 8: The flowchart of the CNN-experiments. Steps: (1) input data configuration D and M; (2) prepare text files of D; (3) i ← 1; (4) split data of U_i into T_i and Z_i text sets; (5) move each block in T_i and Z_i to a separate text file; (6) construct the used CNN model; (7) train the CNN model on T_i; (8) test the CNN model on Z_i; (9) obtain and save TP_i, FP_i, TN_i, and FN_i for the user U_i; (10) i ← i+1; (11) is i > M?; (12) compute and save TP, FP, TN, and FN for D; (13) output TP, FP, TN, and FN.]

Table 6: The confusion matrix of the masquerade detection outcomes.

                     Predicted Class
Actual Class         Normal User    Masquerader
Normal User          TN             FP
Masquerader          FN             TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with TensorFlow 1.6 [61, 62], backed by CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment was run in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader, and whenever a good user is correctly classified as a good user, we have a TN. An FP occurs when a good user is misclassified as a masquerader; in contrast, an FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes of that data configuration to compute twelve well-known evaluation metrics; by using these evaluation metrics, we then assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The definitions of the used evaluation metrics and their corresponding equations are as follows.

(i) Accuracy shows the rate of true detection over the whole test set:

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (7)

(ii) Precision shows the rate of correctly classified masqueraders over all blocks in the test set that are classified as masqueraders:

Precision = TP / (TP + FP)  (8)

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

Recall = TP / (TP + FN)  (9)

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

F1-Score = 2 / (1/P + 1/R)  (10)

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

Hit Rate = TP / (TP + FN)  (11)

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 − Hit); i.e., it shows the rate of masquerade blocks that are misclassified as a normal user over all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

Miss Rate = FN / (FN + TP)  (12)


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

False Alarm Rate = FP / (FP + TN)  (13)

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

Cost = MR + 6 × FAR  (14)

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the base rate of incidence into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively, and let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

Bayesian Detection Rate = P(I | A) = [P(I) × P(A | I)] / [P(I) × P(A | I) + P(I*) × P(A | I*)]  (15)

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set for which the predicted normal behavior indicates a really normal user [65]. Let I and I* denote a masquerade and a normal behavior, respectively, and let A and A* denote the predicted masquerade and normal behavior, respectively. Then BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

Bayesian True Negative Rate = P(I* | A*) = [P(I*) × P(A* | I*)] / [P(I*) × P(A* | I*) + P(I) × P(A* | I)]  (16)

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained as (1 − FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold, where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

g-mean = √[(TP × TN) / ((TP + FN) × (TN + FP))]  (17)

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes (imbalanced datasets) [68]. MCC has a range of −1 to 1, where −1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

MCC = (TP × TN − FP × FN) / √[(TP + FN) × (TP + FP) × (TN + FP) × (TN + FN)]  (18)
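For reference, the following small Python helper (our own sketch, not code from the paper) computes all twelve metrics, (7) through (18), from the four outcomes of a data configuration; the base rates needed by BDR and BTNR are taken from the test-set composition.

import math

def evaluation_metrics(TP, FP, TN, FN):
    total = TP + FP + TN + FN
    accuracy  = (TP + TN) / total                  # (7)
    precision = TP / (TP + FP)                     # (8)
    recall    = TP / (TP + FN)                     # (9)
    f1   = 2 / (1 / precision + 1 / recall)        # (10)
    hit  = TP / (TP + FN)                          # (11), identical to recall
    miss = FN / (FN + TP)                          # (12)
    far  = FP / (FP + TN)                          # (13)
    cost = miss + 6 * far                          # (14)
    p_masq = (TP + FN) / total                     # base rate of masquerader blocks
    p_norm = (TN + FP) / total                     # base rate of normal blocks
    bdr  = (p_masq * hit) / (p_masq * hit + p_norm * far)               # (15)
    btnr = (p_norm * (1 - far)) / (p_norm * (1 - far) + p_masq * miss)  # (16)
    g_mean = math.sqrt((TP * TN) / ((TP + FN) * (TN + FP)))             # (17)
    mcc = (TP * TN - FP * FN) / math.sqrt(
        (TP + FN) * (TP + FP) * (TN + FP) * (TN + FN))                  # (18)
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall,
            "F1-Score": f1, "Hit": hit, "Miss": miss, "FAR": far,
            "Cost": cost, "BDR": bdr, "BTNR": btnr, "g-mean": g_mean,
            "MCC": mcc}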

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model in detecting masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1 as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. The rows labeled DNN and LSTM-RNN in Table 7 show the results of static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show the results of dynamic masquerade detection using the CNN model; for every data configuration, the CNN row contains the best results among the three models.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of the DNN hyperparameters that maximize accuracy, which means that the sum of the TP and TN outcomes will increase significantly. Thus, according to (11) and (13), increasing the sum of TP and TN leads to an increase in the value of Hit as well as a decrease in the value of FAR. Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than the corresponding values of the SEA data configuration, Hit values are dramatically increased in SEA 1v49 for all models, by 10-14% over those in the SEA data configuration.


Table 7: The results of our experiments (all evaluation metrics in %).

Dataset    Data Configuration   Model     Accuracy  Precision  Recall  F1-Score  Hit    Miss   FAR   Cost   BDR    BTNR   g-mean  MCC
SEA        SEA                  DNN       98.08     76.26      84.85   80.33     84.85  15.15  1.28  22.83  76.25  99.26  91.52   79.45
SEA        SEA                  LSTM-RNN  98.52     82.30      86.58   84.39     86.58  13.42  0.90  18.83  82.33  99.34  92.63   83.64
SEA        SEA                  CNN       98.84     87.77      87.01   87.39     87.01  12.99  0.59  16.51  87.72  99.37  93.00   86.78
SEA        SEA 1v49             DNN       96.54     99.98      96.43   98.17     96.43  3.57   0.48  6.47   99.98  52.04  97.96   70.64
SEA        SEA 1v49             LSTM-RNN  97.86     99.98      97.79   98.87     97.79  2.21   0.38  4.48   99.98  63.70  98.70   78.74
SEA        SEA 1v49             CNN       98.78     99.99      98.74   99.36     98.74  1.26   0.19  2.40   99.99  75.51  99.27   86.22
Greenberg  Greenberg Truncated  DNN       93.97     92.23      80.67   86.06     80.67  19.33  2.04  31.57  92.22  94.41  88.89   82.53
Greenberg  Greenberg Truncated  LSTM-RNN  94.72     94.88      81.53   87.70     81.53  18.47  1.32  26.39  94.87  94.68  89.70   84.76
Greenberg  Greenberg Truncated  CNN       95.43     96.16      83.53   89.40     83.53  16.47  1.00  22.47  96.16  95.24  90.94   86.86
Greenberg  Greenberg Enriched   DNN       97.57     96.92      92.40   94.61     92.40  7.60   0.88  12.88  96.92  97.75  95.70   93.08
Greenberg  Greenberg Enriched   LSTM-RNN  97.98     97.57      93.60   95.54     93.60  6.40   0.70  10.60  97.56  98.10  96.41   94.28
Greenberg  Greenberg Enriched   CNN       98.60     98.55      95.33   96.92     95.33  4.67   0.42  7.19   98.55  98.61  97.43   96.03
PU         PU Truncated         DNN       81.00     99.59      78.61   87.86     78.61  21.39  2.25  34.89  99.59  39.49  87.66   54.63
PU         PU Truncated         LSTM-RNN  82.19     99.69      79.89   88.70     79.89  20.11  1.75  30.61  99.68  41.10  88.60   56.46
PU         PU Truncated         CNN       83.75     99.74      81.64   89.79     81.64  18.36  1.50  27.36  99.73  43.38  89.68   58.79
PU         PU Enriched          DNN       90.44     99.84      89.21   94.23     89.21  10.79  1.00  16.79  99.84  56.72  93.98   70.64
PU         PU Enriched          LSTM-RNN  91.31     99.88      90.18   94.78     90.18  9.82   0.75  14.32  99.88  59.08  94.61   72.61
PU         PU Enriched          CNN       93.75     99.92      92.93   96.30     92.93  7.07   0.50  10.07  99.92  66.78  96.16   78.52

This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49, compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models in terms of all used evaluation metrics over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same holds for the PU dataset, where the PU Enriched data configuration gives better results for all models than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, regarding all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Because of that, enriched data configurations help models build a user's behavior profile more accurately than truncated data configurations. Regarding all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users); this reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted, for all used models, from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset, as well as across all datasets.

By taking a close look at Figures 9 and 10, we can notice the stability of the deep learning models, in the sense that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain that, we will discuss the obtained results from the perspective of static and dynamic masquerade detection techniques.

[Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient.]

We used the DNN and LSTM-RNN models to perform a static masquerade detection task on the data configurations with static numeric features. The DNN, as well as the LSTM-RNN, is supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Given this fact, our DNN and LSTM-RNN models output masquerade detection outcomes as good as they can reach for every user in the particular data configuration; accordingly, their performance is enhanced significantly on that data configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. In any case, LSTM-RNN performed better than DNN in terms of all used evaluation metrics on all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy, and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is command text files of each user in the particular data configuration. The obtained results show clearly that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then passed to its fully connected layers to train the model to build the user's normal profile, which is used later to detect masquerade attacks efficiently. This dynamic process and self-learning capability form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations: Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, the dynamic masquerade detection technique is better than the static one. This gives the impression that dynamic masquerade detection is the best choice for masquerade detection on UNIX command line-based datasets, due to the fact that these datasets are originally textual, and converting them to static numeric datasets may lose a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used deep learning models got high g-mean percentages for all data configurations. The same holds for the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

[Figure 10: Evaluation metrics comparison for the average performance of the models on datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient.]


Table 8: The results of statistical tests.

              Friedman Test          Wilcoxon Test
Measurement   FS     FC              p1: W, P value    p2: W, P value    p3: W, P value
TP            12     7               0, 0.0025         0, 0.0025         0, 0.0025
FP            12     7               0, 0.0025         0, 0.0025         0, 0.0025
TN            12     7               0, 0.0025         0, 0.0025         0, 0.0025
FN            12     7               0, 0.0025         0, 0.0025         0, 0.0025


In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]; nonparametric means that the test does not assume the data comes from a particular distribution. In our case, we have three repeated treatments (k = 3), one for each of the used deep learning models, and six subjects (N = 6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects; mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k = 2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N = 6) in every treatment and three paired groups, namely, p1 = (DNN, LSTM-RNN), p2 = (DNN, CNN), and p3 = (LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero; mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α = 0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
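Both tests can be reproduced with SciPy, as sketched below; the six-element arrays are hypothetical per-configuration scores of one measurement, not the paper's data.

from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical TP counts of each model on the six data configurations
tp_dnn      = [196, 118125, 1241, 1421, 352, 399]
tp_lstm_rnn = [200, 119792, 1254, 1440, 358, 403]
tp_cnn      = [201, 120956, 1285, 1466, 366, 416]

fs, p = friedmanchisquare(tp_dnn, tp_lstm_rnn, tp_cnn)  # k = 3 treatments, N = 6 subjects
print("Friedman statistic:", fs, "P value:", p)

pairs = {"p1": (tp_dnn, tp_lstm_rnn), "p2": (tp_dnn, tp_cnn), "p3": (tp_lstm_rnn, tp_cnn)}
for name, (a, b) in pairs.items():
    w, p = wilcoxon(a, b)  # signed-rank test on the paired differences
    print(name, "W:", w, "P value:", p)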

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS > FC. This means that the scores of the used deep learning models for each measurement are different. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different. Finally, the reason all measurements give the same results is that the models, in the order CNN, LSTM-RNN, DNN, have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

[Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations, with CNN, LSTM-RNN, and DNN at ranks 1, 2, and 3, respectively.]


Figures 12(a)-12(e) show a comparison between the performance of traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1 as the best results in the literature. The difference between the performance of traditional machine learning and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased Hit percentages by 2-10% and decreased FAR percentages by 1-10% compared with traditional machine learning models in most cases.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC plot is the reference line, meaning that 50% of performance is achieved; the top-left corner of the ROC plot means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.


[Figure 12: Models performance comparison (Hit and FAR percentages) between traditional machine learning models (Naive Bayes, Conditional Naive Bayes, SVM, HMM, tree-based) and the used deep learning models for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched.]

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, achieve the most effective masquerade detection performance over all data configurations; nevertheless, all three deep learning models have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing quantitatively between various ROC curves [73]. The AUC value of a ROC curve should be between 0 and 1; the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the used three deep learning models, which are plotted in Figure 13. We can notice clearly that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
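Curves and AUC values of this kind can be produced as in the following scikit-learn sketch; the label and score arrays are placeholders for one model's block-level predictions on a data configuration.

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                    # placeholder block labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]   # placeholder masquerader scores

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC =", auc(fpr, tpr))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="reference")  # the 50% diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()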

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Even though various research studies have focused on masquerade detection for more than a decade, deep studies of this field utilizing deep learning models are seldom found.


Table 9: AUC values of ROC curves of the used models.

Model       AUC
DNN         0.9246
LSTM-RNN    0.9385
CNN         0.9617

[Figure 13: ROC curves (True Positive Rate versus False Positive Rate) of the average performance of the used models over all data configurations.]

In this paper, we presented an extensive empirical study of masquerade detection using DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most commonly used in the literature, and implemented six different data configurations from them. Masquerade detection on these data configurations was carried out using two approaches: the first is static and the second is dynamic. The static approach is performed using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, whereas the dynamic approach is performed using the CNN model, which extracts features from a user's command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models performed well in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static detection. The results analyses proved the effectiveness of all used models in masquerade detection, in that they increased Accuracy and Hit as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon a request to its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A study on masquerade detection, 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159–164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58–74, 2001.


[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488–495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219–228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10–19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48–62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329–340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware and Vulnerability, pp. 37–53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625–628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160–168, 2005.

[16] S. Greenberg, "Using Unix: Collected traces of 168 users," 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5–14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676–3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366–380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388–397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13–22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824–827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546–2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951–2959, USA, December 2012.

[26] O. Ahmed Abdalla, A. Osman Elfaki, and Y. Mohammed AlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42–48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141–146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69–74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659–1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481–488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864–1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284–295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39–43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405–440, 2009.


[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591–600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 IEEE Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945–1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671–1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm: explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58–73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954–21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157–166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.

[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163–222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103–112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649–657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364–371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267–2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org.

[58] "NumPy," http://www.numpy.org.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io.

[61] M. Abadi, A. Agarwal, P. Barham, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] "TensorFlow," https://www.tensorflow.org.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1–7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389–398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179–186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262–274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, JSTOR, vol. 1, no. 6, pp. 80–83, 1945.

[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313–320, 2004.



[Figure 3: The flowchart of the proposed algorithm. Phases: (1) preprocessing (input N, S, V_min, V_max; define domains for h_k, k ← 1 to N; create hyperparameters and velocity generator); (2) initialization (input T, Z, C1, C2, W, t_max; initialize P and V vectors of S particles, each of length N; P_i^best ← −∞ for i ← 1 to S; G_best ← −∞; for all S particles compute F*(P) and update P_i^best; update G_best; t ← 1); (3) evolution (compute r1(t) and r2(t); for all S particles compute V, P, F*(P), and P_i^best; update G_best; t ← t+1; check whether the stop conditions are satisfied); (4) finishing (output H ← G_best; terminate).]

        Else
            Let Yk be the set of all possible values of hk
            Let the user enter all elements of the set Yk
        End of else
    End of for

Step 2. Let F* be the fitness function which constructs a DNN tuned with the given hyperparameters, then trains the DNN on T and tests it on Z. Finally, F* computes the accuracy of the DNN as its output.

Step 3. Let Gbest be the global best vector of the swarm, of length N.
    Let GS be the best fitness score of the swarm.
    GS ← −∞

Step 4. For i ← 1 to S
        Let Pi be the position vector of the ith particle, of length N
        Let Vi be the velocity vector of the ith particle, of length N
        Let Pibest be the personal best vector of the ith particle, of length N
        Let PSi be the fitness score of the personal best vector of the ith particle
        For j ← 1 to N
            If the domain of hj is continuous then
                select hj uniformly distributed: Pi[j] ← U(Bj_low, Bj_up)
            End of if
            Else
                select hj randomly: Pi[j] ← RAND(Yj)
            End of else
            Vi[j] ← U(Vmin, Vmax)
        End of for
        Pibest ← Pi
        Let FSi be the fitness score of the ith particle
        FSi ← F*(Pi)
        PSi ← FSi
        If FSi > GS then
            Gbest ← Pi
            GS ← FSi
        End of if
    End of for

Step 5. Let GSprv be the previous best fitness score of the swarm
    GSprv ← GS
    Let r1 and r2 be random values in PSO
    Let t be the current iteration
    For t ← 1 to tmax
        r1 ← U(0, 1)
        r2 ← U(0, 1)
        For i ← 1 to S
            Update Vi according to (1)
            Update Pi according to (2)
            FSi ← F*(Pi)
            If FSi > PSi then
                Pibest ← Pi
                PSi ← FSi
            End of if
            If PSi > GS then
                Gbest ← Pibest
                GS ← PSi
            End of if
        End of for
        If GS − GSprv < ε then
            go to Step 6
        End of if
        GSprv ← GS
    End of for

Step 6. Let H be the optimal hyperparameters vector
    H ← Gbest
    Return H and Terminate

Table 4: PSO parameters recommended values or ranges.

Parameter | Value/Range
S         | [5, 20]
Vmin      | 0
Vmax      | 1
C1        | 2
C2        | 2
W         | [0.4, 0.9]
tmax      | [30, 50]
ε         | 0.0001

4.4. PSO Parameters. Selecting the values of the PSO parameters (S, Vmax, Vmin, C1, C2, W, tmax, ε) is a very complex process. Fortunately, many empirical and theoretical studies have been published to solve this problem [37-40]. They introduced recommended values for the PSO parameters. Table 4 shows every PSO parameter and its corresponding recommended value or range. Thus, for those parameters which have recommended ranges, we can select a value for each parameter from its range randomly and fix it as a constant during the execution of PSO.
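To make the procedure above concrete, the following is a minimal, self-contained Python sketch of the evolution loop of Steps 4-6; it is not the authors' implementation. The hyperparameter domains are an illustrative subset of Table 5, the velocity and position updates follow the standard PSO rules referenced above as (1) and (2), and the fitness function is a stand-in for F*, which in the paper builds, trains, and tests a DNN and returns its accuracy.

    # Minimal PSO sketch over a mixed continuous/discrete hyperparameter space.
    import random

    # Hypothetical domains (subset of Table 5, for illustration only).
    DOMAINS = [
        ("learning_rate", "continuous", 0.01, 0.9),
        ("dropout_rate",  "continuous", 0.1, 0.9),
        ("hidden_layers", "discrete",   1, 10),
        ("optimizer",     "discrete",   1, 6),   # index into an optimizer list
    ]
    S, T_MAX, C1, C2, W = 20, 30, 2.0, 2.0, 0.9  # values from Table 4
    V_MIN, V_MAX, EPS = 0.0, 1.0, 1e-4

    def fitness(position):
        # Stand-in for F*: replace with "build DNN from position, train on T,
        # test on Z, return accuracy".
        return -sum((p - 0.5) ** 2 for p in position)

    def clip(value, low, up, kind):
        # Keep a coordinate inside its domain; round discrete coordinates.
        value = max(low, min(up, value))
        return round(value) if kind == "discrete" else value

    # Initialization phase (Step 4).
    particles = [[clip(random.uniform(d[2], d[3]), d[2], d[3], d[1])
                  for d in DOMAINS] for _ in range(S)]
    velocities = [[random.uniform(V_MIN, V_MAX) for _ in DOMAINS] for _ in range(S)]
    p_best = [p[:] for p in particles]
    p_score = [fitness(p) for p in particles]
    g_idx = max(range(S), key=lambda i: p_score[i])
    g_best, g_score = p_best[g_idx][:], p_score[g_idx]

    # Evolution phase (Step 5) with the early-stop condition on epsilon.
    prev_g = g_score
    for t in range(T_MAX):
        r1, r2 = random.random(), random.random()
        for i in range(S):
            for j, (_, kind, low, up) in enumerate(DOMAINS):
                velocities[i][j] = (W * velocities[i][j]
                                    + C1 * r1 * (p_best[i][j] - particles[i][j])
                                    + C2 * r2 * (g_best[j] - particles[i][j]))
                particles[i][j] = clip(particles[i][j] + velocities[i][j],
                                       low, up, kind)
            score = fitness(particles[i])
            if score > p_score[i]:
                p_best[i], p_score[i] = particles[i][:], score
            if p_score[i] > g_score:
                g_best, g_score = p_best[i][:], p_score[i]
        if g_score - prev_g < EPS:     # stop condition from Step 5
            break
        prev_g = g_score

    # Finishing phase (Step 6): the optimal hyperparameters vector H.
    print("H:", dict(zip((d[0] for d in DOMAINS), g_best)))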

5. Experimental Setup and Models

This section explains the methodology of our empirical experiments as well as the description of the deep learning models which we used to detect masquerades. As mentioned in Section 3, we selected three UNIX command line-based datasets (SEA, Greenberg, PU). Each of these datasets is a collection of text files in which each text file represents a user. The text file of each user in the particular dataset contains a set of UNIX commands that were issued by that user. This reflects the fact that these datasets do not contain any real masqueraders. However, to simulate masqueraders and to use these datasets in masquerade detection, special data configurations must be implemented prior to proceeding with our experiments. According to Section 3 and its subsections, each dataset has two different types of data configurations. Therefore, we obtained six data configurations, each of which is observed separately, which yields six independent experiments for each model. Finally, masquerade detection can be applied to these data configurations by following two different main approaches, namely, static classification and dynamic classification. The two subsequent subsections present the difference between them as well as which deep learning models are exploited for each one.

5.1. Static Classification Approach. In the static classification approach, the classification task is carried out using a dataset of samples which are represented by a set of static features [30]. These static features are defined according to the nature of the task where the classification will be applied. In addition, the dataset samples, also called observations, are collected manually by experts working in the field of that classification task. After that, these samples are split into two independent sets, known as the training and test sets, to train and test the selected model, respectively. The static classification approach has pros and cons. Although it provides a faster and easier solution, it requires a ready-to-use dataset with static features. Such a dataset might not be available in some complex classification tasks; hence, attempting to create a dataset with static features can be a hard mission. In our work, we decided to utilize the three famous UNIX command line-based datasets to implement six different data configurations. Each user in the particular data configuration has a specific number of blocks which are represented by a set of static features. Indeed, these features are the user's UNIX commands, which describe the behavior of that user and later help the classifier to detect masquerades. We decided to use two well-known deep learning models, namely, Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN), to accomplish the static masquerade detection task on the implemented six data configurations.

5.1.1. Deep Neural Networks. In Section 4, we explained in detail the DNN structure and the problem of the selection of its hyperparameters. We also proposed a PSO-based algorithm to obtain the optimal hyperparameters vector that maximizes the accuracy of the DNN on the given training and test sets. In this subsection, we describe how we utilized the proposed PSO-based algorithm and the DNN in a static masquerade detection task using the six data configurations, which are SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched. Every data configuration has its own structure and a specific number of users, as described in Section 3. So we will have six separate DNN-experiments, and each experiment will be on one of the data configurations.

The methodology of our DNN-experiments consists of four consecutive stages, which are the initialization, optimization, results extraction, and finishing stages. The first stage is to initialize all required operating parameters as well as to prepare the particular data configuration's files, in which each file represents a user in that data configuration. The user file consists of the training set followed by the test set of that user. We set all PSO parameters for all DNN-experiments as follows: S=20, Vmin=0, Vmax=1, C1=C2=2, W=0.9, tmax=30, and ε=10^-4. Then the last step in the initialization stage is to define the hyperparameters of the DNN and their domains. We used twelve different DNN hyperparameters (N=12). Table 5 shows each DNN hyperparameter and its corresponding defined domain. All the used hyperparameters are numerical, except that the Optimizer, Layer type, Initialization function, and Activation function hyperparameters are categorical. In this case, a list of all possible values is indexed to a numbered range from 1 to the length of that list.


Table 5: The used DNN hyperparameters and their domains.

Hyperparameter                       | Domain       | Description
Learning rate                        | [0.01, 0.9]  | Continuous
Momentum                             | [0.1, 0.9]   | Continuous
Decay                                | [0.001, 0.01]| Continuous
Dropout rate                         | [0.1, 0.9]   | Continuous
Number of hidden layers              | [1, 10]      | Discrete with step=1
Numbers of neurons of hidden layers  | [1, 100]     | Discrete with step=1
Number of epochs                     | [5, 20]      | Discrete with step=5
Batch size                           | [100, 1000]  | Discrete with step=50
Optimizer                            | [1, 6]       | Discrete with step=1
Initialization function              | [1, 8]       | Discrete with step=1
Layer type                           | [1, 2]       | Discrete with step=1
Activation function                  | [1, 8]       | Discrete with step=1

The Optimizer list includes the elements Adagrad, Nadam, Adam, Adamax, RMSprop, and SGD. The Layer type list contains two elements, which are Dropout and Dense. The Initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the Activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
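As an illustration of how a particle's position can be decoded into an actual network, the following is a hedged Keras sketch. The helper name build_dnn, the dictionary fields, the single sigmoid output neuron, and the example values are ours, not the paper's; only the categorical lists above are taken from the text.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout

    # Categorical lists indexed 1..len(list), as described above.
    OPTIMIZERS   = ["adagrad", "nadam", "adam", "adamax", "rmsprop", "sgd"]
    INITIALIZERS = ["zeros", "random_normal", "lecun_uniform", "random_uniform",
                    "glorot_uniform", "glorot_normal", "he_uniform", "he_normal"]
    ACTIVATIONS  = ["linear", "softmax", "relu", "sigmoid", "tanh",
                    "hard_sigmoid", "softsign", "softplus"]

    def build_dnn(h, input_dim):
        """Decode a hyperparameter dict h (cf. Table 5) into a compiled DNN."""
        act  = ACTIVATIONS[h["activation"] - 1]
        init = INITIALIZERS[h["init"] - 1]
        model = Sequential()
        model.add(Dense(h["neurons"], activation=act, kernel_initializer=init,
                        input_dim=input_dim))
        for _ in range(h["hidden_layers"] - 1):
            model.add(Dense(h["neurons"], activation=act, kernel_initializer=init))
            if h["layer_type"] == 1:           # 1 = Dropout layer, 2 = Dense only
                model.add(Dropout(h["dropout_rate"]))
        model.add(Dense(1, activation="sigmoid"))  # normal (0) vs masquerader (1)
        model.compile(optimizer=OPTIMIZERS[h["optimizer"] - 1],
                      loss="binary_crossentropy", metrics=["accuracy"])
        return model

    # Example: one candidate drawn from the domains of Table 5 (illustrative).
    model = build_dnn({"hidden_layers": 3, "neurons": 64, "dropout_rate": 0.3,
                       "layer_type": 1, "optimizer": 3, "init": 5, "activation": 3},
                      input_dim=100)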

The optimization and results extraction stages are performed once for each user in the particular data configuration; that is, they are repeated for each user Ui, i=1, 2, ..., M, where M is the number of users in the particular data configuration D. The optimization stage starts by splitting the data of the user Ui into two independent sets Ti and Zi, which are the training and test sets of the ith user, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in [0, 1]. After that, we supplied these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector Hi for the ith user. In addition, we save a copy of the Hi values in a database in order to save time and use them again in the RNN-experiment of that particular data configuration D, as will be presented in Section 5.1.2. The results extraction stage constructs the DNN tuned by Hi, trains the DNN on Ti, and tests the DNN on Zi. The values of the classification outcomes True Positive (TPi), False Positive (FPi), True Negative (TNi), and False Negative (FNi) for the ith user in the particular data configuration D are extracted and saved for further processing later.

Then the next user is observed, and the same procedure of the optimization and results extraction stages is performed until the last user in the particular data configuration D is reached. Finally, when all users in the particular data configuration are completed, the last stage (finishing stage) is executed. The finishing stage computes the summation of all obtained TPs of all users in the particular data configuration D, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively:

$$TP = \sum_{i=1}^{M} TP_i \qquad (3)$$

$$FP = \sum_{i=1}^{M} FP_i \qquad (4)$$

$$TN = \sum_{i=1}^{M} TN_i \qquad (5)$$

$$FN = \sum_{i=1}^{M} FN_i \qquad (6)$$

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration D. The former outcomes are used to compute twelve well-known evaluation metrics to assess the performance of the DNN on the particular data configuration D, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.
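The per-user pipeline above can be summarized in a few lines of Python. This is a schematic sketch only: pso_search, build_dnn, and evaluate are hypothetical helpers standing in for the proposed PSO-based algorithm, the DNN construction, and the train/test step, respectively.

    def run_configuration(users, pso_search, build_dnn, evaluate):
        """Aggregate per-user outcomes for one data configuration, as in (3)-(6)."""
        TP = FP = TN = FN = 0
        for T_i, Z_i in users:            # users: list of per-user (train, test) splits
            H_i = pso_search(T_i, Z_i)    # optimal hyperparameters vector (Section 4)
            model = build_dnn(H_i)        # DNN tuned by H_i
            tp, fp, tn, fn = evaluate(model, T_i, Z_i)  # train on T_i, test on Z_i
            TP, FP, TN, FN = TP + tp, FP + fp, TN + tn, FN + fn
        return TP, FP, TN, FN             # overall outcomes of the configuration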

5.1.2. Recurrent Neural Networks. The Recurrent Neural Network is a special type of the traditional feed-forward Artificial Neural Network. Unlike the traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to other neurons of the same hidden layer. Therefore, the output of the RNN's hidden layer at any time step (t) depends on the current inputs and on the output of the hidden layer at the previous time step (t-1). In the RNN, these directed cycles allow information to circulate in the network and make the hidden layers the storage unit of the whole network [41]. The important characteristics of the RNN are the capability to memorize and to generate periodical sequences.

Despite these strengths, the conventional RNN structure described above has a serious problem.


Figure 4: The flowchart of the DNN-experiments.

Figure 5: The structure of an LSTM cell [6].

The problem appears especially when the RNN is trained using the back-propagation technique and is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops. On the other hand, the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem of the conventional RNN limited its use to short-term memory tasks only. To solve this problem, a new RNN architecture was proposed by Hochreiter and Schmidhuber [43], known as Long Short-Term Memory (LSTM). LSTM uses a new structure called a memory cell that is composed of four parts: an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter the memory cell or block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell to alter its current state or prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Besides overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].

In this study, we used the LSTM-RNN model to perform a static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate LSTM-RNN-experiments, each on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all the given data configuration's files by converting all blocks from text to numerical values and then normalizing them in [0, 1]. Next, for each user Ui in D, where i=1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of Ui into two independent sets Ti and Zi, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. After that, we retrieved the stored optimized hyperparameters vector of the ith user (Hi) from the database created in the previous DNN-experiments. Then we constructed the RNN model tuned by Hi. In order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced with an LSTM memory cell. The constructed LSTM-RNN model is trained on Ti and then tested on Zi. After the test process finished, we extracted and saved the outcomes TPi, FPi, TNi, and FNi of the ith user in D. Then we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.

5.2. Dynamic Classification Approach. In contrast to the static classification approach, the dynamic classification approach does not need a ready-to-use dataset with static features [30]. It deals directly with raw data sources, such as text, image, video, sound, and signal files, and extracts features from them dynamically. The models that use this approach try to learn and represent features in an unsupervised manner. Then these models train themselves using the extracted features to be able to classify unseen data. Deep learning models fit this approach very well because their main objectives are a strong ability for automatic feature extraction and self-learning. Besides overcoming the problem of the lack of datasets, dynamic classification models perform more efficiently than static classification models. Despite these advantages, the dynamic classification approach also has drawbacks.


Figure 6: The flowchart of the LSTM-RNN-experiments.

Dynamic classification models are slower and take a long time to train compared with static classification models, due to their complex deep structure as well as the huge amount of computations required to execute them. Furthermore, dynamic classification models require a very large number of input samples to achieve high accuracy values.

In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection on these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of two classes, either a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Furthermore, sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. According to the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of studies has been conducted in the literature in the field of text classification using deep learning models. It was started by LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN) known as the LeNet family and used it in text classification efficiently [44]. Then various studies were published to introduce text classification algorithms as well as the factors that impact performance [45-47]. In the study [48], the CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchical approach called Hierarchical Deep Learning for Text classification (HDLTex) is proposed, and three deep structures, which are DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification within a multitask learning framework [53]. The study [54] proposed a new model, called the hierarchical attention network, for document classification; it is tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the mostly used deep learning model for text classification tasks. So we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model which is biologically inspired by the animal visual cortex. The CNN can be considered a special type of the traditional feed-forward Artificial Neural Network. The major difference between the ANN and the CNN is that, instead of the fully connected architecture of the ANN, the individual neurons in the CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data enters the CNN. The first convolutional layer in the CNN consists of individual neurons that are each connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters; each filter is applied to the specified subset of the preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input.


Figure 7: The architecture of the used CNN model.

The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of the preceding layer's output. There are many nonlinear pooling functions that can be used, but among them max-pooling is the most common; it selects the maximum value in the given pooling window. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked convolutional-layer and max-pooling-layer pairs to extract features from the entire input and then map these features to the next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to hidden layers in the DNN. This means that neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of the CNN proposed in the literature, but the LeNet structure, proposed by LeCun et al. [44], is the most common approach used in many applications of computer vision and text classification.

Regarding its stability and high efficiency in text classification, we selected the CNN model proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place: the used model encodes all letters in the input text file using a one-hot representation over a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the dropout-layer type, with a dropout probability equal to 0.5. In addition, the two fully connected layers have a Sigmoid nonlinear activation function as well as the same size of 2048 neurons each. The output layer in the used CNN model is a dense layer; it has a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate=0.01, epochs=30, and batch size=64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].
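The following Keras sketch approximates the described architecture. The filter counts, kernel sizes, and fixed input length are our own assumptions in the spirit of Zhang et al. [50]; they are not values given in the paper.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Flatten,
                                         Dense, Dropout)
    from tensorflow.keras.optimizers import SGD

    ALPHABET_SIZE, MAX_LEN = 70, 1014   # one-hot over a 70-character alphabet

    model = Sequential()
    # Six convolution + max-pooling pairs; 256 filters per layer is assumed.
    model.add(Conv1D(256, 7, padding="same", activation="relu",
                     input_shape=(MAX_LEN, ALPHABET_SIZE)))
    model.add(MaxPooling1D(3))
    for _ in range(5):
        model.add(Conv1D(256, 3, padding="same", activation="relu"))
        model.add(MaxPooling1D(3))
    model.add(Flatten())
    # Two fully connected dropout layers of 2048 sigmoid neurons each.
    model.add(Dense(2048, activation="sigmoid"))
    model.add(Dropout(0.5))
    model.add(Dense(2048, activation="sigmoid"))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation="softmax"))   # 0 = normal, 1 = masquerader
    model.compile(optimizer=SGD(learning_rate=0.01),
                  loss="categorical_crossentropy", metrics=["accuracy"])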

In our work, we used the CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate CNN-experiments, each on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all the given data configuration's text files such that each file represents the training and test sets of a user in D. Next, for each user Ui in D, where i=1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of Ui into two independent sets Ti and Zi, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we moved each block in the training and test sets of the user Ui to a separate text file. This means that each of the training and test sets of the user Ui consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on Ti and then tested on Zi. After the test process finished, we extracted and saved the outcomes TPi, FPi, TNi, and FNi of the ith user in D. Then we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.
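For completeness, here is a hedged sketch of the one-hot quantization step that turns each block's text file into the CNN's input matrix. The exact alphabet and the maximum input length are assumptions borrowed from Zhang et al. [50], not taken from the paper.

    import numpy as np

    # Approximation of the 70-character alphabet of Zhang et al. [50].
    ALPHABET = ("abcdefghijklmnopqrstuvwxyz0123456789"
                "-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}\n")
    CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
    MAX_LEN = 1014                      # fixed input length; longer text is truncated

    def quantize(text):
        """Encode a block of commands as a MAX_LEN x |ALPHABET| one-hot matrix."""
        m = np.zeros((MAX_LEN, len(ALPHABET)), dtype=np.float32)
        for pos, ch in enumerate(text.lower()[:MAX_LEN]):
            idx = CHAR_INDEX.get(ch)
            if idx is not None:         # characters outside the alphabet stay all-zero
                m[pos, idx] = 1.0
        return m

    block_matrix = quantize("cd /tmp ; ls -la ; cat notes.txt")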

6. Results and Discussion

We carried out three major empirical experiments, which are the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.


Figure 8: The flowchart of the CNN-experiments.

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class  | Predicted: Normal User | Predicted: Masquerader
Normal User   | TN                     | FP
Masquerader   | FN                     | TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with TensorFlow 1.6 [61, 62], running over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.
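As a small illustration of how these four outcomes are counted for one user's test blocks, consider the following sketch (label 1 denotes a masquerader block and 0 a normal block; the helper name is ours):

    def confusion_counts(y_true, y_pred):
        """Count TP, FP, TN, FN over one user's test blocks (1 = masquerader)."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        return tp, fp, tn, fn

    print(confusion_counts([0, 0, 1, 1], [0, 1, 1, 0]))   # (1, 1, 1, 1)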

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The definitions of the used evaluation metrics and their corresponding equations are as follows.

(i) Accuracy shows the rate of true detection over the whole test set:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$

(ii) Precision shows the rate of correctly classified masqueraders among all blocks in the test set that are classified as masqueraders:
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (8)$$

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (9)$$

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:
$$F1\text{-}\text{Score} = \frac{2}{1/P + 1/R} \qquad (10)$$

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:
$$\text{Hit Rate} = \frac{TP}{TP + FN} \qquad (11)$$

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 − Hit); i.e., it shows the rate of masquerade blocks that are misclassified as a normal user among all masquerade blocks in the test set. It is also called Misses or False Negative Rate:
$$\text{Miss Rate} = \frac{FN}{FN + TP} \qquad (12)$$


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:
$$\text{False Alarm Rate} = \frac{FP}{FP + TN} \qquad (13)$$

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:
$$\text{Cost} = MR + 6 \times FAR \qquad (14)$$

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (the base rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:
$$BDR = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^*) \times P(A \mid I^*)} \qquad (15)$$
P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior really indicates a normal user [65]. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BTNR can be computed as the probability P(I* | A*) according to (16) [65]:
$$BTNR = P(I^* \mid A^*) = \frac{P(I^*) \times P(A^* \mid I^*)}{P(I^*) \times P(A^* \mid I^*) + P(I) \times P(A^* \mid I)} \qquad (16)$$
P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 − FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:
$$g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \qquad (17)$$

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of −1 to 1, where −1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:
$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \qquad (18)$$

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model to detect masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1, as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Actually, the rows labeled DNN and LSTM-RNN in Table 7 show results of static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show results of dynamic masquerade detection using the CNN model. Within each data configuration, the best results are those of the CNN model.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of the DNN hyperparameters to maximize accuracy, which means that the sum of the TP and TN outcomes will be increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN will definitely lead to an increase of the Hit value as well as a decrease of the FAR value.


Table 7: The results of our experiments (all values in %).

Dataset   | Data Configuration  | Model    | Accuracy | Precision | Recall | F1-Score | Hit   | Miss  | FAR  | Cost  | BDR   | BTNR  | g-mean | MCC
SEA       | SEA                 | DNN      | 98.08    | 76.26     | 84.85  | 80.33    | 84.85 | 15.15 | 1.28 | 22.83 | 76.25 | 99.26 | 91.52  | 79.45
SEA       | SEA                 | LSTM-RNN | 98.52    | 82.30     | 86.58  | 84.39    | 86.58 | 13.42 | 0.90 | 18.83 | 82.33 | 99.34 | 92.63  | 83.64
SEA       | SEA                 | CNN      | 98.84    | 87.77     | 87.01  | 87.39    | 87.01 | 12.99 | 0.59 | 16.51 | 87.72 | 99.37 | 93.00  | 86.78
SEA       | SEA 1v49            | DNN      | 96.54    | 99.98     | 96.43  | 98.17    | 96.43 | 3.57  | 0.48 | 6.47  | 99.98 | 52.04 | 97.96  | 70.64
SEA       | SEA 1v49            | LSTM-RNN | 97.86    | 99.98     | 97.79  | 98.87    | 97.79 | 2.21  | 0.38 | 4.48  | 99.98 | 63.70 | 98.70  | 78.74
SEA       | SEA 1v49            | CNN      | 98.78    | 99.99     | 98.74  | 99.36    | 98.74 | 1.26  | 0.19 | 2.40  | 99.99 | 75.51 | 99.27  | 86.22
Greenberg | Greenberg Truncated | DNN      | 93.97    | 92.23     | 80.67  | 86.06    | 80.67 | 19.33 | 2.04 | 31.57 | 92.22 | 94.41 | 88.89  | 82.53
Greenberg | Greenberg Truncated | LSTM-RNN | 94.72    | 94.88     | 81.53  | 87.70    | 81.53 | 18.47 | 1.32 | 26.39 | 94.87 | 94.68 | 89.70  | 84.76
Greenberg | Greenberg Truncated | CNN      | 95.43    | 96.16     | 83.53  | 89.40    | 83.53 | 16.47 | 1.00 | 22.47 | 96.16 | 95.24 | 90.94  | 86.86
Greenberg | Greenberg Enriched  | DNN      | 97.57    | 96.92     | 92.40  | 94.61    | 92.40 | 7.60  | 0.88 | 12.88 | 96.92 | 97.75 | 95.70  | 93.08
Greenberg | Greenberg Enriched  | LSTM-RNN | 97.98    | 97.57     | 93.60  | 95.54    | 93.60 | 6.40  | 0.70 | 10.60 | 97.56 | 98.10 | 96.41  | 94.28
Greenberg | Greenberg Enriched  | CNN      | 98.60    | 98.55     | 95.33  | 96.92    | 95.33 | 4.67  | 0.42 | 7.19  | 98.55 | 98.61 | 97.43  | 96.03
PU        | PU Truncated        | DNN      | 81.00    | 99.59     | 78.61  | 87.86    | 78.61 | 21.39 | 2.25 | 34.89 | 99.59 | 39.49 | 87.66  | 54.63
PU        | PU Truncated        | LSTM-RNN | 82.19    | 99.69     | 79.89  | 88.70    | 79.89 | 20.11 | 1.75 | 30.61 | 99.68 | 41.10 | 88.60  | 56.46
PU        | PU Truncated        | CNN      | 83.75    | 99.74     | 81.64  | 89.79    | 81.64 | 18.36 | 1.50 | 27.36 | 99.73 | 43.38 | 89.68  | 58.79
PU        | PU Enriched         | DNN      | 90.44    | 99.84     | 89.21  | 94.23    | 89.21 | 10.79 | 1.00 | 16.79 | 99.84 | 56.72 | 93.98  | 70.64
PU        | PU Enriched         | LSTM-RNN | 91.31    | 99.88     | 90.18  | 94.78    | 90.18 | 9.82  | 0.75 | 14.32 | 99.88 | 59.08 | 94.61  | 72.61
PU        | PU Enriched         | CNN      | 93.75    | 99.92     | 92.93  | 96.30    | 92.93 | 7.07  | 0.50 | 10.07 | 99.92 | 66.78 | 96.16  | 78.52

Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than the corresponding values of the SEA data configuration, Hit values in SEA 1v49 are dramatically increased for all models, by 10-14% over those of the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49 compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models in terms of all used evaluation metrics over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened in the PU dataset, where its PU Enriched data configuration achieved better results for all models than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, regarding all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Because of that, enriched data configurations help models build the user's behavior profile more accurately than truncated data configurations. Regarding all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted for all used models from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset, as well as across all datasets.

By taking an inspective look at Figures 9 and 10, we can notice the stability of the deep learning models in such a way that they enhance masquerade detection from one data configuration to another in a consistent pattern.


Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient.

To explain that, we will discuss the obtained results from the perspective of static and dynamic masquerade detection techniques. We used the DNN and LSTM-RNN models to perform a static masquerade detection task on data configurations with static numeric features. The DNN, as well as the LSTM-RNN, is supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Owing to this, our DNN and LSTM-RNN models output the best masquerade detection outcomes they can reach for every user in the particular data configuration. Accordingly, their performance is enhanced significantly on that particular data configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. Anyway, the LSTM-RNN performed better than the DNN in terms of all used evaluation metrics regarding all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of the LSTM-RNN, which do not exist in the DNN, enable the LSTM-RNN to memorize previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy, and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is command text files for each user in the particular data configuration. The obtained results show clearly that the CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model which extracted and learned features from the input text files dynamically, in such a way that the relation between the user's individual commands can be recognized. The extracted features are then passed to its fully connected layers, which train themselves to build the user's normal profile, later used to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, such as Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, the dynamic masquerade detection technique is better than the static one. This gives the impression that dynamic masquerade detection is the best choice for masquerade detection regarding UNIX command line-based datasets, due to the fact that these datasets are originally textual datasets, and converting them to static numeric datasets may lose a lot of useful information. Despite that, the DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller.


Figure 10: Evaluation metrics comparison for the average performance of the models on datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient.


Table 8: The results of statistical tests.

Measurement | Friedman FS | Friedman FC | p1: W | p1: P-value | p2: W | p2: P-value | p3: W | p3: P-value
TP          | 12          | 7           | 0     | 0.0025      | 0     | 0.0025      | 0     | 0.0025
FP          | 12          | 7           | 0     | 0.0025      | 0     | 0.0025      | 0     | 0.0025
TN          | 12          | 7           | 0     | 0.0025      | 0     | 0.0025      | 0     | 0.0025
FN          | 12          | 7           | 0     | 0.0025      | 0     | 0.0025      | 0     | 0.0025

Although all the used data configurations are imbalanced, all the used deep learning models got high g-mean percentages for all data configurations. The same holds for the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume the data comes from a particular distribution. In our case, we have three repeated treatments (k=3), one for each of the used deep learning models, and six subjects (N=6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k=2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N=6) in every treatment and three paired groups, namely, p1=(DNN, LSTM-RNN), p2=(DNN, CNN), and p3=(LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α=0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
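Both tests are available in SciPy; the sketch below shows the calls we would use, with placeholder score arrays standing in for the six per-configuration measurements of each model.

    from scipy.stats import friedmanchisquare, wilcoxon

    dnn      = [1, 2, 3, 4, 5, 6]     # placeholder per-configuration scores
    lstm_rnn = [2, 3, 4, 5, 6, 7]
    cnn      = [3, 4, 5, 6, 7, 8]

    stat, p = friedmanchisquare(dnn, lstm_rnn, cnn)
    print("Friedman:", stat, p)       # reject H0 if FS exceeds the critical value

    for a, b in [(dnn, lstm_rnn), (dnn, cnn), (lstm_rnn, cnn)]:
        w, p = wilcoxon(a, b)
        print("Wilcoxon:", w, p)      # reject H0 if p < 0.05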

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS > FC. This means that the scores of the used deep learning models differ for each measurement. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models. In our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different. Finally, the reason all measurements give the same results is that the models, in the order CNN, LSTM-RNN, DNN, have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations.

Figures 12(a)-12(e) show comparisons between the performance of traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1 as the best results in the literature. The difference between the performance of traditional machine learning and the used deep learning models can be perceived obviously. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameter selection used with the DNN and LSTM-RNN as well as the feature learning mechanism used with the CNN. In addition, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased Hit percentages by 2-10% and decreased FAR percentages by 1-10% relative to traditional machine learning models in most cases.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, meaning that 50% of performance is achieved. The top-left corner of the ROC means the best performance with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.


Figure 12: Models performance comparison for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched.

curves show that models in the order CNN LSTM-RNN andDNN have the effective masquerade detection performanceover all data configurations However all these three deeplearning models still have a pretty good fit

The area under the curve (AUC) is also considered a well-known measure to compare various ROC curves quantitatively [73]. The AUC value of a ROC curve lies between 0 and 1, and the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13. We can notice clearly that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
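As an illustration of this analysis, the sketch below computes a ROC curve and its AUC from per-block labels and prediction scores, assuming scikit-learn is available (it is not among the paper's listed tools); the two arrays are hypothetical examples, not the paper's data.

```python
# Compute a ROC curve (FAR on the X-axis, Hit on the Y-axis) and its AUC.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])                  # 1 = masquerader
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])  # model scores

fpr, tpr, _ = roc_curve(y_true, y_score)
print("AUC =", auc(fpr, tpr))   # the ideal classifier reaches 1.0
```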

Table 9: AUC values of ROC curves of the used models.

Model       AUC
DNN         0.9246
LSTM-RNN    0.9385
CNN         0.9617

[Figure 13: ROC curves of the average performance of the used models over all data configurations (True Positive Rate vs. False Positive Rate).]

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Although various research studies have focused on masquerade detection for more than a decade, deep studies in that field utilizing deep learning models are seldom found. In this paper, we presented an extensive empirical study for masquerade detection using DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most widely used in the literature. In addition to that, we implemented six different data configurations from these datasets. The masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed by using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, whereas the dynamic approach is performed by using the CNN model, which extracts features from the user's command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models and analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models performed well in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static detection. The results analyses proved the effectiveness of all used models in masquerade detection in such a way that they increased Accuracy and Hit as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon request from its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion detection: detecting masquerade attacks using UNIX command lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long short term memory recurrent neural network classifier for intrusion detection," in Proceedings of the 3rd International Conference on Platform Technology and Service (PlatCon 2016), Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN 2002), pp. 219-228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-Bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security (ICCIAS 2006), pp. 625-628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.

[16] S. Greenberg, "Using Unix: collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade detection using enriched command lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM 2007), pp. 3676-3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-Bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based acceleration of deep neural networks using high level method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), pp. 824-827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for hyper-parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS 2012), pp. 2951-2959, USA, December 2012.

[26] O. Ahmed Abdalla, A. Osman Elfaki, and Y. Mohammed AlMurtadha, "Optimizing the multilayer feed-forward artificial neural networks architecture and training parameters using genetic algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep multi-task learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2016), pp. 141-146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: a new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD 2016), pp. 69-74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference (GECCO 2014), pp. 1451-1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving deep neural networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation (CEC 2017), pp. 1659-1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference (GECCO 2017), pp. 481-488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion (GECCO 2017), pp. 1864-1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence analysis of PSO for hyper-parameter selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), vol. 3, pp. 1945-1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.

[45] X. Zhang and Y. LeCun, "Text understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: hierarchical deep learning for text classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent convolutional neural networks for text classification," AAAI, vol. 333, pp. 2267-2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent neural network for text classification with multi-task learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A framework for designing the architectures of deep convolutional neural networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org.

[58] "NumPy," http://www.numpy.org.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io.

[61] M. Abadi, A. Agarwal, P. Barham, et al., "TensorFlow: large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] "TensorFlow," https://www.tensorflow.org.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.

[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.


Table 4: PSO parameters recommended values or ranges.

Parameter   Value/Range
S           [5, 20]
Vmin        0
Vmax        1
C1          2
C2          2
W           [0.4, 0.9]
tmax        [30, 50]
ε           0.0001

    GS_prv ← GS
End of for

Step 6: Let H be the optimal hyperparameters vector: H ← G_best. Return H and terminate.

4.4. PSO Parameters. Selection of the values of the PSO parameters (S, Vmax, Vmin, C1, C2, W, tmax) is a very complex process. Fortunately, many empirical and theoretical studies have been published to solve this problem [37-40]. They introduced recommended values for the PSO parameters. Table 4 shows every PSO parameter and its corresponding recommended value or range. Thus, for those parameters that have recommended ranges, we can select a value for each parameter from its range randomly and fix it as a constant during the execution of the PSO.
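A minimal sketch of this selection step is shown below, assuming the ranges of Table 4; the dictionary layout is illustrative, and the concrete values actually fixed in our experiments (S=20, W=0.9, tmax=30, ε=10^-4) are reported in Section 5.1.1.

```python
# Fix each PSO parameter once, drawing randomly from its recommended range.
import random

PSO_PARAMS = {
    "S": random.randint(5, 20),              # swarm size, from [5, 20]
    "V_min": 0.0,                            # minimum particle velocity
    "V_max": 1.0,                            # maximum particle velocity
    "C1": 2.0,                               # cognitive acceleration coefficient
    "C2": 2.0,                               # social acceleration coefficient
    "W": random.uniform(0.4, 0.9),           # inertia weight, from [0.4, 0.9]
    "t_max": random.randint(30, 50),         # maximum iterations, from [30, 50]
    "epsilon": 1e-4,                         # stopping threshold (0.0001)
}
```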

5. Experimental Setup and Models

This section explains the methodology of our empirical experiments as well as the description of the deep learning models that we used to detect masquerades. As mentioned in Section 3, we selected three UNIX command line-based datasets (SEA, Greenberg, PU). Each of these datasets is a collection of text files in which each text file represents a user. The text file of each user in a particular dataset contains a set of UNIX commands issued by that user. This reflects the fact that these datasets do not contain any real masqueraders. However, to simulate masqueraders and to use these datasets in masquerade detection, special data configurations must be implemented prior to proceeding with our experiments. According to Section 3 and its subsections, each dataset has two different types of data configurations. Therefore, we obtained six data configurations, each of which will be observed separately, yielding six independent experiments for each model. Finally, masquerade detection can be applied to these data configurations by following two different main approaches, namely, static classification and dynamic classification. The two subsequent subsections present the difference between them as well as which deep learning models are exploited for each one.

5.1. Static Classification Approach. In the static classification approach, the classification task is carried out using a dataset of samples that are represented by a set of static features [30]. These static features are defined according to the nature of the task where the classification will be applied. In addition to that, the dataset samples, also called observations, are collected manually by experts working in the field of that classification task. After that, these samples are split into two independent sets, known as the training and test sets, to train and test the selected model, respectively. The static classification approach has pros and cons as well. Although it provides a faster and easier solution, it requires a ready-to-use dataset with static features. Such a dataset might not be available for some complex classification tasks; hence, the attempt to create a dataset with static features will be a hard mission. In our work, we decided to utilize the three famous UNIX command line-based datasets to implement six different data configurations. Each user in a particular data configuration has a specific number of blocks that are represented by a set of static features. Indeed, these features are the user's UNIX commands, in charge of describing the behavior of that user and later helping the classifier to detect masquerades. We decided to use two well-known deep learning models, namely, Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN), to accomplish the static masquerade detection task on the implemented six data configurations.

5.1.1. Deep Neural Networks. In Section 4, we explained in detail the DNN structure and the problem of the selection of its hyperparameters. We also proposed a PSO-based algorithm to obtain the optimal hyperparameters vector that maximizes the accuracy of the DNN on the given training and test sets. In this subsection, we describe how we utilized the proposed PSO-based algorithm and the DNN in the static masquerade detection task using the six data configurations, which are SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched. Every data configuration has its structure and a specific number of users, as described in Section 3. So, we will have six separate DNN-experiments, and each experiment will be on one of the data configurations.

The methodology of our DNN-experiments consists of four consecutive stages, which are the initialization, optimization, results extraction, and finishing stages. The first stage is to initialize all required operating parameters as well as to prepare the particular data configuration's files, in which each file represents a user in that data configuration. The user file consists of the training set followed by the test set of that user. We set all PSO parameters for all DNN-experiments as follows: S=20, Vmin=0, Vmax=1, C1=C2=2, W=0.9, tmax=30, and ε=10^-4. Then, the last step in the initialization stage is to define the hyperparameters of the DNN and their domains. We used twelve different DNN hyperparameters (N=12). Table 5 shows each DNN hyperparameter and its corresponding defined domain.

Table 5: The used DNN hyperparameters and their domains.

Hyperparameter                          Domain          Description
Learning rate                           [0.01, 0.9]     Continuous
Momentum                                [0.1, 0.9]      Continuous
Decay                                   [0.001, 0.01]   Continuous
Dropout rate                            [0.1, 0.9]      Continuous
Number of hidden layers                 [1, 10]         Discrete with step=1
Numbers of neurons of hidden layers     [1, 100]        Discrete with step=1
Number of epochs                        [5, 20]         Discrete with step=5
Batch size                              [100, 1000]     Discrete with step=50
Optimizer                               [1, 6]          Discrete with step=1
Initialization function                 [1, 8]          Discrete with step=1
Layer type                              [1, 2]          Discrete with step=1
Activation function                     [1, 8]          Discrete with step=1

All the used hyperparameters are numerical, except that the Optimizer, Layer type, Initialization function, and Activation function hyperparameters are categorical. In this case, a list of all possible values is indexed to a sequence-numbered range from 1 to the length of that list. The Optimizer list includes the elements Adagrad, Nadam, Adam, Adamax, RMSprop, and SGD. The Layer type list contains two elements, which are Dropout and Dense. The Initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the Activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
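For illustration, the search space of Table 5 can be encoded as follows; this is a sketch under the indexing scheme just described, and the key names of the dictionary are ours, not part of the paper's implementation.

```python
# The twelve hyperparameter domains (N=12); categorical lists are indexed 1..len(list).
OPTIMIZERS   = ["Adagrad", "Nadam", "Adam", "Adamax", "RMSprop", "SGD"]        # 1..6
LAYER_TYPES  = ["Dropout", "Dense"]                                            # 1..2
INITIALIZERS = ["zero", "normal", "lecun_uniform", "uniform",
                "glorot_uniform", "glorot_normal", "he_uniform", "he_normal"]  # 1..8
ACTIVATIONS  = ["linear", "softmax", "relu", "sigmoid",
                "tanh", "hard_sigmoid", "softsign", "softplus"]                # 1..8

DOMAINS = {
    "learning_rate": ("continuous", 0.01, 0.9),
    "momentum":      ("continuous", 0.1, 0.9),
    "decay":         ("continuous", 0.001, 0.01),
    "dropout_rate":  ("continuous", 0.1, 0.9),
    "hidden_layers": ("discrete", 1, 10, 1),       # (low, high, step)
    "neurons":       ("discrete", 1, 100, 1),
    "epochs":        ("discrete", 5, 20, 5),
    "batch_size":    ("discrete", 100, 1000, 50),
    "optimizer":     ("discrete", 1, len(OPTIMIZERS), 1),
    "init_function": ("discrete", 1, len(INITIALIZERS), 1),
    "layer_type":    ("discrete", 1, len(LAYER_TYPES), 1),
    "activation":    ("discrete", 1, len(ACTIVATIONS), 1),
}
```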

The optimization and results extraction stages are performed once for each user in the particular data configuration; that is, they are repeated for each user U_i, i=1, 2, ..., M, where M is the number of users in the particular data configuration D. The optimization stage starts by splitting the data of the user U_i into two independent sets, T_i and Z_i, which are the training and test sets of the ith user, respectively. The splitting process follows the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in [0, 1]. After that, we supplied these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector H_i for the ith user. In addition to that, we save a copy of the H_i values in a database in order to save time and use them again in the RNN-experiment of that particular data configuration D, as will be presented in Section 5.1.2. The results extraction stage takes place by constructing the DNN that is tuned by H_i, training the DNN on T_i, and testing the DNN on Z_i. The values of the classification outcomes True Positive (TP_i), False Positive (FP_i), True Negative (TN_i), and False Negative (FN_i) for the ith user in the particular data configuration D are extracted and saved for further processing later.

Then, the next user is observed, and the same procedure of the optimization and results extraction stages is performed until the last user in the particular data configuration D is reached. Finally, when all users in the particular data configuration are completed, the last stage (the finishing stage) is executed. The finishing stage computes the summation of the obtained TPs over all users in the particular data configuration D, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively:

$$TP = \sum_{i=1}^{M} TP_{i} \quad (3)$$

$$FP = \sum_{i=1}^{M} FP_{i} \quad (4)$$

$$TN = \sum_{i=1}^{M} TN_{i} \quad (5)$$

$$FN = \sum_{i=1}^{M} FN_{i} \quad (6)$$

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration D. The former outcomes are used to compute the twelve well-known evaluation metrics that assess the performance of the DNN on the particular data configuration D, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.
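A compact sketch of the per-user loop and the finishing-stage aggregation of Eqs. (3)-(6) is given below; run_user_experiment is a hypothetical helper that stands in for the optimization and results extraction stages of a single user.

```python
# Aggregate the per-user outcomes into the overall TP, FP, TN, and FN of D.
def aggregate_outcomes(users, run_user_experiment):
    TP = FP = TN = FN = 0
    for user in users:                                   # i = 1, ..., M
        tp_i, fp_i, tn_i, fn_i = run_user_experiment(user)
        TP += tp_i                                       # Eq. (3)
        FP += fp_i                                       # Eq. (4)
        TN += tn_i                                       # Eq. (5)
        FN += fn_i                                       # Eq. (6)
    return TP, FP, TN, FN
```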

5.1.2. Recurrent Neural Networks. The Recurrent Neural Network is a special type of the traditional feed-forward Artificial Neural Network. Unlike the traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to the other neurons of the same hidden layer. Therefore, the output of the RNN's hidden layer at any time step (t) depends on the current inputs and on the output of the hidden layer at the previous time step (t-1). In the RNN, these directed cycles allow information to circulate in the network and make the hidden layers the storage unit of the whole network [41]. The important characteristics of the RNN are the capability to have memory and to generate periodic sequences.

[Figure 4: The flowchart of the DNN-experiments, showing the four stages: (1) initialization (input data configuration D and M, set PSO parameters, define hyperparameter domains); (2) optimization (create T_i and Z_i for user U_i, execute the proposed PSO-based algorithm, obtain H_i, store H_i in the database); (3) results extraction (construct the DNN tuned by H_i, train on T_i, test on Z_i, save TP_i, FP_i, TN_i, and FN_i); (4) finishing (compute and output TP, FP, TN, and FN for D).]

[Figure 5: The structure of an LSTM cell [6], with input x_t, input gate i_t, forget gate f_t, cell state c_t, output gate o_t, and output h_t.]

Despite that, the conventional RNN structure described above has a serious problem, especially when the RNN is trained using the back-propagation technique. The problem is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops, whereas the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem limited the use of the conventional RNN to short-term memory tasks only. To solve it, a new RNN architecture was proposed by Hochreiter and Schmidhuber [43], known as Long Short-Term Memory (LSTM). LSTM uses a new structure called a memory cell, which is composed of four parts: an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using the three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter the memory cell or block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell to alter the current state or prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Besides overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].

In this study, we used the LSTM-RNN model to perform the static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So, we will have six separate LSTM-RNN-experiments, and each experiment will be on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all the given data configuration's files by converting all blocks from text to numerical values and then normalizing them in [0, 1]. Next to that, for each user U_i in D, where i=1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. After that, we retrieved the stored optimized hyperparameters vector of the ith user (H_i) from the database created in the previous DNN-experiments. Then, we constructed the RNN model tuned by H_i. In order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced with an LSTM memory cell. The constructed LSTM-RNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then, we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.
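The model construction step can be illustrated with a minimal Keras sketch, assuming the retrieved hyperparameters vector H_i is a dictionary; the key names, the restriction to a few of the twelve hyperparameters, and the single sigmoid output are simplifications, not the paper's exact implementation.

```python
# Build an LSTM-RNN from a user's optimized hyperparameters vector H_i.
from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_lstm_rnn(H, timesteps, n_features):
    model = Sequential()
    # Every hidden neuron of the tuned network becomes an LSTM memory cell.
    for layer in range(H["hidden_layers"]):
        kwargs = {"return_sequences": layer < H["hidden_layers"] - 1}
        if layer == 0:
            kwargs["input_shape"] = (timesteps, n_features)
        model.add(LSTM(H["neurons"], **kwargs))
    model.add(Dense(1, activation="sigmoid"))   # 0 = normal user, 1 = masquerader
    model.compile(optimizer=H["optimizer"],     # e.g., "adam"
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```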

5.2. Dynamic Classification Approach. In contrast to the static classification approach, the dynamic classification approach does not need a ready-to-use dataset with static features [30]. It deals directly with raw data sources, such as text, image, video, sound, and signal files, and extracts features from them dynamically. The models that use this approach try to learn and represent features in an unsupervised manner. Then, these models train themselves using the extracted features to be able to classify unseen data. Deep learning models fit very well with this approach because their main strengths are automatic feature extraction and self-learning. Besides overcoming the problem of the lack of datasets, dynamic classification models perform more efficiently than static classification models. Despite these advantages, the dynamic classification approach also has drawbacks.

[Figure 6: The flowchart of the LSTM-RNN-experiments.]

Dynamic classification models are slower and take a long time to train compared with static classification models, due to their complex deep structure as well as the huge amount of computations required to execute them. Furthermore, dynamic classification models require a very large number of input samples to gain high accuracy values.

In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection on these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of two classes, either a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Furthermore, sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. According to the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of research has been accomplished in the literature in the field of text classification using deep learning models. It started with LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN), known as the LeNet family, and used it in text classification efficiently [44]. Then, various studies were published to introduce text classification algorithms as well as the factors that impact performance [45-47]. In the study [48], the CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchical approach called Hierarchical Deep Learning for Text classification (HDLTex) is proposed, and three deep structures, which are DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification within a multitask learning framework [53]. The study [54] proposed a new model called the hierarchical attention network for document classification, which is tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the most used deep learning model for text classification tasks. So, we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model that is biologically inspired by the animal visual cortex. The CNN can be considered a special type of the traditional feed-forward Artificial Neural Network. The major difference between the ANN and the CNN is that, instead of the fully connected architecture of the ANN, the individual neurons in the CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data is entered into the CNN. The first convolutional layer in the CNN consists of individual neurons that are each connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters, and each filter is applied to the specified subset of its preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input. The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of the preceding layer's output. There are many pooling nonlinear functions that can be used, but among them, max-pooling, which selects the maximum value in the given pooling window, is the most used. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked pairs of convolutional and max-pooling layers to extract features from the entire input and then map these features to the next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to the hidden layers in the DNN; that is, the neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of the CNN proposed in the literature, but the LeNet structure, proposed by LeCun et al. [44], is the most common approach, used in many applications of computer vision and text classification.

[Figure 7: The architecture of the used CNN model: the user's command text files are quantized in the input layer, passed through six convolution/max-pooling pairs (C1/P1, ..., C6/P6), then through two fully connected dropout layers of 2048 sigmoid neurons each, and finally through an output dense layer of 2 softmax neurons (0 = normal, 1 = masquerader).]

Regarding its stability and high efficiency in text classification, we selected the CNN model proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place, where the used model encodes all letters in the input text file using a one-hot representation over a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the dropout type, with a dropout probability equal to 0.5; they have a Sigmoid nonlinear activation function as well as the same size of 2048 neurons each. The output layer in the used CNN model is a dense layer with a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate=0.01, epochs=30, and batch size=64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].
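A condensed Keras sketch of such a model is given below, following the description above (six convolution/max-pooling pairs, two dropout fully connected layers of 2048 sigmoid neurons, and a 2-neuron softmax output, trained via SGD with learning rate 0.01); the filter counts, kernel sizes, pool sizes, and input length are assumptions, since the text does not specify them.

```python
# A sketch of the used character-level CNN (assumed filter/kernel/pool sizes).
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras.optimizers import SGD

ALPHABET_SIZE = 70     # one-hot encoding over a 70-character alphabet
MAX_LENGTH = 1014      # assumed input length per command text file

model = Sequential()
for i in range(6):     # six convolution + max-pooling pairs (C1/P1 ... C6/P6)
    kwargs = {"input_shape": (MAX_LENGTH, ALPHABET_SIZE)} if i == 0 else {}
    model.add(Conv1D(256, kernel_size=3, activation="relu", **kwargs))
    model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(2048, activation="sigmoid"))   # fully connected dropout layer 1
model.add(Dropout(0.5))
model.add(Dense(2048, activation="sigmoid"))   # fully connected dropout layer 2
model.add(Dropout(0.5))
model.add(Dense(2, activation="softmax"))      # 0 = normal, 1 = masquerader
model.compile(optimizer=SGD(lr=0.01), loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=30, batch_size=64)
```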

In our work, we used the CNN model to perform the dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So, we will have six separate CNN-experiments, and each experiment will be on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all the given data configuration's text files such that each file represents the training and test sets of a user in D. Next to that, for each user U_i in D, where i=1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we also moved each block in the training and test sets of the user U_i to a separate text file. This means that each of the training and test sets of the user U_i consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then, we proceed to the next user in D to do the same previous steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.
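The input quantization step can be sketched as follows; the alphabet string below is an assumption following the character-level encoding of Zhang et al. [50], and the maximum length is illustrative.

```python
# One-hot quantization of a command text file over a fixed character alphabet.
import numpy as np

ALPHABET = ("abcdefghijklmnopqrstuvwxyz0123456789"
            "-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}\n")   # assumed alphabet
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, max_length=1014):
    """Encode a text file as a (max_length, |alphabet|) one-hot matrix."""
    onehot = np.zeros((max_length, len(ALPHABET)), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:max_length]):
        idx = CHAR_INDEX.get(char)
        if idx is not None:          # out-of-alphabet characters stay all-zero
            onehot[pos, idx] = 1.0
    return onehot
```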

6. Results and Discussion

We carried out three major empirical experiments, which are the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.

[Figure 8: The flowchart of the CNN-experiments.]

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class     Predicted Class
                 Normal User    Masquerader
Normal User      TN             FP
Masquerader      FN             TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with TensorFlow 1.6 [61, 62] as backend, running over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition to that, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes of that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The definitions of the used evaluation metrics and their corresponding equations are as follows.

(i) Accuracy shows the rate of true detection over all test sets:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

(ii) Precision shows the rate of correctly classified masqueraders among all blocks in the test set that are classified as masqueraders:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (8)$$

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

$$\text{Recall} = \frac{TP}{TP + FN} \quad (9)$$

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

$$F_{1}\text{-}\text{Score} = \frac{2}{\frac{1}{P} + \frac{1}{R}} \quad (10)$$

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

$$\text{Hit Rate} = \frac{TP}{TP + FN} \quad (11)$$

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 - Hit); i.e., it shows the rate of masquerade blocks that are misclassified as a normal user among all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

$$\text{Miss Rate} = \frac{FN}{FN + TP} \quad (12)$$


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

$$\text{False Alarm Rate} = \frac{FP}{FP + TN} \quad (13)$$

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

$$\text{Cost} = MR + 6 \times FAR \quad (14)$$

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (base rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then, BDR can be computed as the probability P(I | A) according to (15) [65]:

$$\text{BDR} = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^{*}) \times P(A \mid I^{*})} \quad (15)$$

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior indicates really a normal user [65]. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then, BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

$$\text{BTNR} = P(I^{*} \mid A^{*}) = \frac{P(I^{*}) \times P(A^{*} \mid I^{*})}{P(I^{*}) \times P(A^{*} \mid I^{*}) + P(I) \times P(A^{*} \mid I)} \quad (16)$$

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 - FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold, where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

$$g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \quad (17)$$

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of -1 to 1, where -1 indicates a completely wrong binary classifier, while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

$$\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \quad (18)$$
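The twelve metrics above can be computed directly from the aggregated outcomes of Eqs. (3)-(6), as in the following sketch; only the function name and dictionary layout are ours.

```python
# Compute the twelve evaluation metrics from the overall TP, FP, TN, FN.
from math import sqrt

def evaluation_metrics(TP, FP, TN, FN):
    P = TP + FN                                 # actual masquerader blocks
    N = TN + FP                                 # actual normal blocks
    hit, miss, far = TP / P, FN / P, FP / N
    p_i, p_i_star = P / (P + N), N / (P + N)    # base rates of each class
    precision, recall = TP / (TP + FP), TP / (TP + FN)
    return {
        "Accuracy": (TP + TN) / (P + N),                      # Eq. (7)
        "Precision": precision,                               # Eq. (8)
        "Recall": recall,                                     # Eq. (9)
        "F1-Score": 2 / (1 / precision + 1 / recall),         # Eq. (10)
        "Hit": hit, "Miss": miss, "FAR": far,                 # Eqs. (11)-(13)
        "Cost": miss + 6 * far,                               # Eq. (14)
        "BDR": (p_i * hit) / (p_i * hit + p_i_star * far),    # Eq. (15)
        "BTNR": (p_i_star * (1 - far))
                / (p_i_star * (1 - far) + p_i * miss),        # Eq. (16)
        "g-mean": sqrt((TP * TN) / (P * N)),                  # Eq. (17)
        "MCC": (TP * TN - FP * FN)
               / sqrt(P * (TP + FP) * N * (TN + FN)),         # Eq. (18)
    }
```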

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model in detecting masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1, as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. The rows labeled DNN and LSTM-RNN in Table 7 show the results of static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show the results of dynamic masquerade detection using the CNN model. Furthermore, within every data configuration, the best results (marked in bold in the original table) are those of the CNN model, whereas underlined values there marked the best results across all data configurations.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of the DNN hyperparameters that maximize the accuracy, which means that the sum of the TP and TN outcomes is increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN definitely leads to an increase of the value of Hit as well as a decrease of the value of FAR.

Table 7: The results of our experiments (evaluation metrics in %).

Data Configuration    Model      Accuracy  Precision  Recall  F1-Score  Hit    Miss   FAR   Cost   BDR    BTNR   g-mean  MCC
SEA                   DNN        98.08     76.26      84.85   80.33     84.85  15.15  1.28  22.83  76.25  99.26  91.52   79.45
SEA                   LSTM-RNN   98.52     82.30      86.58   84.39     86.58  13.42  0.90  18.83  82.33  99.34  92.63   83.64
SEA                   CNN        98.84     87.77      87.01   87.39     87.01  12.99  0.59  16.51  87.72  99.37  93      86.78
SEA 1v49              DNN        96.54     99.98      96.43   98.17     96.43  3.57   0.48  6.47   99.98  52.04  97.96   70.64
SEA 1v49              LSTM-RNN   97.86     99.98      97.79   98.87     97.79  2.21   0.38  4.48   99.98  63.70  98.7    78.74
SEA 1v49              CNN        98.78     99.99      98.74   99.36     98.74  1.26   0.19  2.40   99.99  75.51  99.27   86.22
Greenberg Truncated   DNN        93.97     92.23      80.67   86.06     80.67  19.33  2.04  31.57  92.22  94.41  88.89   82.53
Greenberg Truncated   LSTM-RNN   94.72     94.88      81.53   87.70     81.53  18.47  1.32  26.39  94.87  94.68  89.7    84.76
Greenberg Truncated   CNN        95.43     96.16      83.53   89.40     83.53  16.47  1.0   22.47  96.16  95.24  90.94   86.86
Greenberg Enriched    DNN        97.57     96.92      92.40   94.61     92.40  7.60   0.88  12.88  96.92  97.75  95.7    93.08
Greenberg Enriched    LSTM-RNN   97.98     97.57      93.60   95.54     93.60  6.40   0.70  10.60  97.56  98.10  96.41   94.28
Greenberg Enriched    CNN        98.60     98.55      95.33   96.92     95.33  4.67   0.42  7.19   98.55  98.61  97.43   96.03
PU Truncated          DNN        81.0      99.59      78.61   87.86     78.61  21.39  2.25  34.89  99.59  39.49  87.66   54.63
PU Truncated          LSTM-RNN   82.19     99.69      79.89   88.70     79.89  20.11  1.75  30.61  99.68  41.10  88.6    56.46
PU Truncated          CNN        83.75     99.74      81.64   89.79     81.64  18.36  1.50  27.36  99.73  43.38  89.68   58.79
PU Enriched           DNN        90.44     99.84      89.21   94.23     89.21  10.79  1.0   16.79  99.84  56.72  93.98   70.64
PU Enriched           LSTM-RNN   91.31     99.88      90.18   94.78     90.18  9.82   0.75  14.32  99.88  59.08  94.61   72.61
PU Enriched           CNN        93.75     99.92      92.93   96.30     92.93  7.07   0.50  10.07  99.92  66.78  96.16   78.52


On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models in terms of all used evaluation metrics over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same holds for the PU dataset, where its PU Enriched data configuration yields better results for all models than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, regarding all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Because of that, enriched data configurations help models build a user's behavior profile more accurately than truncated data configurations. Regarding all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted, for all used models, from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset, as well as across all datasets.

By taking an inspective look at Figures 9 and 10, we can notice the stability of the deep learning models, in the sense that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain that, we will discuss the obtained results from the perspective of static and dynamic masquerade detection techniques.

Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient. Each panel compares DNN, LSTM-RNN, and CNN across the six data configurations.

We used the DNN and LSTM-RNN models to perform a static masquerade detection task on data configurations with static numeric features. The DNN, as well as the LSTM-RNN, is supported with a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Owing to this, our DNN and LSTM-RNN models output the best masquerade detection outcomes they can reach for every user in the particular data configuration. Accordingly, their performance is enhanced significantly on that particular data configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. In all cases, LSTM-RNN performed better than DNN in terms of all used evaluation metrics on all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy, and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is the command text files of each user in the particular data configuration. The obtained results clearly show that the CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relation between a user's individual commands can be recognized. The extracted features are then passed to its fully connected layers, which train themselves to build the user's normal profile; this profile is used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, such as Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, dynamic masquerade detection is better than the static masquerade detection technique. This gives the impression that dynamic masquerade detection is the best choice for masquerade detection on UNIX command line-based datasets, due to the fact that these datasets are originally textual, and converting them to static numeric datasets may cause a considerable loss of information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models obtained high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used deep learning models obtained high g-mean percentages for all data configurations. The same holds for the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.


Figure 10: Evaluation metrics comparison for the average performance of the models on datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient.


Table 8: The results of statistical tests. FS and FC are the Friedman test statistic and critical value; W and P-value are the Wilcoxon statistic and probability for the paired groups p1=(DNN, LSTM-RNN), p2=(DNN, CNN), and p3=(LSTM-RNN, CNN).

Measurement | FS | FC | p1 (W, P-value) | p2 (W, P-value) | p3 (W, P-value)
TP          | 12 | 7  | 0, 0.0025       | 0, 0.0025       | 0, 0.0025
FP          | 12 | 7  | 0, 0.0025       | 0, 0.0025       | 0, 0.0025
TN          | 12 | 7  | 0, 0.0025       | 0, 0.0025       | 0, 0.0025
FN          | 12 | 7  | 0, 0.0025       | 0, 0.0025       | 0, 0.0025


In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume that the data comes from a particular distribution. In our case, we have three repeated treatments (k=3), one for each of the used deep learning models, and six subjects (N=6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k=2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N=6) in every treatment and three paired groups, namely, p1=(DNN, LSTM-RNN), p2=(DNN, CNN), and p3=(LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α=0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
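For reproducibility, both tests are available in SciPy. The sketch below applies them to hypothetical per-model score vectors (N=6 data configurations, k=3 models); the numbers are placeholders, not the paper's actual TP/FP/TN/FN counts.

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-model scores over the six data configurations (N=6, k=3);
# substitute the real per-model measurements to reproduce Table 8.
dnn      = [310, 120, 260, 450, 150, 200]
lstm_rnn = [320, 125, 270, 460, 155, 210]
cnn      = [330, 130, 280, 470, 160, 220]

fs, p = friedmanchisquare(dnn, lstm_rnn, cnn)
print(f"Friedman FS={fs:.3f}, p={p:.4f}")  # reject H0 if FS exceeds FC

pairs = {"p1=(DNN,LSTM-RNN)": (dnn, lstm_rnn),
         "p2=(DNN,CNN)": (dnn, cnn),
         "p3=(LSTM-RNN,CNN)": (lstm_rnn, cnn)}
for name, (a, b) in pairs.items():
    w, pv = wilcoxon(a, b)
    print(f"{name}: W={w}, p={pv:.4f}")  # reject H0 if p < alpha = 0.05
```

With monotone scores like these, each row ranks the models identically, which is exactly the situation that yields FS = 12 for N=6 and k=3, matching Table 8.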

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS>FC. This means that the scores of the used deep learning models for each measurement are different. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models. In our study, we obtained a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different. Finally, the reason for the identical results across all measurements is that the models, in the order (CNN, LSTM-RNN, DNN), have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations.


Figures 12(a)-12(e) show a comparison between the performance of the traditional machine learning models and the used deep learning models, in terms of Hit and FAR percentages, for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of traditional machine learning and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN, as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased Hit percentages by 2-10% and decreased FAR percentages by 1-10% compared with the traditional machine learning models in most cases.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, which means that 50% of the performance is achieved. The top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.


Figure 12: Models performance comparison for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched. Each panel plots the Hit and FAR percentages of the traditional machine learning models from the literature (Naive Bayes, Conditional Naive Bayes, SVM, HMM, or Tree-based, depending on the data configuration) and of DNN, LSTM-RNN, and CNN.

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, achieve the most effective masquerade detection performance over all data configurations. However, all three deep learning models still show a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing quantitatively between various ROC curves [73]. The AUC value of a ROC curve lies between 0 and 1, and the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13. We can clearly notice that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.

Table 9: AUC values of ROC curves of the used models.

Model     | AUC
DNN       | 0.9246
LSTM-RNN  | 0.9385
CNN       | 0.9617

Figure 13: ROC curves of the average performance of the used models over all data configurations.
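For completeness, here is a minimal scikit-learn sketch that computes a ROC curve and its AUC; the labels and scores below are placeholders, not the study's outputs.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical outputs: y_true holds block labels (1 = masquerader),
# y_score holds the model's predicted probability for the masquerader class.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FAR vs. Hit per threshold
print("AUC =", auc(fpr, tpr))  # 1.0 is the ideal classifier, 0.5 the diagonal
```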

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Although various research studies have focused on masquerade detection for more than a decade, in-depth studies of this field that utilize deep learning models are seldom found. In this paper, we presented an extensive empirical study for masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most widely used in the literature. In addition to that, we implemented six different data configurations from these datasets. The masquerade detection on these data configurations was carried out using two approaches: the first is static and the second is dynamic. The static approach was performed using the DNN and LSTM-RNN models, which were applied to data configurations with static numeric features, whereas the dynamic approach was performed using the CNN model, which extracted features from the user's command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and was used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models achieved high performance in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static detection. However, the results analyses proved the effectiveness of all used models in masquerade detection, in such a way that they increased Accuracy and Hit percentages as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools that can be used in the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon request from its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service (PlatCon 2016), Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN 2002), pp. 219-228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security (ICCIAS 2006), pp. 625-628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.

[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM 2007), pp. 3676-3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), pp. 824-827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS 2012), pp. 2951-2959, USA, December 2012.

[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2016), pp. 141-146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD 2016), pp. 69-74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference (GECCO 2014), pp. 1451-1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation (CEC 2017), pp. 1659-1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference (GECCO 2017), pp. 481-488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion (GECCO 2017), pp. 1864-1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), vol. 3, pp. 1945-1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm: explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.

[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267-2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org/.

[58] "NumPy," http://www.numpy.org/.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io/.

[61] M. Abadi, A. Agarwal, P. Barham, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] "TensorFlow," https://www.tensorflow.org/.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.

[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.



Table 5: The used DNN hyperparameters and their domains.

Hyperparameter                      | Domain        | Description
Learning rate                       | [0.01, 0.9]   | Continuous
Momentum                            | [0.1, 0.9]    | Continuous
Decay                               | [0.001, 0.01] | Continuous
Dropout rate                        | [0.1, 0.9]    | Continuous
Number of hidden layers             | [1, 10]       | Discrete with step=1
Numbers of neurons of hidden layers | [1, 100]      | Discrete with step=1
Number of epochs                    | [5, 20]       | Discrete with step=5
Batch size                          | [100, 1000]   | Discrete with step=50
Optimizer                           | [1, 6]        | Discrete with step=1
Initialization function             | [1, 8]        | Discrete with step=1
Layer type                          | [1, 2]        | Discrete with step=1
Activation function                 | [1, 8]        | Discrete with step=1

RMSprop, and SGD. The layer type list contains two elements, which are Dropout and Dense. The initialization function list includes the elements Zero, Normal, Lecun uniform, Uniform, Glorot uniform, Glorot normal, He uniform, and He normal. Finally, the activation list has eight elements, which are Linear, Softmax, ReLU, Sigmoid, Tanh, Hard Sigmoid, Softsign, and Softplus. It is worth mentioning that the elements of all categorical hyperparameters are defined in the Keras implementation [30].
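To make the encoding concrete, the following minimal sketch (ours, not the paper's code) decodes a 12-dimensional particle position, ordered as in Table 5, into concrete DNN hyperparameters. The optimizer list is only partially visible in this excerpt (it ends with RMSprop and SGD), so its first four entries below are assumed Keras optimizer names.

```python
import random

OPTIMIZERS  = ["Adam", "Adamax", "Adagrad", "Adadelta", "RMSprop", "SGD"]
LAYER_TYPES = ["Dropout", "Dense"]
INITS       = ["zero", "normal", "lecun_uniform", "uniform",
               "glorot_uniform", "glorot_normal", "he_uniform", "he_normal"]
ACTIVATIONS = ["linear", "softmax", "relu", "sigmoid",
               "tanh", "hard_sigmoid", "softsign", "softplus"]

def decode(position):
    """Map a 12-dimensional particle position onto the domains of Table 5."""
    lr, momentum, decay, dropout, n_hidden, n_neurons, epochs, batch, \
        opt, init, layer, act = position
    return {
        "learning_rate": lr, "momentum": momentum, "decay": decay,
        "dropout_rate": dropout,
        "hidden_layers": int(n_hidden), "neurons": int(n_neurons),
        "epochs": int(epochs), "batch_size": int(batch),
        "optimizer": OPTIMIZERS[int(opt) - 1],      # discrete domain [1, 6]
        "init": INITS[int(init) - 1],               # discrete domain [1, 8]
        "layer_type": LAYER_TYPES[int(layer) - 1],  # discrete domain [1, 2]
        "activation": ACTIVATIONS[int(act) - 1],    # discrete domain [1, 8]
    }

# Example: a random particle drawn uniformly from the domains of Table 5
particle = [random.uniform(0.01, 0.9), random.uniform(0.1, 0.9),
            random.uniform(0.001, 0.01), random.uniform(0.1, 0.9),
            random.randint(1, 10), random.randint(1, 100),
            random.randrange(5, 25, 5), random.randrange(100, 1050, 50),
            random.randint(1, 6), random.randint(1, 8),
            random.randint(1, 2), random.randint(1, 8)]
print(decode(particle))
```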

The optimization and results extraction stages are performed once for each user in the particular data configuration; that is, they are repeated for each user U_i, i=1,2,...,M, where M is the number of users in the particular data configuration D. The optimization stage starts by splitting the data of the user U_i into two independent sets T_i and Z_i, which are the training and test sets of the ith user, respectively. The splitting process follows the structure of the particular data configuration, which is described in Section 3. All blocks of the training and test sets are converted from text to numeric values and then normalized in [0, 1]. After that, we supply these sets to the proposed PSO-based algorithm to find the optimized hyperparameters vector H_i for the ith user. In addition, we save a copy of the H_i values in a database in order to save time and use them again in the RNN-experiment of that particular data configuration D, as will be presented in Section 5.1.2. The results extraction stage takes place by constructing the DNN that is tuned by H_i, training the DNN on T_i, and testing the DNN on Z_i. The values of the classification outcomes True Positive (TP_i), False Positive (FP_i), True Negative (TN_i), and False Negative (FN_i) for the ith user in the particular data configuration D are extracted and saved for further processing later.

Then the next user is observed, and the same procedure of the optimization and results extraction stages is performed until the last user in the particular data configuration D is reached. Finally, when all users in the particular data configuration are completed, the last stage (the finishing stage) is executed. The finishing stage computes the summation of all obtained TPs of all users in the particular data configuration D, denoted by TP. The same process is also applied to the other outcomes, namely, FP, TN, and FN. Equations (3), (4), (5), and (6) express the formulas of TP, FP, TN, and FN, respectively:

\[ TP = \sum_{i=1}^{M} TP_i \quad (3) \]
\[ FP = \sum_{i=1}^{M} FP_i \quad (4) \]
\[ TN = \sum_{i=1}^{M} TN_i \quad (5) \]
\[ FN = \sum_{i=1}^{M} FN_i \quad (6) \]

The finishing stage reports and saves these outcomes and ends the DNN-experiment for the particular data configuration D. These outcomes are then used to compute twelve well-known evaluation metrics to assess the performance of the DNN on the particular data configuration D, as will be presented in Section 6. It is worth saying that the same procedure explained above is done for each data configuration. Figure 4 depicts the flowchart of the methodology of the DNN-experiments.

Figure 4: The flowchart of the DNN-experiments.
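The finishing stage's aggregation in (3)-(6) amounts to a simple summation loop; the sketch below is ours, with evaluate_user standing in for the per-user train/test procedure described above.

```python
def run_experiment(users, evaluate_user):
    """Aggregate per-user confusion outcomes into overall TP, FP, TN, FN
    as in (3)-(6); evaluate_user(u) stands in for the train/test procedure
    and must return the tuple (TP_i, FP_i, TN_i, FN_i) for user u."""
    totals = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for u in users:
        tp_i, fp_i, tn_i, fn_i = evaluate_user(u)
        totals["TP"] += tp_i
        totals["FP"] += fp_i
        totals["TN"] += tn_i
        totals["FN"] += fn_i
    return totals

# Example with dummy per-user outcomes for three users
print(run_experiment(range(3), lambda u: (8, 1, 90, 2)))
```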

5.1.2. Recurrent Neural Networks. The Recurrent Neural Network (RNN) is a special type of the traditional feed-forward Artificial Neural Network. Unlike the traditional ANN, in the RNN each neuron in any of the hidden layers has additional connections from its output to itself (self-recurrent) as well as to the other neurons of the same hidden layer. Therefore, the output of an RNN's hidden layer at any time step (t) depends on the current inputs and on the output of the hidden layer at the previous time step (t-1). In the RNN, these directed cycles allow information to circulate in the network and make the hidden layers act as the storage unit of the whole network [41]. The important characteristics of the RNN are the capability to have memory and to generate periodical sequences.

Despite that, the conventional RNN structure described above has a serious problem, especially when the RNN is trained using the back-propagation technique.

The problem is known as gradient vanishing and exploding [42]. The gradient vanishing problem occurs when the gradient signal gets so small over the network that learning becomes very slow or stops. On the other hand, the gradient exploding problem occurs when the gradient signal gets so large that learning diverges. This problem of the conventional RNN limited its use to short-term memory tasks only. To solve this problem, a new RNN architecture, known as Long Short-Term Memory (LSTM), was proposed by Hochreiter and Schmidhuber [43]. LSTM uses a new structure called a memory cell, which is composed of four parts: an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. While the main goal of using a neuron with a self-recurrent connection is to record information, the aim of using three gates is to control the flow of information from or into the memory cell. The input gate decides whether to allow the incoming information to enter into the memory cell or block it. Moreover, the forget gate controls whether to pass the previous state of the memory cell to alter the current state of the memory cell or prevent it. Finally, the output gate determines whether to pass the output of the memory cell or not. Figure 5 shows the structure of an LSTM memory cell. Besides overcoming the problems of the conventional RNN, the LSTM model also outperforms the conventional RNN in terms of performance, especially in long-term memory tasks [5]. The LSTM-RNN model can be obtained by replacing every neuron in the hidden layers of the RNN with an LSTM memory cell [6].

Figure 5: The structure of an LSTM cell [6].
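For reference, the gate computations of a standard LSTM cell can be written as follows; this is the common formulation derived from [43], and the exact variant used in this study may differ in minor details:

\[
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) &&\text{(input gate)}\\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) &&\text{(forget gate)}\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) &&\text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) &&\text{(cell state)}\\
h_t &= o_t \odot \tanh\left(c_t\right) &&\text{(cell output)}
\end{aligned}
\]

where \(x_t\) is the current input, \(h_{t-1}\) the previous output, \(c_t\) the cell state, \(\sigma\) the logistic sigmoid, and \(\odot\) elementwise multiplication.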

In this study, we used the LSTM-RNN model to perform a static masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate LSTM-RNN-experiments, each on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all the given data configuration's files by converting all blocks from text to numerical values and then normalizing them in [0, 1]. Next, for each user U_i in D, where i=1,2,...,M and M is the number of users in D, we performed the following steps: we split the data of U_i into two independent sets T_i and Z_i, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. After that, we retrieved the stored optimized hyperparameters vector of the ith user (H_i) from the database that was created in the previous DNN-experiments. Then we constructed the RNN model that is tuned by H_i. In order to obtain the LSTM-RNN model, every neuron in any of the hidden layers is replaced with an LSTM memory cell. The constructed LSTM-RNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then we proceed to the next user in D and repeat the same steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 6 depicts the flowchart of the methodology of the LSTM-RNN-experiments.

Figure 6: The flowchart of the LSTM-RNN-experiments.
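A minimal Keras sketch of such a per-user LSTM-RNN classifier follows. The default hyperparameter values stand in for the optimized vector H_i retrieved from the database, and the sigmoid output head with binary cross-entropy is our assumption; each block is treated as a sequence of block_len normalized command codes.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

def build_lstm_rnn(block_len, n_hidden=2, units=64, dropout=0.2,
                   optimizer="rmsprop"):
    """Sketch of a per-user LSTM-RNN: every hidden layer is made of LSTM
    memory cells; defaults stand in for the user's hyperparameter vector H_i."""
    model = Sequential()
    model.add(LSTM(units, return_sequences=(n_hidden > 1),
                   input_shape=(block_len, 1)))
    for i in range(1, n_hidden):
        model.add(LSTM(units, return_sequences=(i < n_hidden - 1)))
    model.add(Dropout(dropout))
    model.add(Dense(1, activation="sigmoid"))  # predicted P(masquerader)
    model.compile(loss="binary_crossentropy", optimizer=optimizer,
                  metrics=["accuracy"])
    return model

model = build_lstm_rnn(block_len=100)
model.summary()
# model.fit(T_i_x, T_i_y, epochs=10, batch_size=100)  # train on T_i (hypothetical arrays)
# scores = model.predict(Z_i_x)                       # test on Z_i
```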

5.2. Dynamic Classification Approach. In contrast to the static classification approach, the dynamic classification approach does not need a ready-to-use dataset with static features [30]. It deals directly with raw data sources, such as text, image, video, sound, and signal files, and extracts features from them dynamically. The models that use this approach try to learn and represent features in an unsupervised manner; then these models train themselves using the extracted features to be able to classify unseen data. Deep learning models fit this approach very well because the main objectives of deep learning models are the strong ability of automatic feature extraction and self-learning. Besides overcoming the problem of the lack of datasets, dynamic classification models perform more efficiently than static classification models. Despite these advantages, the dynamic classification approach also has drawbacks. Dynamic classification models are slower and take a long time to train compared with static classification models, due to the complex deep structure of these models as well as the huge amount of computations that are required. Furthermore, dynamic classification models require a very large number of input samples to achieve high accuracy values.

In this research, we used six data configurations that are implemented from three textual datasets. In order to apply dynamic masquerade detection on these data configurations, we need a model that is able to extract features from the user's command text file dynamically and then classify the user into one of two classes, either a normal user or a masquerader. Therefore, we deal with a text classification task. Text classification is defined as a task that assigns a piece of text (a word, a sentence, or even a document) to one or more classes according to its content. Indeed, there are three types of text classification, namely, sentence classification, sentiment analysis, and document categorization. In sentence classification, a given sentence should be assigned correctly to one of the possible classes. Furthermore, sentiment analysis determines whether a given sentence is positive, negative, or neutral towards a specific subject. In contrast, document categorization deals with documents and determines which class, from a given set of possible classes, a document belongs to. According to the nature of dynamic classification as well as the functionality of text classification, deep learning models are the fittest among machine learning models for these types of classification, due to their powerful capability of feature learning.

A wide range of studies has been conducted in the literature in the field of text classification using deep learning models. It started with LeCun et al. in 1998, when they proposed a special topology of the Convolutional Neural Network (CNN), known as the LeNet family, and used it in text classification efficiently [44]. Then, various studies were published to introduce text classification algorithms as well as the factors that impact the performance [45-47]. In the study [48], the CNN model is used for a sentence classification task over a set of text dataset benchmarks. A single one-dimensional CNN is proposed to learn a region-based text embedding [49]. X. Zhang et al. introduced a novel character-based multidimensional CNN for text classification tasks with competitive results [50]. In the research [51], a new hierarchical approach called Hierarchical Deep Learning for Text classification (HDLTex) is proposed, and three deep structures, which are DNN, RNN, and CNN, are used. A recurrent convolutional network model is introduced in [52] for text classification, and high results are obtained on document-level datasets. A novel LSTM-based model is introduced and used for text classification with a multitask learning framework [53]. The study [54] proposed a new model called the hierarchical attention network for document classification, which was tested on six large document-level datasets with good results. A character-level text representation approach is proposed and tested for text classification tasks using a deep CNN [55]. As noticed, the CNN is the most used deep learning model for text classification tasks. Hence, we decided to use the CNN to perform dynamic masquerade detection on all data configurations. The following subsection reviews the CNN and explains the structure of the used CNN model and the methodology of our CNN-experiments.

5.2.1. Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a deep learning model that is biologically inspired by the animal visual cortex. The CNN can be considered a special type of the traditional feed-forward Artificial Neural Network. The major difference between the ANN and the CNN is that, instead of the fully connected architecture of the ANN, the individual neurons in the CNN are connected to subregions of the input field. The neurons of the CNN are arranged in such a way that they are tiled to cover the entire input field. The typical CNN consists of five main components, namely, an input layer, the convolutional layer, the pooling layer, the fully connected layer, and an output layer. The input layer is where the input data is entered into the CNN. The first convolutional layer in the CNN consists of individual neurons that are each connected to a small subset of the input field. The neurons in the next convolutional layers connect only to a subset of their preceding pooling layer's output. Moreover, the convolutional layers in the CNN use a set of learnable kernels or filters; each filter is applied to the specified subset of their preceding layer's output. These filters calculate feature maps, in which each feature map shares the same weights. The pooling layer, also known as a subsampling layer, is a nonlinear downsampling function that condenses subsets of its input. The main goal of using pooling layers in the CNN is to reduce the complexity and computations by reducing the size of their preceding layer's output. There are many pooling nonlinear functions that can be used, but among them, max-pooling is the most used; it selects the maximum value in the given pooling window. Typically, each convolutional layer in the CNN is followed by a max-pooling layer. The CNN has one or more stacked pairs of convolutional and max-pooling layers to extract features from the entire input and then map these features to the next fully connected layer. The top layers of the CNN are one or more fully connected layers, which are similar to the hidden layers in the DNN; this means that the neurons of the fully connected layers are connected to all neurons of the preceding layer. The output layer is the final layer in the CNN and is responsible for reporting the output value of the CNN. Finally, the back-propagation algorithm is usually used to train CNNs via Stochastic Gradient Descent (SGD) to adjust the weights of the fully connected layers [56]. There are several variant structures of the CNN proposed in the literature, but the LeNet structure, proposed by LeCun et al. [44], is the most common approach used in many applications of computer vision and text classification.

Regarding its stability and high efficiency in text classification, we selected the CNN model proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place: the used model encodes all letters in the input text file using a one-hot representation from a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the dropout-layer type, with a dropout probability equal to 0.5. In addition to that, the two fully connected layers have a Sigmoid nonlinear activation function as well as the same size of 2048 neurons each. The output layer in the used CNN model is of the dense-layer type; it has a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate=0.01, epochs=30, and batch size=64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].

Figure 7: The architecture of the used CNN model.
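As an illustration of the quantization step and the overall topology, the sketch below builds a rough Keras analogue of the used character-level CNN. The number of filters, the kernel and pool sizes, the maximum file length, and the exact 70-character alphabet (here, the alphabet of Zhang et al. [50]) are our assumptions; only the six convolution/max-pooling pairs, the two 2048-unit sigmoid dropout layers, the 2-way softmax output, and the SGD settings follow the description above.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras.optimizers import SGD

ALPHABET = ("abcdefghijklmnopqrstuvwxyz0123456789"
            "-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n")  # 70 characters, assumed
MAX_LEN = 1024  # assumed maximum length (in characters) of a block text file

def quantize(text):
    """One-hot encode a command text file over the 70-character alphabet."""
    x = np.zeros((MAX_LEN, len(ALPHABET)), dtype="float32")
    for i, ch in enumerate(text.lower()[:MAX_LEN]):
        j = ALPHABET.find(ch)
        if j >= 0:
            x[i, j] = 1.0
    return x

def build_char_cnn():
    """Six conv/max-pooling pairs, two 2048-unit sigmoid dropout layers,
    and a 2-way softmax output, trained with SGD (learning rate 0.01)."""
    model = Sequential()
    model.add(Conv1D(64, 3, padding="same", activation="relu",
                     input_shape=(MAX_LEN, len(ALPHABET))))
    model.add(MaxPooling1D(2))
    for _ in range(5):  # the remaining five convolution/max-pooling pairs
        model.add(Conv1D(64, 3, padding="same", activation="relu"))
        model.add(MaxPooling1D(2))
    model.add(Flatten())
    for _ in range(2):  # two fully connected dropout layers
        model.add(Dense(2048, activation="sigmoid"))
        model.add(Dropout(0.5))
    model.add(Dense(2, activation="softmax"))  # 0 = normal, 1 = masquerader
    model.compile(loss="categorical_crossentropy",
                  optimizer=SGD(lr=0.01), metrics=["accuracy"])
    return model

model = build_char_cnn()
# x = np.stack([quantize(open(f).read()) for f in training_files])  # per-user T_i
# model.fit(x, y, epochs=30, batch_size=64)
```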

In our work, we used the CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate CNN-experiments, and each experiment is on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all the given data configuration's text files such that each file represents the training and test sets of a user in D. Next, for each user U_i in D, where i=1,2,...,M and M is the number of users in D, we performed the following steps: we split the data of U_i into two independent sets T_i and Z_i, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we also moved each block in the training and test sets of the user U_i to a separate text file. This means that each of the training and test sets of the user U_i consists of a specified number of text files, where each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then we proceed to the next user in D and repeat the same steps until the last user in D is reached. After all users in D are completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.

6. Results and Discussion

We carried out three major empirical experiments, which are the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.


Figure 8: The flowchart of the CNN-experiments.

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class | Predicted: Normal User | Predicted: Masquerader
Normal User  | TN                     | FP
Masquerader  | FN                     | TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with TensorFlow 1.6 [61, 62], which backends over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition to that, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes for that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.
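For illustration, the four outcomes can be counted from true and predicted labels as in the short sketch below (1 = masquerader, 0 = normal user), matching Table 6; this is an explanatory snippet, not the authors' code:

```python
# Count the four outcomes of Table 6 from true and predicted block labels.
def confusion_outcomes(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

print(confusion_outcomes([1, 0, 0, 1], [1, 1, 0, 0]))  # -> (1, 1, 1, 1)
```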

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The definitions of the used evaluation metrics and their corresponding equations are as follows.

(i) Accuracy shows the rate of true detection over all test sets:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7} \]

(ii) Precision shows the rate of correctly classified masqueraders from all blocks in the test set that are classified as masqueraders:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{8} \]

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{9} \]

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

\[ \text{F1-Score} = \frac{2}{1/P + 1/R} \tag{10} \]

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

\[ \text{Hit Rate} = \frac{TP}{TP + FN} \tag{11} \]

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 − Hit); i.e., it shows the rate of masquerade blocks that are misclassified as a normal user from all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

\[ \text{Miss Rate} = \frac{FN}{FN + TP} \tag{12} \]


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

\[ \text{False Alarm Rate} = \frac{FP}{FP + TN} \tag{13} \]

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

\[ \text{Cost} = MR + 6 \times FAR \tag{14} \]

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (Base-Rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base-rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

\[ \text{Bayesian Detection Rate} = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^*) \times P(A \mid I^*)} \tag{15} \]

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior really indicates a normal user [65]. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

\[ \text{Bayesian True Negative Rate} = P(I^* \mid A^*) = \frac{P(I^*) \times P(A^* \mid I^*)}{P(I^*) \times P(A^* \mid I^*) + P(I) \times P(A^* \mid I)} \tag{16} \]

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 − FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold, where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

\[ \text{g-mean} = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \tag{17} \]

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of −1 to 1, where −1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

\[ \text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \tag{18} \]
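To make the twelve definitions above concrete, the following illustrative helper (not from the paper) computes all of them from the four raw outcomes:

```python
import math

def evaluation_metrics(tp, fp, tn, fn):
    """Compute the twelve metrics of equations (7)-(18) from raw outcomes."""
    total = tp + fp + tn + fn
    p_mask = (tp + fn) / total        # P(I): rate of masquerader blocks
    p_norm = (fp + tn) / total        # P(I*): rate of normal blocks
    hit = tp / (tp + fn)              # (11), equals Recall (9)
    miss = fn / (fn + tp)             # (12)
    far = fp / (fp + tn)              # (13)
    return {
        "Accuracy": (tp + tn) / total,                        # (7)
        "Precision": tp / (tp + fp),                          # (8)
        "Recall": hit,                                        # (9)
        "F1-Score": 2 * tp / (2 * tp + fp + fn),              # (10), = 2/(1/P+1/R)
        "Hit": hit, "Miss": miss, "FAR": far,
        "Cost": miss + 6 * far,                               # (14)
        "BDR": p_mask * hit / (p_mask * hit + p_norm * far),  # (15)
        "BTNR": p_norm * (1 - far)
                / (p_norm * (1 - far) + p_mask * miss),       # (16)
        "g-mean": math.sqrt(tp * tn / ((tp + fn) * (tn + fp))),  # (17)
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)),  # (18)
    }

# Example with synthetic counts:
print(evaluation_metrics(tp=85, fp=10, tn=890, fn=15))
```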

In the following two subsections, we will present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model to detect masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1 as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Actually, the rows labeled by DNN and LSTM-RNN in Table 7 show results of the static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled by CNN in Table 7 show results of the dynamic masquerade detection using the CNN model. Furthermore, the bold rows represent the best results within the same data configuration, whereas the underlined values are the best across all data configurations.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of DNN hyperparameters that maximized the accuracy, which means that the sum of the TP and TN outcomes will be increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN will definitely lead to an increase in the value of Hit as well as to a decrease in the value of FAR. Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than


Table 7: The results of our experiments (all evaluation metrics in %).

Dataset            Data Configuration    Model      Accuracy  Precision  Recall  F1-Score  Hit    Miss   FAR   Cost   BDR    BTNR   g-mean  MCC

SEA Dataset        SEA                   DNN        98.08     76.26      84.85   80.33     84.85  15.15  1.28  22.83  76.25  99.26  91.52   79.45
                                         LSTM-RNN   98.52     82.30      86.58   84.39     86.58  13.42  0.90  18.83  82.33  99.34  92.63   83.64
                                         CNN        98.84     87.77      87.01   87.39     87.01  12.99  0.59  16.51  87.72  99.37  93.00   86.78
                   SEA 1v49              DNN        96.54     99.98      96.43   98.17     96.43  3.57   0.48  6.47   99.98  52.04  97.96   70.64
                                         LSTM-RNN   97.86     99.98      97.79   98.87     97.79  2.21   0.38  4.48   99.98  63.70  98.70   78.74
                                         CNN        98.78     99.99      98.74   99.36     98.74  1.26   0.19  2.40   99.99  75.51  99.27   86.22
Greenberg Dataset  Greenberg Truncated   DNN        93.97     92.23      80.67   86.06     80.67  19.33  2.04  31.57  92.22  94.41  88.89   82.53
                                         LSTM-RNN   94.72     94.88      81.53   87.70     81.53  18.47  1.32  26.39  94.87  94.68  89.70   84.76
                                         CNN        95.43     96.16      83.53   89.40     83.53  16.47  1.00  22.47  96.16  95.24  90.94   86.86
                   Greenberg Enriched    DNN        97.57     96.92      92.40   94.61     92.40  7.60   0.88  12.88  96.92  97.75  95.70   93.08
                                         LSTM-RNN   97.98     97.57      93.60   95.54     93.60  6.40   0.70  10.60  97.56  98.10  96.41   94.28
                                         CNN        98.60     98.55      95.33   96.92     95.33  4.67   0.42  7.19   98.55  98.61  97.43   96.03
PU Dataset         PU Truncated          DNN        81.0      99.59      78.61   87.86     78.61  21.39  2.25  34.89  99.59  39.49  87.66   54.63
                                         LSTM-RNN   82.19     99.69      79.89   88.70     79.89  20.11  1.75  30.61  99.68  41.10  88.60   56.46
                                         CNN        83.75     99.74      81.64   89.79     81.64  18.36  1.50  27.36  99.73  43.38  89.68   58.79
                   PU Enriched           DNN        90.44     99.84      89.21   94.23     89.21  10.79  1.00  16.79  99.84  56.72  93.98   70.64
                                         LSTM-RNN   91.31     99.88      90.18   94.78     90.18  9.82   0.75  14.32  99.88  59.08  94.61   72.61
                                         CNN        93.75     99.92      92.93   96.30     92.93  7.07   0.50  10.07  99.92  66.78  96.16   78.52

the corresponding values of the SEA data configuration, the Hit values are dramatically increased in SEA 1v49 for all models, by 10–14% over those in the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49 compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models in terms of all used evaluation metrics over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened in the PU dataset, where its PU Enriched data configuration has better results regarding all models than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, regarding all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Due to that, enriched data configurations help models to build the user's behavior profile more accurately than truncated data configurations do. Regarding all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted, for all used models, from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 can give us a visual comparison of the performance of the used deep learning models for each data configuration and dataset as well as over all datasets.

By taking a close look at Figures 9 and 10, we can notice the stability of the deep learning models, in such a way that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain that, we will discuss the obtained results from the perspective of the static and dynamic masquerade detection techniques.


[Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient. Each panel plots the metric (%) for the DNN, LSTM-RNN, and CNN models, with one series per data configuration (SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, PU Enriched).]

We used the DNN and LSTM-RNN models to perform a static masquerade detection task on the data configurations with static numeric features. Both the DNN and the LSTM-RNN are supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Given the importance of this fact, our DNN and LSTM-RNN models output the best masquerade detection outcomes they can reach for every user in the particular data configuration. Accordingly, their performance will be enhanced significantly on that particular data configuration. This enhancement of their performance is also affected by the structure of the data configuration, which differs from one to another. In any case, LSTM-RNN performed better than DNN in terms of all used evaluation metrics regarding all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small, which is between 1% and 3% for Hit and Accuracy and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is the command text files of each user in the particular data configuration. The obtained results show clearly that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then presented to its fully connected layers to train itself to build the user's normal profile, which will be used later to detect masquerade attacks efficiently. This dynamic process and the self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, such as Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, dynamic masquerade detection is better than the static masquerade detection technique. This gives the impression that the dynamic masquerade detection technique is the best choice for masquerade detection regarding UNIX command line-based datasets, due to the fact that these datasets are originally textual datasets, and converting them to static numeric datasets may cost them a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR will increase as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR will increase as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used deep learning models got high g-mean percentages for all data configurations. The same thing happened with the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.


[Figure 10: Evaluation metrics comparison for the average performance of the models on datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient. Each panel plots the metric (%) for the DNN, LSTM-RNN, and CNN models averaged over the SEA, Greenberg, and PU datasets as well as over all datasets.]


Table 8: The results of the statistical tests.

                 Friedman Test        Wilcoxon Test
Measurement      FS      FC           p1: W, P value     p2: W, P value     p3: W, P value
TP               12      7            0, 0.0025          0, 0.0025          0, 0.0025
FP               12      7            0, 0.0025          0, 0.0025          0, 0.0025
TN               12      7            0, 0.0025          0, 0.0025          0, 0.0025
FN               12      7            0, 0.0025          0, 0.0025          0, 0.0025


In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume the data comes from a particular distribution. In our case, we have three repeated treatments (k = 3), one for each of the used deep learning models, and six subjects (N = 6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k = 2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N = 6) in every treatment and three paired groups, namely, p1 = (DNN, LSTM-RNN), p2 = (DNN, CNN), and p3 = (LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α = 0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
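As an illustration, both tests are available in SciPy; the minimal sketch below uses synthetic per-configuration scores (in the paper, the inputs are the per-configuration TP, FP, TN, and FN scores of the three models):

```python
# Illustrative use of the Friedman and Wilcoxon tests with SciPy.
# The scores below are synthetic: one value per data configuration (N = 6)
# for each of the k = 3 models.
from scipy.stats import friedmanchisquare, wilcoxon

dnn      = [11, 9, 14, 10, 13, 8]
lstm_rnn = [12, 10, 15, 11, 14, 9]
cnn      = [13, 11, 16, 12, 15, 10]

fs, p = friedmanchisquare(dnn, lstm_rnn, cnn)
print("Friedman statistic:", fs, "P value:", p)

w, p = wilcoxon(dnn, cnn)   # paired group p2 = (DNN, CNN)
print("Wilcoxon W:", w, "P value:", p)
```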

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS > FC. This means that the scores of the used deep learning models for each measurement are different. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models. In our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different. Finally, the reason for the identical results across all measurements is that the models, in the order (CNN, LSTM-RNN, DNN), have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

[Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations, ranking the models (1 = best) in the order CNN, LSTM-RNN, DNN.]
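As a worked check, and assuming the diagram follows the Nemenyi post-hoc procedure described by Demšar [72], the reported CD value can be reproduced with k = 3 models, N = 6 data configurations, and the critical value q_0.05 ≈ 2.343 for three classifiers:

\[ CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}} = 2.343\sqrt{\frac{3 \times 4}{6 \times 6}} \approx 2.343 \times 0.5774 \approx 1.353 \]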

Figures 12(a)-12(e) show a comparison between the performance of the traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1 as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased the Hit percentages by 2-10% as well as decreased the FAR percentages by 1-10% compared with those of the traditional machine learning models in most cases.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, which means that 50% of the performance is achieved. The top-left corner of the ROC means the best performance with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.


[Figure 12: Models performance comparison (Hit and FAR percentages) between the best traditional machine learning models in the literature (Naive Bayes, Conditional Naive Bayes, SVM, HMM, and tree-based models) and the used deep learning models (DNN, LSTM-RNN, CNN) for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched.]

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, have the most effective masquerade detection performance over all data configurations. However, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure to compare quantitatively between various ROC curves [73]. The AUC value of a ROC curve should be between 0 and 1. The ideal classifier will have an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13.

We can notice clearly that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
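For readers who wish to reproduce this kind of analysis, ROC curves and AUC values can be computed with scikit-learn; the snippet below is an illustrative sketch with synthetic scores, not the authors' code:

```python
# Sketch of how ROC curves and AUC values like those in Table 9 can be
# produced with scikit-learn (synthetic labels and scores for illustration).
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                   # 1 = masquerader block
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]  # model output scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))
```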

Table 9: AUC values of the ROC curves of the used models.

Model        AUC
DNN          0.9246
LSTM-RNN     0.9385
CNN          0.9617

[Figure 13: ROC curves (True Positive Rate against False Positive Rate) of the average performance of the used models (DNN, LSTM-RNN, and CNN) over all data configurations.]

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Although various research studies have focused on masquerade detection for more than a

decade, deep studies of that field utilizing deep learning models are seldom found. In this paper, we presented an extensive empirical study of masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most commonly used in the literature. In addition, we implemented six different data configurations from these datasets. The masquerade detection on these data configurations was carried out using two approaches: the first is static and the second is dynamic. The static approach was performed using the DNN and LSTM-RNN models, which were applied to data configurations with static numeric features, while the dynamic approach was performed using the CNN model, which extracted features from each user's command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and was used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models achieved high performance in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static detection. However, the results analyses proved the effectiveness of all used models in masquerade detection, in such a way that they increased Accuracy and Hit as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools that can be used in the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html, the Greenberg dataset upon request from its owner at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme, and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219-228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625-628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.

[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676-3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824-827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951-2959, USA, December 2012.

[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141-146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69-74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659-1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481-488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864-1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 IEEE Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945-1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.

[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267-2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org.

[58] "NumPy," http://www.numpy.org.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io.

[61] M. Abadi, A. Agarwal, P. Barham, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] "TensorFlow," https://www.tensorflow.org.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, JSTOR, vol. 1, no. 6, pp. 80-83, 1945.

[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.


Page 11: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

Security and Communication Networks 11

(9) Construct DNN that is tuned by Hi

(10) Train DNN on Ti

(11) Test DNN on Zi

No

(16) OutputTP FP TN and FN

Yes

End

Start

(1) Input Data configuration D M

(2) Set PSO parameters values

(3) Define Domains for Hyper-parameters

(3) Results Extraction Stage (4) Finishing Stage(1) Initialization Stage (2) Optimization Stage

(6) Execute the proposed PSO-based algorithm

(15) Compute and save TP FP TN and FN for D

(8) Database

(4) ilarr1

(7) Obtain Hi of the user Ui

(5) Create Ti and Zi sets of the user Ui

Hi

(12) Obtain and save TPi FPi TNi andFNi for the user Ui

(14) Is i gt M

(13) ilarri+1

Figure 4 The flowchart of the DNN-experiments

Inputxt

it ctℎt

Outputot

ft

Figure 5 The structure of an LSTM cell [6]

RNN is trained using the back-propagation technique Theproblem is known as gradient vanishing and exploding [42]The gradient vanishing problem occurs when the gradientsignal gets so small over the network which causes learningto become very slow or stop On the other hand the gradientexploding problem occurs when the gradient signal gets solarge in which learning divergesThis problem of the conven-tional RNN limited the use of the RNN to be only in short-termmemory tasks To solve this problem a new architectureof RNN is proposed by Hochreiter and Schmidhuber [43]known as Long Short-Term Memory (LSTM) LSTM uses anew structure called a memory cell that is composed of fourparts which are an input gate a neuron with a self-recurrentconnection a forget gate and the output gateMeanwhile themain goal of using a neuron with a self-recurrent connectionis to record information the aim of using three gates is tocontrol the flow of information from or into the memory cellThe input gate decides if to allow the incoming informationto enter into the memory cell or block it Moreover the forgetgate controls if to pass the previous state of the memory cellto alter the current state of the memory cell or prevent itFinally the output gate determines if to pass the output ofthe memory cell or not Figure 5 shows the structure of anLSTM memory cell Rather than overcoming the problemsof the conventional RNN LSTM model also outperformsthe conventional RNN in terms of performance especially inlong-term memory tasks [5] The LSTM-RNN model can beobtained by replacing every neuron in the hidden layers ofthe RNN to an LSTMmemory cell [6]

In this study we used the LSTM-RNN model to performa static masquerade detection task on all data configurationsAs mentioned in Section 511 there are six data config-urations and each of them will be used in the separate

experiment So we will have six separate LSTM-RNN-experiments each experiment will be on one of the dataconfigurations The methodology of all of these experimentsis the same and as follows for the given data configurationD we firstly prepared all the given data configurationrsquos filesby converting all blocks from text to numerical values andthen normalizing them in [0 1] Next to that for each user119880119894 in D where i=12 M and 119872 is the number of users inD we did the following steps we split the data of 119880119894 into twoindependent sets 119879119894 and 119885119894 which are the training and testsets of the ith user in D respectively The splitting processfollowed the structure of the particular data configurationwhich is described in Section 3 After that we retrieved thestored optimized hyperparameters vector of the ith user (119867119894)from the database which is created in the previous DNN-experiments Then we constructed the RNN model that istuned by119867119894 In order to obtain the LSTM-RNNmodel everyneuron in any of the hidden layers is replaced to an LSTMmemory cell The constructed LSTM-RNN model is trainedon119879119894 and then tested on119885119894 After the test process finished weextracted and saved the outcomes TP119894 FP119894 TN 119894 and FN 119894 ofthe ith user in 119863 Then we proceed to the next user in 119863 todo the same previous steps until the last user in119863 is reachedAfter all users in 119863 are completed we computed the overalloutcomes TP FP TN and FN of the data configuration119863 byusing (3) (4) (5) and (6) respectively Figure 6 depicts theflowchart of the methodology of LSTM-RNN-experiments

52 Dynamic Classification Approach In contrast of staticclassification approach dynamic classification approach doesnot need a ready-to-use dataset with static features [30] Itcovenants directly with raw data sources such as text imagevideo sound and signal files and extracts features from themdynamically The models that use this approach try to learnand represent features in unsupervised manner Then thesemodels train themselves using the extracted features to beable to classify unseen dataThe deep learningmodels fit verywell for this approach because the main objectives of deeplearning models are the strong ability of automatic featureextraction and self-learning Rather than that dynamicclassification models overcome the problem of the lake ofdatasets it performs more efficient than the static classifica-tionmodels Despite these advantages dynamic classificationapproach has also drawbacks Dynamic classification modelsare slower and take a long time to train if compared with

12 Security and Communication Networks

YesNo

Hi

Start

(1) InputData configuration D M

(2) Prepare files of D

(4) Split data of Ui

into Ti and Zi sets

(7) Train LSTM-RNN model on Ti

(8) Test LSTM-RNN model on Zi

End

(5) Database

(6) Construct LSTM-RNN model that is tuned by Hi

(3) ilarr1

(9) Obtain and save TPi FPi TNi andFNi for the user Ui

(10) ilarri+1

(11) Is i gt M

(13) Output TPFP TN and FN

(12) Compute andsave TP FP TN

and FN for D

Figure 6 The flowchart of the LSTM-RNN-experiments

static classification models due to complex deep structure ofthesemodels as well as the huge amount of computations thatare required to execute Furthermore dynamic classificationmodels require a very large amount of input samples to gainhigh accuracy values

In this research we used six data configurations that areimplemented from three textual datasets In order to applydynamic masquerade detection on these data configurationswe need amodel that is able to extract features from the userrsquoscommand text file dynamically and then classify the user intoone of the two classes that will be either a normal user or amasqueraderTherefore we dealwith a text classification taskThe text classification is defined as a task that assigns a pieceof text (a word a sentence or even a document) to one ormore classes according to its content Indeed there are threetypes of text classification namely sentence classificationsentiment analysis and document categorization In sentenceclassification a given sentence should be assigned correctlyto one of possible classes Furthermore sentiment analysisdetermines if a given sentence is a positive negative orneutral towards a specific subject In contrast documentcategorization deals with documents and determines whichclass from a given set of possible classes a document belongsto According to the nature of dynamic classification as well asthe functionality of text classification deep learning modelsare the fittest among the other machine learning models forthese types of classification due to their powerful capability offeatures learning

A wide range of researches have been accomplished inthe literature in the field of text classification using deeplearning models It was started by LeCun et al in 1998 whenthey proposed a special topology of the Convolutional NeuralNetwork (CNN) known as LeNet family and used it in textclassification efficiently [44]Then various studies have beenpublished to introduce text classification algorithms as wellas the factors that impact the performance [45ndash47] In thestudy [48] the CNNmodel is used for sentence classificationtask over a set of text dataset benchmarks A single one-dimensional CNN is proposed to learn a region-based textembedding [49] X Zhang et al introduced a novel character-based multidimensional CNN for text classification taskswith competitive results [50] In the research [51] a newhierarchal approach calledHierarchal Deep Learning for Text

classification (HDLTex) is proposed and three deep struc-tures which are DNN RNN and CNN are used A recurrentconvolutional network model is introduced [52] for textclassification and high results are obtained on documents-level datasets A novel LSTM-based model is introduced andused for text classification withmultitask learning framework[53] The study [54] proposed a new model called hierarchalattention network for document classification and is testedon six large document-level datasets with good results Acharacter-level text representations approach is proposed andtested for text classification tasks using deep CNN [55]As noticed the CNN is the mostly used deep learningmodel for text classification tasks So we decided to use theCNN to perform dynamic masquerade detection on all dataconfigurations The following subsection reviews the CNNand explains the structure of the used CNN model and themethodology of our CNN-experiments

521 Convolutional Neural Networks The ConvolutionalNeural Network (CNN) is a deep learning model whichis biological-inspired from the animal visual cortex TheCNN can be considered as a special type of the traditionalfeed-forwardArtificial Neural NetworkThemajor differencebetween ANN and CNN is that instead of the fully connectedarchitecture of ANN the individual neurons in CNN areconnected to subregions of the input field The neurons ofthe CNN are arranged in such a way they are tilled to coverthe entire input field The typical CNN consists of five maincomponents namely an input layer the convolutional layerthe pooling layer the fully connected layer and an outputlayer The input layer is where the input data is enteredinto the CNN The first convolutional layer in the CNNconsists of individual neurons that each of them is connectedto a small subset of the input field The neurons in thenext convolutional layers connect only to a subset of theirpreceding pooling layerrsquos outputMoreover the convolutionallayers in the CNN use a set of learnable kernels or filters thateach filter is applied to the specified subset of their precedinglayerrsquos output These filters calculate feature maps in whicheach feature map shares the same weights The poolinglayer also known as a subsampling layer is a nonlineardownsampling function that condenses subsets of its inputThemain goal of using pooling layers in the CNN is to reduce

Security and Communication Networks 13

Userrsquos Command Text Files

Quantization

Input Layer

Convolutional layer

C1 features map P1 features map

Max-Pooling layer

C2 P2 C6 P6

Fully-Connected dropout layers

2048 sigmoid neurons

2048 sigmoid neurons 2

softmaxneurons

Outputdense layer

0 (Normal)1 (Masquerader)

Figure 7 The architecture of the used CNNmodel

the complexity and computations by reducing the size of theirpreceding layerrsquos output There are many pooling nonlinearfunctions that can be used but among them max-poolingis the mostly used which selects the maximum value in thegiven pooling window Typically each convolutional layer inthe CNN is followed by a max-pooling layer The CNN hasone or more stacked convolutional layer and max-poolinglayer pairs to extract features from the entire input and thenmap these features to their next fully connected layerThe toplayers of the CNN are one or more of fully connected layerswhich are similar to hidden layers in the DNN This meansthat neurons of the fully connected layers are connected to allneurons of the preceding layer The output layer is the finallayer in the CNN and is responsible for reporting the outputvalue of the CNN Finally the back-propagation algorithm isusually used to train CNNs via Stochastic Gradient Decent(SGD) to adjust the weights of the fully connected layers [56]There are several variant structures of CNN that are proposedin the literature but LeNet structure which is proposed byLeCun et al [44] is themost common approach used inmanyapplications of computer vision and text classification

Regarding its stability and high efficiency in text classification, we selected the CNN model proposed in [50] to perform a dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place: the used model encodes all letters in the input text file using a one-hot representation from a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the type dropout layer with dropout probability equal to 0.5. In addition to that, the two fully connected layers in the used CNN model have a Sigmoid nonlinear activation function, and they have the same size of 2048 neurons each. The output layer in the used CNN model is of the type dense layer; it has a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the used CNN model: learning rate = 0.01, epochs = 30, and batch size = 64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].
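To make the character quantization and the layer stack concrete, the following is a minimal Keras sketch of a character-level CNN of this shape. The 70-character alphabet follows Zhang et al. [50]; the input truncation length (1014), the filter count, the kernel size, and the pool size are our assumptions, since the text above does not report them.

```python
import string
import numpy as np
from tensorflow.keras import layers, models, optimizers

# 70-character alphabet of Zhang et al. [50]: 26 letters, 10 digits,
# 33 punctuation marks (the hyphen appears twice in the original), newline.
ALPHABET = string.ascii_lowercase + string.digits + "-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}" + "\n"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
MAX_LEN = 1014  # assumed truncation length, borrowed from [50]

def quantize(text, max_len=MAX_LEN):
    """One-hot encode a command text file as a (max_len, 70) matrix;
    characters outside the alphabet become all-zero rows, longer texts
    are truncated, and shorter ones are zero-padded."""
    encoded = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:max_len]):
        idx = CHAR_INDEX.get(char)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded

def build_char_cnn(filters=256, kernel_size=3):
    """Six convolution/max-pooling pairs (ReLU), two fully connected
    dropout layers of 2048 sigmoid neurons (dropout 0.5), and a
    two-neuron softmax output, trained via SGD with the reported
    learning rate of 0.01; filter and kernel settings are assumptions."""
    model = models.Sequential()
    model.add(layers.Conv1D(filters, kernel_size, activation="relu", padding="same",
                            input_shape=(MAX_LEN, len(ALPHABET))))
    model.add(layers.MaxPooling1D(pool_size=2))
    for _ in range(5):  # the remaining five convolution/max-pooling pairs
        model.add(layers.Conv1D(filters, kernel_size, activation="relu", padding="same"))
        model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.Flatten())
    for _ in range(2):  # the two fully connected dropout layers
        model.add(layers.Dense(2048, activation="sigmoid"))
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(2, activation="softmax"))  # 0 = normal, 1 = masquerader
    model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

Training then uses the reported settings via model.fit(..., epochs=30, batch_size=64).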

In our work, we used a CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate CNN-experiments, and each experiment is on one of the data configurations. The methodology of all of these experiments is the same and is as follows: for the given data configuration D, we first prepared all the given data configuration's text files such that each file represents the training and test sets of a user in D. Next, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps: we split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we also moved each block in the training and test sets of the user U_i to a separate text file. This means that each of the training and test sets of the user U_i consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then we proceeded to the next user in D to do the same previous steps, until the last user in D was reached. After all users in D were completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.
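The per-user loop of this methodology can be sketched as follows, reusing quantize and build_char_cnn from the sketch above; the users iterable and its file/label layout are illustrative assumptions rather than artifacts of our implementation.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

def run_cnn_experiment(users):
    """One CNN-experiment on a data configuration D: `users` yields, for
    each user U_i, the block-level text files of T_i and Z_i together
    with their 0/1 labels (0 = normal block, 1 = masquerader block)."""
    TP = FP = TN = FN = 0
    for train_files, train_labels, test_files, test_labels in users:
        X_train = np.stack([quantize(open(f).read()) for f in train_files])
        X_test = np.stack([quantize(open(f).read()) for f in test_files])
        model = build_char_cnn()                        # a fresh model per user
        model.fit(X_train, to_categorical(train_labels, 2),
                  epochs=30, batch_size=64, verbose=0)  # train on T_i
        pred = model.predict(X_test).argmax(axis=1)     # test on Z_i
        y = np.asarray(test_labels)
        TP += int(((pred == 1) & (y == 1)).sum())  # per-user outcomes are
        FP += int(((pred == 1) & (y == 0)).sum())  # accumulated into the
        TN += int(((pred == 0) & (y == 0)).sum())  # overall TP, FP, TN, FN
        FN += int(((pred == 0) & (y == 1)).sum())  # of D, as in (3)-(6)
    return TP, FP, TN, FN
```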

6. Results and Discussion

We carried out three major empirical experiments, which are DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.


[Figure 8: The flowchart of the CNN-experiments. (1) Input the data configuration D and M; (2) prepare the text files of D; (3) i <- 1; (4) split the data of U_i into the T_i and Z_i text sets; (5) move each block in T_i and Z_i to a separate text file; (6) construct the used CNN model; (7) train the CNN model on T_i; (8) test the CNN model on Z_i; (9) obtain and save TP_i, FP_i, TN_i, and FN_i for the user U_i; (10) i <- i+1; (11) if i > M, then (12) compute and save TP, FP, TN, and FN for D and (13) output TP, FP, TN, and FN; otherwise repeat from step (4).]

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class  | Predicted: Normal User | Predicted: Masquerader
Normal User   | TN                     | FP
Masquerader   | FN                     | TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, and CNN) were constructed, trained, and tested based on Keras [59, 60] with a TensorFlow 1.6 [61, 62] backend over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition to that, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes for that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The used evaluation metrics' definitions and their corresponding equations are as follows (a short Python helper that computes all twelve appears after the list):

(i) Accuracy shows the rate of true detection over all test sets:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (7)

(ii) Precision shows the rate of correctly classified masqueraders from all blocks in the test set that are classified as masqueraders:

Precision = TP / (TP + FP) (8)

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

Recall = TP / (TP + FN) (9)

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

F1-Score = 2 / (1/P + 1/R) (10)

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

Hit Rate = TP / (TP + FN) (11)

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 - Hit); i.e., it shows the rate of masquerade blocks that are misclassified as a normal user from all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

Miss Rate = FN / (FN + TP) (12)


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

False Alarm Rate = FP / (FP + TN) (13)

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

Cost = MR + 6 × FAR (14)

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (the base rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

BDR = P(I | A) = [P(I) × P(A | I)] / [P(I) × P(A | I) + P(I*) × P(A | I*)] (15)

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior really indicates a normal user [65]. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

BTNR = P(I* | A*) = [P(I*) × P(A* | I*)] / [P(I*) × P(A* | I*) + P(I) × P(A* | I)] (16)

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 - FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold, where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

g-mean = sqrt( (TP × TN) / ((TP + FN) × (TN + FP)) ) (17)

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of -1 to 1, where -1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

MCC = (TP × TN - FP × FN) / sqrt( (TP + FN) × (TP + FP) × (TN + FP) × (TN + FN) ) (18)
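As mentioned above, the twelve metrics all derive from the four outcomes, so they can be computed in one place; the helper below is a small sketch that evaluates equations (7) to (18) and returns fractions rather than percentages (it assumes nonzero denominators).

```python
import math

def evaluation_metrics(TP, FP, TN, FN):
    """Evaluate equations (7)-(18) from the overall outcomes of a data
    configuration; base rates P(I) and P(I*) come from the test set."""
    P = TP / (TP + FP)                      # Precision (8)
    R = TP / (TP + FN)                      # Recall (9) = Hit Rate (11)
    miss = FN / (FN + TP)                   # Miss Rate (12)
    far = FP / (FP + TN)                    # False Alarm Rate (13)
    p_i = (TP + FN) / (TP + FP + TN + FN)   # base rate of masquerader blocks
    p_n = 1.0 - p_i                         # base rate of normal blocks
    return {
        "accuracy": (TP + TN) / (TP + TN + FP + FN),              # (7)
        "precision": P, "recall": R, "f1": 2 / (1 / P + 1 / R),   # (8)-(10)
        "hit": R, "miss": miss, "far": far,                       # (11)-(13)
        "cost": miss + 6 * far,                                   # (14)
        "bdr": p_i * R / (p_i * R + p_n * far),                   # (15)
        "btnr": p_n * (1 - far) / (p_n * (1 - far) + p_i * miss), # (16)
        "g_mean": math.sqrt(TP * TN / ((TP + FN) * (TN + FP))),   # (17)
        "mcc": (TP * TN - FP * FN) / math.sqrt(                   # (18)
            (TP + FN) * (TP + FP) * (TN + FP) * (TN + FN)),
    }
```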

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model to detect masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1 as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. The rows labeled DNN and LSTM-RNN in Table 7 show results of the static masquerade detection by using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN in Table 7 show results of the dynamic masquerade detection by using the CNN model. Furthermore, in the original table the bold rows represent the best results within the same data configuration, whereas the underlined values are the best over all data configurations.

Table 7: The results of our experiments (all values in %).

Dataset           | Data Configuration  | Model    | Accuracy | Precision | Recall | F1-Score | Hit   | Miss  | FAR  | Cost  | BDR   | BTNR  | g-mean | MCC
SEA Dataset       | SEA                 | DNN      | 98.08    | 76.26     | 84.85  | 80.33    | 84.85 | 15.15 | 1.28 | 22.83 | 76.25 | 99.26 | 91.52  | 79.45
SEA Dataset       | SEA                 | LSTM-RNN | 98.52    | 82.30     | 86.58  | 84.39    | 86.58 | 13.42 | 0.90 | 18.83 | 82.33 | 99.34 | 92.63  | 83.64
SEA Dataset       | SEA                 | CNN      | 98.84    | 87.77     | 87.01  | 87.39    | 87.01 | 12.99 | 0.59 | 16.51 | 87.72 | 99.37 | 93.0   | 86.78
SEA Dataset       | SEA 1v49            | DNN      | 96.54    | 99.98     | 96.43  | 98.17    | 96.43 | 3.57  | 0.48 | 6.47  | 99.98 | 52.04 | 97.96  | 70.64
SEA Dataset       | SEA 1v49            | LSTM-RNN | 97.86    | 99.98     | 97.79  | 98.87    | 97.79 | 2.21  | 0.38 | 4.48  | 99.98 | 63.70 | 98.7   | 78.74
SEA Dataset       | SEA 1v49            | CNN      | 98.78    | 99.99     | 98.74  | 99.36    | 98.74 | 1.26  | 0.19 | 2.40  | 99.99 | 75.51 | 99.27  | 86.22
Greenberg Dataset | Greenberg Truncated | DNN      | 93.97    | 92.23     | 80.67  | 86.06    | 80.67 | 19.33 | 2.04 | 31.57 | 92.22 | 94.41 | 88.89  | 82.53
Greenberg Dataset | Greenberg Truncated | LSTM-RNN | 94.72    | 94.88     | 81.53  | 87.70    | 81.53 | 18.47 | 1.32 | 26.39 | 94.87 | 94.68 | 89.7   | 84.76
Greenberg Dataset | Greenberg Truncated | CNN      | 95.43    | 96.16     | 83.53  | 89.40    | 83.53 | 16.47 | 1.0  | 22.47 | 96.16 | 95.24 | 90.94  | 86.86
Greenberg Dataset | Greenberg Enriched  | DNN      | 97.57    | 96.92     | 92.40  | 94.61    | 92.40 | 7.60  | 0.88 | 12.88 | 96.92 | 97.75 | 95.7   | 93.08
Greenberg Dataset | Greenberg Enriched  | LSTM-RNN | 97.98    | 97.57     | 93.60  | 95.54    | 93.60 | 6.40  | 0.70 | 10.60 | 97.56 | 98.10 | 96.41  | 94.28
Greenberg Dataset | Greenberg Enriched  | CNN      | 98.60    | 98.55     | 95.33  | 96.92    | 95.33 | 4.67  | 0.42 | 7.19  | 98.55 | 98.61 | 97.43  | 96.03
PU Dataset        | PU Truncated        | DNN      | 81.0     | 99.59     | 78.61  | 87.86    | 78.61 | 21.39 | 2.25 | 34.89 | 99.59 | 39.49 | 87.66  | 54.63
PU Dataset        | PU Truncated        | LSTM-RNN | 82.19    | 99.69     | 79.89  | 88.70    | 79.89 | 20.11 | 1.75 | 30.61 | 99.68 | 41.10 | 88.6   | 56.46
PU Dataset        | PU Truncated        | CNN      | 83.75    | 99.74     | 81.64  | 89.79    | 81.64 | 18.36 | 1.50 | 27.36 | 99.73 | 43.38 | 89.68  | 58.79
PU Dataset        | PU Enriched         | DNN      | 90.44    | 99.84     | 89.21  | 94.23    | 89.21 | 10.79 | 1.0  | 16.79 | 99.84 | 56.72 | 93.98  | 70.64
PU Dataset        | PU Enriched         | LSTM-RNN | 91.31    | 99.88     | 90.18  | 94.78    | 90.18 | 9.82  | 0.75 | 14.32 | 99.88 | 59.08 | 94.61  | 72.61
PU Dataset        | PU Enriched         | CNN      | 93.75    | 99.92     | 92.93  | 96.30    | 92.93 | 7.07  | 0.50 | 10.07 | 99.92 | 66.78 | 96.16  | 78.52

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of the DNN hyperparameters so as to maximize the accuracy, which means that the sum of the TP and TN outcomes will be increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN will definitely lead to an increase of the Hit value as well as a decrease of the FAR value. Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than the corresponding values of the SEA data configuration, the Hit values are dramatically increased in SEA 1v49 for all models, by 10-14% over those in the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49, compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models in terms of all used evaluation metrics over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened in the PU dataset, where its PU Enriched data configuration has better results for all models than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, for all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Due to that, enriched data configurations help models to build the user's behavior profile more accurately than truncated data configurations. For all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted, for all used models, from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a) through 9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a) through 10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset as well as over all datasets.

[Figure 9: Evaluation metrics comparison between the models on the data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient.]

By taking a closer look at Figures 9 and 10, we can notice the stability of the deep learning models, in the sense that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain that, we will discuss the obtained results from the perspective of static and dynamic masquerade detection techniques. We used the DNN and LSTM-RNN models to perform a static masquerade detection task on the data configurations with static numeric features. The DNN, as well as the LSTM-RNN, is supported with a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Owing to this fact, our DNN and LSTM-RNN models output the best masquerade detection outcomes they can reach for every user in the particular data configuration. Accordingly, their performance is enhanced significantly on that particular data configuration. This enhancement of their performance is also affected by the structure of the data configuration, which differs from one to another. In any case, LSTM-RNN performed better than DNN in terms of all used evaluation metrics over all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy, and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is the command text files of each user in the particular data configuration. The obtained results show clearly that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relations between the user's individual commands can be recognized. The extracted features are then presented to its fully connected layers to train itself to build the user's normal profile, which is used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, with Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, dynamic masquerade detection is better than the static masquerade detection technique. This gives the impression that the dynamic masquerade detection technique is the best choice for masquerade detection on UNIX command line-based datasets, due to the fact that these datasets are originally textual datasets, and converting them to static numeric datasets may lose a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used deep learning models got high g-mean percentages for all data configurations. The same thing happened with the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

[Figure 10: Evaluation metrics comparison for the average performance of the models on the datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient.]

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume the data comes from a particular distribution. In our case, we have three repeated treatments (k=3), one for each of the used deep learning models, and six subjects (N=6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k=2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N=6) in every treatment and three paired groups, namely, p1=(DNN, LSTM-RNN), p2=(DNN, CNN), and p3=(LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α=0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.

Table 8: The results of statistical tests.

Measurement | FS | FC | W (p1) | P value (p1) | W (p2) | P value (p2) | W (p3) | P value (p3)
TP          | 12 | 7  | 0      | 0.0025       | 0      | 0.0025       | 0      | 0.0025
FP          | 12 | 7  | 0      | 0.0025       | 0      | 0.0025       | 0      | 0.0025
TN          | 12 | 7  | 0      | 0.0025       | 0      | 0.0025       | 0      | 0.0025
FN          | 12 | 7  | 0      | 0.0025       | 0      | 0.0025       | 0      | 0.0025
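For readers who want to reproduce this step, a hedged sketch using SciPy follows. The score lists are placeholders standing in for the per-configuration outcome counts; note that scipy reports a p-value for the Friedman test, whereas the text compares the statistic FS against the critical value FC.

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Placeholder scores: one value per data configuration (N=6 subjects) and
# one list per model (k=3 treatments). Any scores ranked CNN > LSTM-RNN >
# DNN on every configuration reproduce the Friedman statistic FS = 12.
dnn      = [231, 118100, 920, 1050, 380, 430]
lstm_rnn = [236, 119800, 930, 1065, 386, 435]
cnn      = [240, 120900, 952, 1084, 394, 448]

fs, p = friedmanchisquare(dnn, lstm_rnn, cnn)
print(f"Friedman: FS={fs:.2f}, p={p:.4f}")  # FS=12.00, p close to 0.0025

# Pairwise Wilcoxon signed-rank tests for the groups p1, p2, and p3
for name, (a, b) in {"p1": (dnn, lstm_rnn), "p2": (dnn, cnn),
                     "p3": (lstm_rnn, cnn)}.items():
    w, p = wilcoxon(a, b)
    print(f"{name}: W={w}, p={p:.4f}")
```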

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS > FC. This means that the scores of the used deep learning models for each measurement are different. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different. Finally, the reason for the identical results across all measurements is that the models, in the order (CNN, LSTM-RNN, DNN), have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

[Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations.]
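The reported CD value can be checked against Demšar's formula CD = q_alpha × sqrt(k(k+1)/(6N)) with k=3 models and N=6 data configurations; the critical value q_alpha is approximately 2.343 for three classifiers at alpha=0.05, as in the sketch below.

```python
import math

k, N = 3, 6        # models (treatments) and data configurations (subjects)
q_alpha = 2.343    # Nemenyi critical value for k=3 classifiers, alpha=0.05
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(f"CD = {cd:.4f}")  # about 1.353, matching the reported 1.3533 up to rounding
```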

Figures 12(a) through 12(e) show a comparison between the performance of the traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN as well as the feature learning mechanism used with CNN. In addition to that, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased the Hit percentages by 2-10% and decreased the FAR percentages by 1-10% compared with the traditional machine learning models in most cases.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, which means that 50% of the performance is achieved. The top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.

[Figure 12: Models performance comparison for each data configuration, contrasting the Hit and FAR of the literature baselines (Naive Bayes, Conditional Naive Bayes, SVM, HMM, and tree-based methods) with DNN, LSTM-RNN, and CNN: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched.]

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, have the most effective masquerade detection performance over all data configurations. However, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure to compare quantitatively between various ROC curves [73]. The AUC value of a ROC curve should be between 0 and 1; the ideal classifier will have an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the used three deep learning models, which are plotted in Figure 13. We can clearly notice that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.

Table 9: AUC values of the ROC curves of the used models.

Model    | AUC
DNN      | 0.9246
LSTM-RNN | 0.9385
CNN      | 0.9617

[Figure 13: ROC curves of the average performance of the used models (DNN, LSTM-RNN, and CNN) over all data configurations.]
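A minimal sketch of how such a curve and its AUC can be produced with scikit-learn; y_true and y_score are placeholders standing in for the collected block labels and the model's masquerader-class scores.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Placeholder labels and scores; substitute the test-set block labels and
# the model's masquerader-class probabilities gathered over all users.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.05, 0.65]

fpr, tpr, _ = roc_curve(y_true, y_score)
print(f"AUC = {auc(fpr, tpr):.4f}")

plt.plot(fpr, tpr, label="model ROC")
plt.plot([0, 1], [0, 1], linestyle="--", label="reference line")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```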

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Even though various research studies have focused on masquerade detection for more than a decade, a deep study in that field utilizing deep learning models is seldom found. In this paper, we presented an extensive empirical study for masquerade detection using DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most commonly used in the literature. In addition to that, we implemented six different data configurations from these datasets. The masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed by using the DNN and LSTM-RNN models, which are applied on data configurations with static numeric features, whereas the dynamic approach is performed by using a CNN model that extracts features from users' command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models performed well in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static masquerade detection. Overall, the results analyses proved the effectiveness of all used models in masquerade detection, in that they increased Accuracy and Hit as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html, the Greenberg dataset upon request from its owner at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme, and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, "A study on masquerade detection," 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informática, Universidad de la República de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service (PlatCon 2016), Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN 2002), pp. 219-228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An svm-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security (ICCIAS 2006), pp. 625-628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.

[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM 2007), pp. 3676-3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), pp. 824-827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012 (NIPS 2012), pp. 2951-2959, USA, December 2012.

[26] O. Ahmed Abdalla, A. Osman Elfaki, and Y. Mohammed AlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2016), pp. 141-146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD 2016), pp. 69-74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference (GECCO 2014), pp. 1451-1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation (CEC 2017), pp. 1659-1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference (GECCO 2017), pp. 481-488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion (GECCO 2017), pp. 1864-1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), vol. 3, pp. 1945-1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.

[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267-2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org.

[58] "NumPy," http://www.numpy.org.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io.

[61] M. Abadi, A. Agarwal, P. Barham, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] "TensorFlow," https://www.tensorflow.org.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in Proceedings of the International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 179-186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.

[72] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.


Page 12: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

12 Security and Communication Networks

YesNo

Hi

Start

(1) InputData configuration D M

(2) Prepare files of D

(4) Split data of Ui

into Ti and Zi sets

(7) Train LSTM-RNN model on Ti

(8) Test LSTM-RNN model on Zi

End

(5) Database

(6) Construct LSTM-RNN model that is tuned by Hi

(3) ilarr1

(9) Obtain and save TPi FPi TNi andFNi for the user Ui

(10) ilarri+1

(11) Is i gt M

(13) Output TPFP TN and FN

(12) Compute andsave TP FP TN

and FN for D

Figure 6 The flowchart of the LSTM-RNN-experiments

static classification models due to complex deep structure ofthesemodels as well as the huge amount of computations thatare required to execute Furthermore dynamic classificationmodels require a very large amount of input samples to gainhigh accuracy values

In this research we used six data configurations that areimplemented from three textual datasets In order to applydynamic masquerade detection on these data configurationswe need amodel that is able to extract features from the userrsquoscommand text file dynamically and then classify the user intoone of the two classes that will be either a normal user or amasqueraderTherefore we dealwith a text classification taskThe text classification is defined as a task that assigns a pieceof text (a word a sentence or even a document) to one ormore classes according to its content Indeed there are threetypes of text classification namely sentence classificationsentiment analysis and document categorization In sentenceclassification a given sentence should be assigned correctlyto one of possible classes Furthermore sentiment analysisdetermines if a given sentence is a positive negative orneutral towards a specific subject In contrast documentcategorization deals with documents and determines whichclass from a given set of possible classes a document belongsto According to the nature of dynamic classification as well asthe functionality of text classification deep learning modelsare the fittest among the other machine learning models forthese types of classification due to their powerful capability offeatures learning

A wide range of researches have been accomplished inthe literature in the field of text classification using deeplearning models It was started by LeCun et al in 1998 whenthey proposed a special topology of the Convolutional NeuralNetwork (CNN) known as LeNet family and used it in textclassification efficiently [44]Then various studies have beenpublished to introduce text classification algorithms as wellas the factors that impact the performance [45ndash47] In thestudy [48] the CNNmodel is used for sentence classificationtask over a set of text dataset benchmarks A single one-dimensional CNN is proposed to learn a region-based textembedding [49] X Zhang et al introduced a novel character-based multidimensional CNN for text classification taskswith competitive results [50] In the research [51] a newhierarchal approach calledHierarchal Deep Learning for Text

classification (HDLTex) is proposed and three deep struc-tures which are DNN RNN and CNN are used A recurrentconvolutional network model is introduced [52] for textclassification and high results are obtained on documents-level datasets A novel LSTM-based model is introduced andused for text classification withmultitask learning framework[53] The study [54] proposed a new model called hierarchalattention network for document classification and is testedon six large document-level datasets with good results Acharacter-level text representations approach is proposed andtested for text classification tasks using deep CNN [55]As noticed the CNN is the mostly used deep learningmodel for text classification tasks So we decided to use theCNN to perform dynamic masquerade detection on all dataconfigurations The following subsection reviews the CNNand explains the structure of the used CNN model and themethodology of our CNN-experiments

521 Convolutional Neural Networks The ConvolutionalNeural Network (CNN) is a deep learning model whichis biological-inspired from the animal visual cortex TheCNN can be considered as a special type of the traditionalfeed-forwardArtificial Neural NetworkThemajor differencebetween ANN and CNN is that instead of the fully connectedarchitecture of ANN the individual neurons in CNN areconnected to subregions of the input field The neurons ofthe CNN are arranged in such a way they are tilled to coverthe entire input field The typical CNN consists of five maincomponents namely an input layer the convolutional layerthe pooling layer the fully connected layer and an outputlayer The input layer is where the input data is enteredinto the CNN The first convolutional layer in the CNNconsists of individual neurons that each of them is connectedto a small subset of the input field The neurons in thenext convolutional layers connect only to a subset of theirpreceding pooling layerrsquos outputMoreover the convolutionallayers in the CNN use a set of learnable kernels or filters thateach filter is applied to the specified subset of their precedinglayerrsquos output These filters calculate feature maps in whicheach feature map shares the same weights The poolinglayer also known as a subsampling layer is a nonlineardownsampling function that condenses subsets of its inputThemain goal of using pooling layers in the CNN is to reduce

Security and Communication Networks 13

Userrsquos Command Text Files

Quantization

Input Layer

Convolutional layer

C1 features map P1 features map

Max-Pooling layer

C2 P2 C6 P6

Fully-Connected dropout layers

2048 sigmoid neurons

2048 sigmoid neurons 2

softmaxneurons

Outputdense layer

0 (Normal)1 (Masquerader)

Figure 7 The architecture of the used CNNmodel

the complexity and computations by reducing the size of theirpreceding layerrsquos output There are many pooling nonlinearfunctions that can be used but among them max-poolingis the mostly used which selects the maximum value in thegiven pooling window Typically each convolutional layer inthe CNN is followed by a max-pooling layer The CNN hasone or more stacked convolutional layer and max-poolinglayer pairs to extract features from the entire input and thenmap these features to their next fully connected layerThe toplayers of the CNN are one or more of fully connected layerswhich are similar to hidden layers in the DNN This meansthat neurons of the fully connected layers are connected to allneurons of the preceding layer The output layer is the finallayer in the CNN and is responsible for reporting the outputvalue of the CNN Finally the back-propagation algorithm isusually used to train CNNs via Stochastic Gradient Decent(SGD) to adjust the weights of the fully connected layers [56]There are several variant structures of CNN that are proposedin the literature but LeNet structure which is proposed byLeCun et al [44] is themost common approach used inmanyapplications of computer vision and text classification

Regarding its stability and high efficiency in text classification, we selected the CNN model proposed in [50] to perform dynamic masquerade detection on all data configurations. The used model is a character-level CNN that takes a text file as input and outputs the classification score (0 if the input text file is related to a normal user, or 1 otherwise). The used CNN model is from the LeNet family and consists of an input layer, followed by six convolution and max-pooling pairs, followed by two fully connected layers, and finally followed by an output layer. In the input layer, the text quantization process takes place, where the used model encodes all letters in the input text file using a one-hot representation from a 70-character alphabet. All the convolutional layers in the used CNN model have a ReLU nonlinear activation function. The two fully connected layers in the used CNN model are of the type dropout layer with dropout probability equal to 0.5. In addition to that, the two fully connected layers in the used CNN model have a Sigmoid nonlinear activation function, and each has the same size of 2048 neurons. The output layer in the used CNN model is of the type dense layer; it has a softmax activation function and a size of two neurons. The used CNN model is trained by the back-propagation algorithm via SGD. Finally, we set the following parameters for the

used CNN model: learning rate = 0.01, epochs = 30, and batch size = 64. These values were obtained experimentally by performing a grid search to find the best possible values of these parameters. Figure 7 shows the architecture of the used CNN model and is reproduced from Zhang et al. (2015) [under the Creative Commons Attribution License/public domain].
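A minimal Keras sketch consistent with this description is given below. The 70-character one-hot input, the six convolution/max-pooling pairs, the two 2048-neuron sigmoid dropout layers, the 2-neuron softmax output, and the SGD learning rate come from the text; the input length, filter counts, kernel sizes, and pool sizes are assumptions filled in for illustration (loosely following Zhang et al. [50]), and the code uses the modern tf.keras API rather than the Keras/TensorFlow 1.6 versions cited later:

```python
from tensorflow.keras import layers, models, optimizers

ALPHABET = 70    # one-hot alphabet size from the paper
MAX_LEN = 1014   # input length in characters (an assumption)

model = models.Sequential()
model.add(layers.InputLayer(input_shape=(MAX_LEN, ALPHABET)))
# Six convolution/max-pooling pairs (C1/P1 ... C6/P6); filter counts,
# kernel sizes, and pool sizes are illustrative assumptions.
for kernel, pool in [(7, 3), (7, 3), (3, 2), (3, 2), (3, 2), (3, 2)]:
    model.add(layers.Conv1D(256, kernel, activation='relu'))
    model.add(layers.MaxPooling1D(pool))
model.add(layers.Flatten())
# Two fully connected dropout layers of 2048 sigmoid neurons each.
for _ in range(2):
    model.add(layers.Dense(2048, activation='sigmoid'))
    model.add(layers.Dropout(0.5))
# Output dense layer: 2 softmax neurons (0 = normal, 1 = masquerader).
model.add(layers.Dense(2, activation='softmax'))

model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
              loss='categorical_crossentropy', metrics=['accuracy'])
```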

In our work, we used a CNN model to perform a dynamic masquerade detection task on all data configurations. As mentioned in Section 5.1.1, there are six data configurations, and each of them is used in a separate experiment. So we have six separate CNN-experiments, and each experiment is performed on one of the data configurations. The methodology of all of these experiments is the same and is as follows. For the given data configuration D, we first prepared all of the given data configuration's text files such that each file represents the training and test sets of a user in D. Next, for each user U_i in D, where i = 1, 2, ..., M and M is the number of users in D, we did the following steps. We split the data of U_i into two independent sets, T_i and Z_i, which are the training and test sets of the ith user in D, respectively. The splitting process followed the structure of the particular data configuration, which is described in Section 3. Furthermore, we also moved each block in the training and test sets of the user U_i to a separate text file. This means that each of the training and test sets of the user U_i consists of a specified number of text files, in which each text file contains one block of UNIX commands. After that, we constructed the used CNN model. The constructed CNN model is trained on T_i and then tested on Z_i. After the test process finished, we extracted and saved the outcomes TP_i, FP_i, TN_i, and FN_i of the ith user in D. Then we proceeded to the next user in D and repeated the same steps until the last user in D was reached. After all users in D were completed, we computed the overall outcomes TP, FP, TN, and FN of the data configuration D by using (3), (4), (5), and (6), respectively. Figure 8 depicts the flowchart of the methodology of the CNN-experiments.
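The sketch below restates that per-user methodology in Python; load_user_blocks, build_cnn, and evaluate are hypothetical helpers standing in for the data preparation, model construction, and testing steps described above:

```python
def run_cnn_experiment(data_configuration):
    """Per-user train/test loop for one data configuration D (a sketch)."""
    TP = FP = TN = FN = 0
    for user in data_configuration.users:          # i = 1, 2, ..., M
        T_i, Z_i = load_user_blocks(user)          # hypothetical: block-per-file train/test sets
        model = build_cnn()                        # hypothetical: construct the used CNN model
        model.fit(T_i.inputs, T_i.labels, epochs=30, batch_size=64)
        tp, fp, tn, fn = evaluate(model, Z_i)      # hypothetical: outcomes of the ith user
        TP, FP, TN, FN = TP + tp, FP + fp, TN + tn, FN + fn   # overall outcomes
    return TP, FP, TN, FN                          # aggregated per Eqs. (3)-(6)
```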

6. Results and Discussion

We carried out three major empirical experiments, which are DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. Each of them consists of six separate subexperiments, where each subexperiment is performed on one of the data configurations: SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.

[Figure 8: The flowchart of the CNN-experiments. Steps: (1) input data configuration D and user count M; (2) prepare the text files of D; (3) i ← 1; (4) split the data of U_i into the T_i and Z_i text sets; (5) move each block in T_i and Z_i to a separate text file; (6) construct the used CNN model; (7) train the CNN model on T_i; (8) test the CNN model on Z_i; (9) obtain and save TP_i, FP_i, TN_i, and FN_i for the user U_i; (10) i ← i + 1; (11) if i > M continue, else return to step (4); (12) compute and save TP, FP, TN, and FN for D; (13) output TP, FP, TN, and FN.]

Table 6: The confusion matrix of the masquerade detection outcomes.

Actual Class     Predicted Class
                 Normal User    Masquerader
Normal User      TN             FP
Masquerader      FN             TP

Basically, our PSO-based DNN hyperparameters selection algorithm was implemented in Python 3.6.4 [57] with NumPy [58]. Moreover, all models (DNN, LSTM-RNN, CNN) were constructed, trained, and tested based on Keras [59, 60] with TensorFlow 1.6 [61, 62] as a backend over CUDA 9.0 [63] and cuDNN 7.0 [64]. In addition to that, all experiments were performed on a workstation with an Intel Core i7 CPU (3.8 GHz, 16 MB Cache), 16 GB of RAM, and the Windows 10 operating system. In order to accelerate the computations in all experiments, we also used GPU-accelerated computing with an NVIDIA Tesla K20 GPU (5 GB GDDR5). The experimental environment is processed in 64-bit mode.

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. An FP occurs when a good user is misclassified as a masquerader. In contrast, an FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes for that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.
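As a quick illustration, the four outcomes can be extracted from a vector of predictions with scikit-learn (an assumption about tooling; the labels below are stand-ins, with 0 = normal user and 1 = masquerader):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 1, 0]   # actual classes (stand-in values)
y_pred = [0, 1, 1, 1, 0, 0]   # predicted classes (stand-in values)

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels {0, 1}
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)         # -> 2 1 2 1
```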

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics that are used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics that are usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The definitions of the used evaluation metrics and their corresponding equations are as follows.

(i) Accuracy shows the rate of true detection over all test sets:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

(ii) Precision shows the rate of correctly classified masqueraders over all blocks in the test set that are classified as masqueraders:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (8)$$

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

$$\text{Recall} = \frac{TP}{TP + FN} \quad (9)$$

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

$$F1\text{-}Score = \frac{2}{\dfrac{1}{P} + \dfrac{1}{R}} \quad (10)$$

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

$$\text{Hit Rate} = \frac{TP}{TP + FN} \quad (11)$$

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 − Hit); i.e., it shows the rate of masquerade blocks that are misclassified as a normal user over all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

$$\text{Miss Rate} = \frac{FN}{FN + TP} \quad (12)$$


(vii) False Alarm Rate (FAR) gives information about the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

$$\text{False Alarm Rate} = \frac{FP}{FP + TN} \quad (13)$$

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

$$\text{Cost} = MR + 6 \times FAR \quad (14)$$

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (Base-Rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

$$BDR = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^{*}) \times P(A \mid I^{*})} \quad (15)$$

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior really indicates a normal user [65]. Let I and I* denote a masquerade and a normal behavior, respectively. Moreover, let A and A* denote the predicted masquerade and normal behavior, respectively. Then BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

$$BTNR = P(I^{*} \mid A^{*}) = \frac{P(I^{*}) \times P(A^{*} \mid I^{*})}{P(I^{*}) \times P(A^{*} \mid I^{*}) + P(I) \times P(A^{*} \mid I)} \quad (16)$$

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained by calculating (1 − FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

$$g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \quad (17)$$

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes (imbalanced dataset) [68]. MCC has a range of −1 to 1, where −1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \quad (18)$$
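For reference, the sketch below computes all twelve metrics from the four outcomes exactly as defined in (7)-(18); it is a straightforward transcription of the formulas above, not code from the paper:

```python
import math

def evaluation_metrics(TP, FP, TN, FN):
    """Compute the twelve evaluation metrics from the four outcomes."""
    total = TP + TN + FP + FN
    precision = TP / (TP + FP)                       # (8)
    recall = TP / (TP + FN)                          # (9); also Hit Rate (11)
    miss = FN / (FN + TP)                            # (12)
    far = FP / (FP + TN)                             # (13)
    p_masq, p_norm = (TP + FN) / total, (FP + TN) / total   # base rates
    return {
        'Accuracy': (TP + TN) / total,               # (7)
        'Precision': precision,
        'Recall': recall,
        'F1-Score': 2 / (1 / precision + 1 / recall),        # (10)
        'Hit': recall,
        'Miss': miss,
        'FAR': far,
        'Cost': miss + 6 * far,                      # (14)
        'BDR': (p_masq * recall) /                   # (15)
               (p_masq * recall + p_norm * far),
        'BTNR': (p_norm * (1 - far)) /               # (16)
                (p_norm * (1 - far) + p_masq * miss),
        'g-mean': math.sqrt(TP * TN / ((TP + FN) * (TN + FP))),   # (17)
        'MCC': (TP * TN - FP * FN) / math.sqrt(      # (18)
               (TP + FN) * (TP + FP) * (TN + FP) * (TN + FN)),
    }
```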

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model to detect masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, as well as lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1, as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. The rows labeled DNN and LSTM-RNN in Table 7 show the results of static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show the results of dynamic masquerade detection using the CNN model. Furthermore, bold rows in the original table represent the best results within the same data configuration, whereas underlined values are the best across all data configurations.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm is used to optimize the selection of DNN hyperparameters to maximize the accuracy, which means that the sum of the TP and TN outcomes will be increased significantly. Thus, according to (11) and (13), increasing the sum of TP and TN will definitely increase the value of Hit as well as decrease the value of FAR.

Table 7: The results of our experiments. Columns (all in %): Accuracy, Precision, Recall, F1-Score, Hit, Miss, FAR, Cost, BDR, BTNR, g-mean, MCC.

SEA Dataset, SEA:
  DNN       98.08  76.26  84.85  80.33  84.85  15.15  1.28  22.83  76.25  99.26  91.52  79.45
  LSTM-RNN  98.52  82.30  86.58  84.39  86.58  13.42  0.90  18.83  82.33  99.34  92.63  83.64
  CNN       98.84  87.77  87.01  87.39  87.01  12.99  0.59  16.51  87.72  99.37  93     86.78

SEA Dataset, SEA 1v49:
  DNN       96.54  99.98  96.43  98.17  96.43   3.57  0.48   6.47  99.98  52.04  97.96  70.64
  LSTM-RNN  97.86  99.98  97.79  98.87  97.79   2.21  0.38   4.48  99.98  63.70  98.7   78.74
  CNN       98.78  99.99  98.74  99.36  98.74   1.26  0.19   2.40  99.99  75.51  99.27  86.22

Greenberg Dataset, Greenberg Truncated:
  DNN       93.97  92.23  80.67  86.06  80.67  19.33  2.04  31.57  92.22  94.41  88.89  82.53
  LSTM-RNN  94.72  94.88  81.53  87.70  81.53  18.47  1.32  26.39  94.87  94.68  89.7   84.76
  CNN       95.43  96.16  83.53  89.40  83.53  16.47  1.0   22.47  96.16  95.24  90.94  86.86

Greenberg Dataset, Greenberg Enriched:
  DNN       97.57  96.92  92.40  94.61  92.40   7.60  0.88  12.88  96.92  97.75  95.7   93.08
  LSTM-RNN  97.98  97.57  93.60  95.54  93.60   6.40  0.70  10.60  97.56  98.10  96.41  94.28
  CNN       98.60  98.55  95.33  96.92  95.33   4.67  0.42   7.19  98.55  98.61  97.43  96.03

PU Dataset, PU Truncated:
  DNN       81.0   99.59  78.61  87.86  78.61  21.39  2.25  34.89  99.59  39.49  87.66  54.63
  LSTM-RNN  82.19  99.69  79.89  88.70  79.89  20.11  1.75  30.61  99.68  41.10  88.6   56.46
  CNN       83.75  99.74  81.64  89.79  81.64  18.36  1.50  27.36  99.73  43.38  89.68  58.79

PU Dataset, PU Enriched:
  DNN       90.44  99.84  89.21  94.23  89.21  10.79  1.0   16.79  99.84  56.72  93.98  70.64
  LSTM-RNN  91.31  99.88  90.18  94.78  90.18   9.82  0.75  14.32  99.88  59.08  94.61  72.61
  CNN       93.75  99.92  92.93  96.30  92.93   7.07  0.50  10.07  99.92  66.78  96.16  78.52

Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than the corresponding values of the SEA data configuration, the Hit values are dramatically increased in SEA 1v49 for all models, by 10-14% over those in the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49 compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models in terms of all used evaluation metrics over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened with the PU dataset, where its PU Enriched data configuration achieved better results for all models than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, regarding all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Because of that, enriched data configurations help models build a user's behavior profile more accurately than truncated data configurations. Regarding all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. However, the data configurations can be sorted for all used models from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitations, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset, as well as across all datasets.

By taking a close look at Figures 9 and 10, we can notice the stability of the deep learning models, in the sense that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain that, we will discuss the obtained results from the perspective of static and dynamic masquerade detection techniques.

[Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient.]

We used the DNN and LSTM-RNN models to perform a static masquerade detection task on data configurations with static numeric features. The DNN, as well as the LSTM-RNN, is supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Given the importance of the former fact, our DNN and LSTM-RNN models output masquerade detection outcomes that are as good as they can reach for every user in the particular data configuration. Accordingly, their performance is enhanced significantly on that particular data configuration. This enhancement of their performance is also affected by the structure of the data configuration, which differs from one to another. Overall, LSTM-RNN performed better than DNN in terms of all used evaluation metrics on all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small, between 1% and 3% for Hit and Accuracy and between 0.2% and 0.8% for FAR in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is command text files for each user in the particular data configuration. The obtained results show clearly that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relation between a user's individual commands can be recognized. The extracted features are then passed to its fully connected layers, which train themselves to build the user's normal profile, used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, such as Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, the dynamic masquerade detection technique is better than the static one. This gives the impression that dynamic masquerade detection is the best choice for masquerade detection regarding UNIX command line-based datasets, due to the fact that these datasets are originally textual, and converting them to static numeric datasets may lose a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models achieved high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used deep learning models achieved high g-mean percentages for all data configurations. The same holds for the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

[Figure 10: Evaluation metrics comparison for the average performance of the models on datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient.]

Table 8: The results of statistical tests.

Measurement   Friedman Test      Wilcoxon Test
              FS      FC         p1: W, P-value    p2: W, P-value    p3: W, P-value
TP            12      7          0, 0.0025         0, 0.0025         0, 0.0025
FP            12      7          0, 0.0025         0, 0.0025         0, 0.0025
TN            12      7          0, 0.0025         0, 0.0025         0, 0.0025
FN            12      7          0, 0.0025         0, 0.0025         0, 0.0025


In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume that the data comes from a particular distribution. In our case, we have three repeated treatments (k = 3), one for each of the used deep learning models, and six subjects (N = 6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k = 2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N = 6) in every treatment and three paired groups, namely, p1 = (DNN, LSTM-RNN), p2 = (DNN, CNN), and p3 = (LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α = 0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
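Both tests are available in SciPy, as the sketch below illustrates; the per-configuration scores are illustrative stand-ins, not the paper's measurements:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# One score per data configuration (N = 6) for each model (k = 3);
# these TP-like values are stand-ins only.
dnn  = [196, 23620, 1937, 2218, 755, 857]
lstm = [200, 23953, 1957, 2246, 767, 866]
cnn  = [201, 24186, 2005, 2288, 784, 893]

fs, p = friedmanchisquare(dnn, lstm, cnn)   # Friedman test over the 3 treatments
w12, p12 = wilcoxon(dnn, lstm)              # one of the three paired Wilcoxon tests
print(fs, p, w12, p12)
```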

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS > FC. This means that the scores of the used deep learning models differ for each measurement. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models. In our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we can say that we have statistically significant evidence that the medians of every paired group are different. Finally, the reason all measurements give the same results is that the models, in the order CNN, LSTM-RNN, DNN, have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.
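The reported CD value can be reproduced with the Nemenyi critical difference formula from Demšar [72] (an assumption about the exact procedure used), where the critical value q_{0.05} ≈ 2.343 for k = 3 classifiers:

$$CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}} = 2.343\sqrt{\frac{3 \times 4}{6 \times 6}} \approx 1.353$$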

[Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations (CNN ranks first, followed by LSTM-RNN, then DNN).]

Figures 12(a)-12(e) show comparisons between the performance of traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, due to the PSO-based algorithm for hyperparameter selection used with DNN and LSTM-RNN, as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased Hit percentages by 2-10% and decreased FAR percentages by 1-10% compared with the traditional machine learning models in most cases.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC plot is the reference line, which means that 50% of performance is achieved. The top-left corner of the ROC plot means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.

[Figure 12: Models performance comparison for each data configuration, showing Hit and FAR for the traditional models (Naive Bayes, Conditional Naive Bayes, SVM, HMM, tree-based) and the used deep learning models: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched.]

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, achieve effective masquerade detection performance over all data configurations. However, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing quantitatively between various ROC curves [73]. The AUC value of a ROC curve lies between 0 and 1. The ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13.

We can clearly notice that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
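A ROC curve and its AUC can be produced from a model's scores with scikit-learn, as in the hedged sketch below (an assumption about tooling; labels and scores are stand-ins, with 1 = masquerader):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                    # stand-in labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]   # stand-in masquerader scores

fpr, tpr, _ = roc_curve(y_true, y_score)
print('AUC =', auc(fpr, tpr))

plt.plot(fpr, tpr)                                    # the model's ROC curve
plt.plot([0, 1], [0, 1], linestyle='--')              # diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
```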

Table 9: AUC values of ROC curves of the used models.

Model       AUC
DNN         0.9246
LSTM-RNN    0.9385
CNN         0.9617

[Figure 13: ROC curves of the average performance of the used models over all data configurations.]

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Even though various research studies have focused on masquerade detection for more than a decade, deep studies in this field utilizing deep learning models are seldom. In this paper, we presented an extensive empirical study of masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most commonly used in the literature. In addition to that, we implemented six different data configurations from these datasets. Masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, whereas the dynamic approach is performed using the CNN model, which extracts features from a user's command text files dynamically. In order to solve the problem of hyperparameter selection, as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models and analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models performed well in masquerade detection regarding the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static detection. However, the results analyses proved the effectiveness of all used models in masquerade detection, in such a way that they increased Accuracy and Hit percentages as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools that can be used in the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html, the Greenberg dataset upon request from its owner at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme, and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.
[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.
[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.
[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.
[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159–164, Wuhan, Hubei Province, China, November 2016.
[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.
[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58–74, 2001.
[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488–495, Springer, Berlin, Germany, 2003.
[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219–228, USA, June 2002.
[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10–19, Melbourne, FL, USA, 2003.
[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48–62, Springer, Berlin, Germany, 2003.
[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329–340, Berlin, Germany, 2004.
[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37–53, Berlin, Germany, 2006.
[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625–628, China, October 2006.
[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160–168, 2005.
[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.
[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5–14, USA, June 2003.
[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676–3679, China, September 2007.
[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366–380, Baltimore, USA, 1997.
[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388–397, Berlin, Germany, 2005.
[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13–22, 2014.
[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824–827, Poland, November 2015.
[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546–2554, 2011.
[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012.
[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951–2959, USA, December 2012.
[26] O. Ahmed Abdalla, A. Osman Elfaki, and Y. Mohammed AlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42–48, 2014.
[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141–146, Belgium, April 2016.
[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69–74, China, August 2016.
[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.
[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659–1666, Spain, June 2017.
[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481–488, New York, NY, USA, July 2017.
[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864–1871, New York, NY, USA, July 2017.
[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284–295, Springer, 2017.
[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.
[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39–43, Nagoya, Japan, October 1995.
[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405–440, 2009.
[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591–600, Springer, Berlin, Germany, 1998.
[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945–1950, 1999.
[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671–1676, Honolulu, HI, USA, May 2002.
[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58–73, 2002.
[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954–21961, 2017.
[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157–166, 1994.
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.
[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163–222, Springer, Boston, MA, USA, 2012.
[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.
[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.
[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103–112, Denver, Colorado, 2015.
[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649–657, 2015.
[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364–371, Cancun, Mexico, December 2017.
[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267–2273, 2015.
[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.
[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489, June 2016.
[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.
[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.
[57] "Python," https://www.python.org.
[58] "NumPy," http://www.numpy.org.
[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.
[60] "Keras," https://keras.io.
[61] M. Abadi, A. Agarwal, P. Barham, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.
[62] "TensorFlow," https://www.tensorflow.org.
[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.
[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.
[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1–7, November 1999.
[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389–398, Springer, 2009.
[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179–186, Nashville, USA, 1997.
[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.
[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262–274, PWS-Kent, Boston, 1990.
[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313–320, 2004.

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 13: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

Security and Communication Networks 13

Userrsquos Command Text Files

Quantization

Input Layer

Convolutional layer

C1 features map P1 features map

Max-Pooling layer

C2 P2 C6 P6

Fully-Connected dropout layers

2048 sigmoid neurons

2048 sigmoid neurons 2

softmaxneurons

Outputdense layer

0 (Normal)1 (Masquerader)

Figure 7 The architecture of the used CNNmodel

the complexity and computations by reducing the size of theirpreceding layerrsquos output There are many pooling nonlinearfunctions that can be used but among them max-poolingis the mostly used which selects the maximum value in thegiven pooling window Typically each convolutional layer inthe CNN is followed by a max-pooling layer The CNN hasone or more stacked convolutional layer and max-poolinglayer pairs to extract features from the entire input and thenmap these features to their next fully connected layerThe toplayers of the CNN are one or more of fully connected layerswhich are similar to hidden layers in the DNN This meansthat neurons of the fully connected layers are connected to allneurons of the preceding layer The output layer is the finallayer in the CNN and is responsible for reporting the outputvalue of the CNN Finally the back-propagation algorithm isusually used to train CNNs via Stochastic Gradient Decent(SGD) to adjust the weights of the fully connected layers [56]There are several variant structures of CNN that are proposedin the literature but LeNet structure which is proposed byLeCun et al [44] is themost common approach used inmanyapplications of computer vision and text classification

Regarding its stability and high efficiency in text clas-sification we selected the CNN model which is proposedin [50] to perform a dynamic masquerade detection on alldata configurationsThe usedmodel is a character-level CNNthat takes a text file as input and outputs the classificationscore (0 if the input text file is related to a normal user or1 otherwise) The used CNN model is from LeNet familyand consists of an input layer followed by six convolutionand max-pooling pairs followed by two fully connectedlayers and finally followed by an output layer In the inputlayer the text quantization process takes place when theused model encodes all letters in the input text file using aone-hot representation from a 70-character alphabet All theconvolutional layers in the used CNN model have a ReLUnonlinear activation functionThe two fully connected layersin the used CNN model are of the type dropout layer withdropout probability equal to 05 In addition to that the twofully connected layers in the usedCNNmodel have a Sigmoidnonlinear activation function as well as they have the samesize of 2048 neurons of each The output layer in the usedCNN model is of the type dense layer as well as it has asoftmax activation function and size of two neurons Theused CNN model is trained by back-propagation algorithmvia SGD Finally we set the following parameters to the

used CNN model learning rate=001 epochs=30 and batchsize=64 These values are obtained experimentally by per-forming a grid search to find the best possible values of theseparameters Figure 7 shows the architecture of the used CNNmodel and is reproduced from Zhang et al (2015) [under theCreative Commons Attribution Licensepublic domain]

In our work we used a CNNmodel to perform a dynamicmasquerade detection task on all data configurations Asmentioned in Section 511 there are six data configurationsand each of them will be used in the separate experimentSo we will have six separate CNN-experiments and eachexperiment will be on one of the data configurations Themethodology of all of these experiments is the same and asfollows for the given data configurationD we firstly preparedall the given data configurationrsquos text files such that each file ofthem represents the training and test sets of a user in119863 Nextto that for each user 119880119894 in D where i=12 M and119872 is thenumber of users in D we did the following steps we split thedata of 119880119894 into two independent sets 119879119894 and 119885119894 which are thetraining and test sets of the ith user in D respectively Thesplitting process followed the structure of the particular dataconfiguration which is described in Section 3 Furthermorewe also moved each block in the training and test sets of theuser 119880119894 to a separate text file This means that each of thetraining and test sets of the user 119880119894 consists of a specifiednumber of text files in which each text file contains one blockof UNIX commands After that we constructed the usedCNN model The constructed CNN model is trained on 119879119894and then tested on 119885119894 After the test process finished weextracted and saved the outcomes TP119894 FP119894 TN 119894 and FN 119894 ofthe ith user in 119863 Then we proceed to the next user in 119863 todo the same previous steps until the last user in119863 is reachedAfter all users in 119863 are completed we computed the overalloutcomes TP FP TN and FN of the data configuration119863 byusing (3) (4) (5) and (6) respectively Figure 8 depicts theflowchart of the methodology of CNN-experiments

6 Results and Discussion

We carried out three major empirical experiments whichareDNN-experiments LSTM-RNN-experiments andCNN-experiments Each of them consists of six separate subex-periments where each subexperiment is performed on oneof the data configurations SEA SEA 1v49 Greenberg Trun-cated Greenberg Enriched PU Truncated and PU Enriched

14 Security and Communication Networks

YesNo

Start

(1) Input

(2) Prepare text files of D

(4) Split data of Ui

Ti and Zi text sets(6) Construct the used CNN model

(7) Train CNN model on Ti

(8) Test CNN model on Zi

(13) Output TP FP TN and FNEnd

(5) Move each block in Ti and Zi to a separate text file

Data configuration D M

(3) ilarr1

(12) Compute and save TPFP TN and FN for D

(9) Obtain and save TPi FPi TNiand FNi for the user Ui

(11) Is i gt M

(10) ilarri+1

into

Figure 8 The flowchart of the CNN-experiments

Table 6 The confusion matrix of the masquerade detection out-comes

Actual Class Predicted ClassNormal User Masquerader

Normal User TN FPMasquerader FN TP

Basically our PSO-based DNN hyperparameters selectionalgorithmwas implemented in Python 364 [57]withNumPy[58] Moreover all models (DNN LSTM-RNN CNN) wereconstructed and trained and tested based on Keras [59 60]with TensorFlow 16 [61 62] that backend over CUDA 90[63] and cuDNN 70 [64] In addition to that all experimentswere performed on a workstation with an Intel Core i7 CPU(38GHz 16 MB Cache) 16GB of RAM and theWindows 10operating system In order to accelerate the computations inall experiments we also used a GPU-accelerated computingwith NVIDIA Tesla K20 GPU 5GB GDDR5The experimen-tal environment is processed in 64-bit mode

In any classification task, we have four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). We get a TP when a masquerader is correctly classified as a masquerader. Whenever a good user is correctly classified as a good user, we say it is a TN. A FP occurs when a good user is misclassified as a masquerader. In contrast, a FN occurs when a masquerader is misclassified as a good user. Table 6 shows the confusion matrix of the masquerade detection outcomes. For each data configuration, we used the obtained outcomes for that data configuration to compute twelve well-known evaluation metrics. After that, by using these evaluation metrics, we assessed the performance of each deep learning model on that data configuration.

For simplicity, we divided these evaluation metrics into two categories: General Classification Measures and Masquerade Detection Measures. The General Classification Measures are metrics used for any classification task, namely, Accuracy, Precision, Recall, and F1-Score. On the other hand, Masquerade Detection Measures are metrics usually used for a masquerade or intrusion detection task, which are Hit Rate, Miss Rate, False Alarm Rate, Cost, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient. The used evaluation metrics' definitions and their corresponding equations are as follows.

(i) Accuracy shows the rate of true detection over all test sets:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7) \]

(ii) Precision shows the rate of correctly classified masqueraders over all blocks in the test set that are classified as masqueraders:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \quad (8) \]

(iii) Recall shows the rate of correctly classified masqueraders over all masquerader blocks in the test set:

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \quad (9) \]

(iv) F1-Score gives information about the accuracy of a classifier regarding both the Precision (P) and Recall (R) metrics:

\[ F1\text{-}\mathrm{Score} = \frac{2}{1/P + 1/R} \quad (10) \]

(v) Hit Rate shows the rate of correctly classified masquerader blocks over all masquerader blocks presented in the test set. It is also called Hits, True Positive Rate, or Detection Rate:

\[ \mathrm{Hit\ Rate} = \frac{TP}{TP + FN} \quad (11) \]

(vi) Miss Rate is the complement of Hit Rate (Miss = 100 - Hit); that is, it shows the rate of masquerade blocks that are misclassified as a normal user over all masquerade blocks in the test set. It is also called Misses or False Negative Rate:

\[ \mathrm{Miss\ Rate} = \frac{FN}{FN + TP} \quad (12) \]

(vii) False Alarm Rate (FAR) gives the rate of normal user blocks that are misclassified as a masquerader over all normal user blocks presented in the test set. It is also called False Positive Rate:

\[ \mathrm{False\ Alarm\ Rate} = \frac{FP}{FP + TN} \quad (13) \]

(viii) Cost is a metric that was proposed in [9] to evaluate the efficiency of a classifier concerning both the Miss Rate (MR) and False Alarm Rate (FAR) metrics:

\[ \mathrm{Cost} = MR + 6 \times FAR \quad (14) \]

(ix) Bayesian Detection Rate (BDR) is a metric based on the Base-Rate Fallacy problem, which was addressed by S. Axelsson in 1999 [65]. The Base-Rate Fallacy is a basis of Bayesian statistics and occurs when people do not take the basic rate of incidence (the base rate) into account when solving problems in probabilities. Unlike the Hit Rate metric, BDR shows the rate of correctly classified masquerader blocks over the whole test set, taking into consideration the base rate of masqueraders. Let I and I* denote a masquerade and a normal behavior, respectively, and let A and A* denote the predicted masquerade and normal behavior, respectively. Then BDR can be computed as the probability P(I | A) according to (15) [65]:

\[ \mathrm{BDR} = P(I \mid A) = \frac{P(I) \times P(A \mid I)}{P(I) \times P(A \mid I) + P(I^{*}) \times P(A \mid I^{*})} \quad (15) \]

P(I) is the rate of the masquerader blocks in the test set, P(A | I) is the Hit Rate, P(I*) is the rate of the normal blocks in the test set, and P(A | I*) is the FAR.

(x) Bayesian True Negative Rate (BTNR) is also based on the Base-Rate Fallacy and shows the rate of truly classified normal blocks over the whole test set in which the predicted normal behavior really indicates a normal user [65]. With I, I*, A, and A* as above, BTNR can be computed as the probability P(I* | A*) according to (16) [65]:

\[ \mathrm{BTNR} = P(I^{*} \mid A^{*}) = \frac{P(I^{*}) \times P(A^{*} \mid I^{*})}{P(I^{*}) \times P(A^{*} \mid I^{*}) + P(I) \times P(A^{*} \mid I)} \quad (16) \]

P(I*) is the rate of the normal blocks in the test set, P(A* | I*) is the True Negative Rate, which is easily obtained as (1 - FAR), P(I) is the rate of the masquerader blocks in the test set, and P(A* | I) is the Miss Rate.

(xi) Geometric Mean (g-mean) is a performance metric that combines the true negative rate and the true positive rate at one specific threshold where both errors are considered equal. This metric has been used by several researchers for evaluating classifiers on imbalanced datasets [66]. It can be computed according to (17) [67]:

\[ g\text{-}mean = \sqrt{\frac{TP \times TN}{(TP + FN) \times (TN + FP)}} \quad (17) \]

(xii) Matthews Correlation Coefficient (MCC) is a performance metric that takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes (imbalanced datasets) [68]. MCC ranges from -1 to 1, where -1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Unlike the other metrics discussed above, MCC takes all the cells of the confusion matrix into consideration in its formula, which can be computed according to (18) [69]:

\[ \mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \quad (18) \]
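For reference, the twelve metrics can be computed directly from the four outcome counts; the function below is a plain transcription of (7)-(18), not code from the paper:

```python
import math

def metrics(TP, FP, TN, FN):
    """Compute the twelve evaluation metrics (7)-(18) from raw outcome counts."""
    total = TP + TN + FP + FN
    P = TP / (TP + FP)                 # Precision, (8)
    R = TP / (TP + FN)                 # Recall = Hit Rate, (9) and (11)
    MR = FN / (FN + TP)                # Miss Rate, (12)
    FAR = FP / (FP + TN)               # False Alarm Rate, (13)
    p_i = (TP + FN) / total            # base rate P(I) of masquerader blocks
    return {
        "accuracy": (TP + TN) / total,                              # (7)
        "precision": P,
        "recall": R,
        "f1": 2 / (1 / P + 1 / R),                                  # (10)
        "hit": R,
        "miss": MR,
        "far": FAR,
        "cost": MR + 6 * FAR,                                       # (14)
        "bdr": p_i * R / (p_i * R + (1 - p_i) * FAR),               # (15)
        "btnr": ((1 - p_i) * (1 - FAR)
                 / ((1 - p_i) * (1 - FAR) + p_i * MR)),             # (16)
        "g_mean": math.sqrt(TP * TN / ((TP + FN) * (TN + FP))),     # (17)
        "mcc": ((TP * TN - FP * FN)
                / math.sqrt((TP + FN) * (TP + FP)
                            * (TN + FP) * (TN + FN))),              # (18)
    }
```

As a sanity check against Table 7 below: the SEA test set contains 231 masquerader blocks and, assuming the standard SEA split of 5,000 test blocks in total (P(I) ≈ 0.0462), the DNN row (Hit = 84.85%, FAR = 1.28%) gives BDR = (0.0462 × 0.8485) / (0.0462 × 0.8485 + 0.9538 × 0.0128) ≈ 76.25%, matching the tabulated value.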

In the following two subsections, we present our experimental results and explain them using two kinds of analyses: performance analysis and ROC curves analysis.

6.1. Performance Analysis. The effectiveness of any model at detecting masqueraders depends on its values of the evaluation metrics. Higher values of Accuracy, Precision, Recall, F1-Score, Hit Rate, Bayesian Detection Rate, Bayesian True Negative Rate, Geometric Mean, and Matthews Correlation Coefficient, together with lower values of Miss Rate, False Alarm Rate, and Cost, indicate an efficient classifier. The ideal classifier has Accuracy and Hit Rate values that reach 1 as well as Miss Rate and False Alarm Rate values that reach 0. Table 7 presents the percentages of the used evaluation metrics for the DNN-experiments, LSTM-RNN-experiments, and CNN-experiments. The rows labeled DNN and LSTM-RNN in Table 7 show the results of static masquerade detection using the DNN and LSTM-RNN models, respectively, whereas the rows labeled CNN show the results of dynamic masquerade detection using the CNN model. In the original typesetting, bold rows marked the best results within the same data configuration and underlined values the best across all data configurations.

First of all, the impact of using our PSO-based algorithm can be seen in the obtained results of both the DNN and LSTM-RNN models. The PSO-based algorithm optimizes the selection of DNN hyperparameters to maximize accuracy, which means that the sum of the TP and TN outcomes increases significantly. Thus, according to (11) and (13), increasing the sum of TP and TN definitely leads to an increase of the Hit value as well as a decrease of the FAR value. Although the accuracy values of the SEA 1v49 data configuration for all models are slightly lower than

Table 7: The results of our experiments. All evaluation metrics are percentages.

Dataset            Data Configuration    Model      Accuracy  Precision  Recall  F1-Score  Hit    Miss   FAR   Cost   BDR    BTNR   g-mean  MCC
SEA Dataset        SEA                   DNN        98.08     76.26      84.85   80.33     84.85  15.15  1.28  22.83  76.25  99.26  91.52   79.45
                                         LSTM-RNN   98.52     82.30      86.58   84.39     86.58  13.42  0.90  18.83  82.33  99.34  92.63   83.64
                                         CNN        98.84     87.77      87.01   87.39     87.01  12.99  0.59  16.51  87.72  99.37  93.0    86.78
                   SEA 1v49              DNN        96.54     99.98      96.43   98.17     96.43  3.57   0.48  6.47   99.98  52.04  97.96   70.64
                                         LSTM-RNN   97.86     99.98      97.79   98.87     97.79  2.21   0.38  4.48   99.98  63.70  98.7    78.74
                                         CNN        98.78     99.99      98.74   99.36     98.74  1.26   0.19  2.40   99.99  75.51  99.27   86.22
Greenberg Dataset  Greenberg Truncated   DNN        93.97     92.23      80.67   86.06     80.67  19.33  2.04  31.57  92.22  94.41  88.89   82.53
                                         LSTM-RNN   94.72     94.88      81.53   87.70     81.53  18.47  1.32  26.39  94.87  94.68  89.7    84.76
                                         CNN        95.43     96.16      83.53   89.40     83.53  16.47  1.0   22.47  96.16  95.24  90.94   86.86
                   Greenberg Enriched    DNN        97.57     96.92      92.40   94.61     92.40  7.60   0.88  12.88  96.92  97.75  95.7    93.08
                                         LSTM-RNN   97.98     97.57      93.60   95.54     93.60  6.40   0.70  10.60  97.56  98.10  96.41   94.28
                                         CNN        98.60     98.55      95.33   96.92     95.33  4.67   0.42  7.19   98.55  98.61  97.43   96.03
PU Dataset         PU Truncated          DNN        81.0      99.59      78.61   87.86     78.61  21.39  2.25  34.89  99.59  39.49  87.66   54.63
                                         LSTM-RNN   82.19     99.69      79.89   88.70     79.89  20.11  1.75  30.61  99.68  41.10  88.6    56.46
                                         CNN        83.75     99.74      81.64   89.79     81.64  18.36  1.50  27.36  99.73  43.38  89.68   58.79
                   PU Enriched           DNN        90.44     99.84      89.21   94.23     89.21  10.79  1.0   16.79  99.84  56.72  93.98   70.64
                                         LSTM-RNN   91.31     99.88      90.18   94.78     90.18  9.82   0.75  14.32  99.88  59.08  94.61   72.61
                                         CNN        93.75     99.92      92.93   96.30     92.93  7.07   0.50  10.07  99.92  66.78  96.16   78.52

the corresponding values of the SEA data configuration, the Hit values increase dramatically in SEA 1v49 for all models, by 10-14%, over those in the SEA data configuration. This is due to the structure of the SEA 1v49 data configuration, where there are 122,500 masquerader blocks in the test set of SEA 1v49 compared to only 231 blocks in the SEA data configuration. Moreover, the FAR values of SEA 1v49 for all models are significantly lower than the corresponding values of the SEA data configuration. Hence, regarding the SEA dataset, SEA 1v49 is better to use in masquerade detection than the SEA data configuration.

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models, in terms of all used evaluation metrics, over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration has more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same thing happened with the PU dataset, where its PU Enriched data configuration achieved better results, for all models, than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations simulate the SEA and SEA 1v49 data configurations, where only the command name is considered. Despite that, for all used models, SEA 1v49 recorded the best results among the truncated data configurations. On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Because of that, enriched data configurations help models build a user's behavior profile more accurately than truncated data configurations do. For all models, the results associated with Greenberg Enriched, especially in terms of Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. Overall, the data configurations can be sorted, for all used models, from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitation, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset, as well as across all datasets.

Taking a closer look at Figures 9 and 10, we can notice the stability of the deep learning models, in that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain that, we will discuss the obtained results from the perspective

[Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient. Each panel plots the metric for DNN, LSTM-RNN, and CNN on SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched.]

of the static and dynamic masquerade detection techniques. We used the DNN and LSTM-RNN models to perform a static masquerade detection task on data configurations with static numeric features. The DNN, as well as the LSTM-RNN, is supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Consequently, our DNN and LSTM-RNN models output the best masquerade detection outcomes they can reach for every user in the particular data configuration, so their performance is enhanced significantly on that data configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. In any case, LSTM-RNN performed better than DNN in terms of all used evaluation metrics on all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy, and between 0.2% and 0.8% for FAR, in all cases.

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, the CNN is used in a text classification task where the input is command text files for each user in the particular data configuration. The obtained results show clearly that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then passed to its fully connected layers, which train to build the user's normal profile; this profile is used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, with Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, dynamic masquerade detection is better than the static masquerade detection technique. This suggests that the dynamic technique is the best choice for masquerade detection on UNIX command line-based datasets, because these datasets are originally textual and converting them to static numeric datasets may lose a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.

Regarding the BDR and BTNR metrics, all the used models attained high values in most cases, which means that the confidence of the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration; that is, BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value get larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration gets larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used

[Figure 10: Evaluation metrics comparison for the average performance of the models on datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient. Each panel plots the metric for DNN, LSTM-RNN, and CNN averaged over the SEA dataset, the Greenberg dataset, the PU dataset, and all datasets.]

Table 8: The results of the statistical tests.

               Friedman Test    Wilcoxon Test
                                p1            p2            p3
Measurement    FS     FC       W   P-value   W   P-value   W   P-value
TP             12     7        0   0.0025    0   0.0025    0   0.0025
FP             12     7        0   0.0025    0   0.0025    0   0.0025
TN             12     7        0   0.0025    0   0.0025    0   0.0025
FN             12     7        0   0.0025    0   0.0025    0   0.0025

deep learning models achieved high g-mean percentages for all data configurations. The same holds for the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume the data comes from a particular distribution. In our case, we have three repeated treatments (k = 3), one for each of the used deep learning models, and six subjects (N = 6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k = 2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N = 6) in every treatment and three paired groups, namely, p1 = (DNN, LSTM-RNN), p2 = (DNN, CNN), and p3 = (LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is a median difference of zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α = 0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements; a SciPy sketch of the two tests follows.
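A sketch of how these tests can be run with SciPy; the per-configuration scores below are made-up placeholders standing in for the actual TP counts of the three models on the six data configurations:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical TP counts per data configuration (N = 6 subjects) for the
# three treatments (k = 3); substitute the real outcome scores.
dnn  = [310, 295, 402, 460, 380, 441]
lstm = [320, 301, 410, 466, 388, 450]
cnn  = [331, 312, 418, 472, 395, 458]

fs, p = friedmanchisquare(dnn, lstm, cnn)   # Friedman test statistic FS
print("Friedman FS = %.3f, p = %.4f" % (fs, p))

for name, (a, b) in {"p1": (dnn, lstm), "p2": (dnn, cnn),
                     "p3": (lstm, cnn)}.items():
    w, p = wilcoxon(a, b)                   # paired Wilcoxon signed-rank test
    print("%s: Wilcoxon W = %.1f, p = %.4f" % (name, w, p))
```

When the scores are ordered CNN > LSTM-RNN > DNN on every configuration, as in our results, the Friedman statistic attains its maximum FS = 12 for k = 3 and N = 6, and each pairwise Wilcoxon statistic is W = 0, in line with Table 8.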

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS > FC. This means that the scores of the used deep learning models differ for each measurement. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models. In our study, we obtained a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we have statistically significant evidence that the medians of every paired group are different. Finally, the reason for the identical results across all measurements is that the models, in the order (CNN, LSTM-RNN, DNN), have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.
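The reported CD value can be checked against the Nemenyi formula CD = q_α · sqrt(k(k + 1) / 6N), assuming the standard critical value q_0.05 ≈ 2.343 for k = 3:

```python
import math

k, N = 3, 6           # three models, six data configurations
q_alpha = 2.343       # Nemenyi critical value for k = 3, alpha = 0.05 (assumed)
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(cd, 4))   # ~1.3528, consistent with the CD of 1.3533 reported above
```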

[Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations, with CNN ranked first, LSTM-RNN second, and DNN third.]

Figures 12(a), 12(b), 12(c), 12(d), and 12(e) show a comparison between the performance of traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, thanks to the PSO-based algorithm for hyperparameters selection used with DNN and LSTM-RNN and the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. In most cases, the used deep learning models increased Hit percentages by 2-10% and decreased FAR percentages by 1-10% relative to the traditional machine learning models.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the values of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, meaning that 50% of performance is achieved; the top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.

[Figure 12: Models performance comparison for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched. Each panel plots Hit and FAR percentages for the literature baselines (Naive Bayes, Conditional Naive Bayes, SVM, HMM, or tree-based, depending on the configuration) and for DNN, LSTM-RNN, and CNN.]

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, achieve effective masquerade detection performance over all data configurations. Moreover, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing various ROC curves quantitatively [73]. The AUC value of a ROC curve lies between 0 and 1; the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13.

We can notice clearly that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
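A minimal sketch of how such ROC curves and AUC values can be produced with scikit-learn (our choice of library, not stated in the paper); y_true and y_score stand for pooled block labels (1 = masquerader) and model scores, here random stand-ins so the snippet runs:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)                  # 1 = masquerader block
y_score = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, 1000), 0.0, 1.0)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FAR and Hit per threshold
print("AUC =", round(auc(fpr, tpr), 4))
```

Real curves would pool each model's per-threshold Hit/FAR values over all data configurations, as in Figure 13.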

Table 9: AUC values of the ROC curves of the used models.

Model       AUC
DNN         0.9246
LSTM-RNN    0.9385
CNN         0.9617

[Figure 13: ROC curves of the average performance of the used models (CNN, LSTM-RNN, DNN) over all data configurations; True Positive Rate versus False Positive Rate.]

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Even though various research studies have focused on masquerade detection for more than a

decade, a deep study of the field utilizing deep learning models is seldom found. In this paper, we presented an extensive empirical study of masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the ones most used in the literature. In addition, we implemented six different data configurations from these datasets. Masquerade detection on these data configurations was carried out using two approaches, the first static and the second dynamic. The static approach is performed using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, while the dynamic approach is performed using the CNN model, which extracts features from a user's command text files dynamically. In order to solve the problem of hyperparameters selection, as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models performed well in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static. The results analyses also proved the effectiveness of all used models in masquerade detection, in that they increased Accuracy and Hit as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon request from its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.
[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.
[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.
[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.
[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.
[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.
[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.
[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219-228, USA, June 2002.
[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.
[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.
[12] K. H. Yung, "Using self-consistent naive-Bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.
[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.
[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625-628, China, October 2006.
[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.
[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.
[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.
[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676-3679, China, September 2007.
[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.
[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.
[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-Bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.
[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824-827, Poland, November 2015.
[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for hyper-parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.
[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.
[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951-2959, USA, December 2012.
[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.
[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141-146, Belgium, April 2016.
[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69-74, China, August 2016.
[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.
[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659-1666, Spain, June 2017.
[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481-488, New York, NY, USA, July 2017.
[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864-1871, New York, NY, USA, July 2017.
[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.
[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.
[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.
[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.
[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 IEEE Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945-1950, 1999.
[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.
[40] M. Clerc and J. Kennedy, "The particle swarm: explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.
[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.
[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.
[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.
[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.
[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.
[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.
[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.
[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.
[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.
[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267-2273, 2015.
[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.
[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.
[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.
[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.
[57] "Python," https://www.python.org.
[58] "NumPy," http://www.numpy.org.
[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.
[60] "Keras," https://keras.io.
[61] M. Abadi, A. Agarwal, P. Barham, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.
[62] "TensorFlow," https://www.tensorflow.org.
[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.
[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.
[65] S. Axelsson, "The base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.
[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalanced data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.
[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.
[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.
[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.
[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.
[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.
[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.
[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.


Page 14: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

14 Security and Communication Networks

YesNo

Start

(1) Input

(2) Prepare text files of D

(4) Split data of Ui

Ti and Zi text sets(6) Construct the used CNN model

(7) Train CNN model on Ti

(8) Test CNN model on Zi

(13) Output TP FP TN and FNEnd

(5) Move each block in Ti and Zi to a separate text file

Data configuration D M

(3) ilarr1

(12) Compute and save TPFP TN and FN for D

(9) Obtain and save TPi FPi TNiand FNi for the user Ui

(11) Is i gt M

(10) ilarri+1

into

Figure 8 The flowchart of the CNN-experiments

Table 6 The confusion matrix of the masquerade detection out-comes

Actual Class Predicted ClassNormal User Masquerader

Normal User TN FPMasquerader FN TP

Basically our PSO-based DNN hyperparameters selectionalgorithmwas implemented in Python 364 [57]withNumPy[58] Moreover all models (DNN LSTM-RNN CNN) wereconstructed and trained and tested based on Keras [59 60]with TensorFlow 16 [61 62] that backend over CUDA 90[63] and cuDNN 70 [64] In addition to that all experimentswere performed on a workstation with an Intel Core i7 CPU(38GHz 16 MB Cache) 16GB of RAM and theWindows 10operating system In order to accelerate the computations inall experiments we also used a GPU-accelerated computingwith NVIDIA Tesla K20 GPU 5GB GDDR5The experimen-tal environment is processed in 64-bit mode

In any classification task we have four possible outcomesTrue Positive (TP) True Negative (TN) False Positive (FP)and False Negative (FN) We get a TP when a masqueraderis correctly classified as a masquerader Whenever a gooduser is correctly classified as a good user itself we say it isa TN A FP occurs when a good user is misclassified as amasquerader In contrast FN occurs when a masqueraderis misclassified as a good user Table 6 shows the ConfusionMatrix of the masquerade detection outcomes For eachdata configuration we used the obtained outcomes for thatdata configuration to compute twelve well-known evaluationmetrics After that by using these evaluation metrics weassessed the performance of each deep learningmodel on thatdata configuration

For simplicity we divided these evaluation metrics intotwo categories General Classification Measures and Mas-querade Detection Measures The General ClassificationMeasures are metrics that are used for any classification tasknamely Accuracy Precision Recall and F1-Score On theother handMasquerade DetectionMeasures are metrics thatusually are used for a masquerade or intrusion detection

task which are Hit Rate Miss Rate False Alarm RateCost Bayesian Detection Rate Bayesian True Negative RateGeometric Mean andMatthews Correlation CoefficientTheused evaluation metrics definition and their correspondingequations are as follows

(i) Accuracy shows the rate of true detection over all testsets

119860119888119888119906119903119886119888119910 = 119879119875 + 119879119873119879119875 + 119879119873 + 119865119875 + 119865119873 (7)

(ii) Precision shows the rate of correctly classified mas-queraders from all blocks in the test set that areclassified as masqueraders

119875119903119890119888119894119904119894119900119899 = 119879119875119879119875 + 119865119875 (8)

(iii) Recall shows the rate of correctly classified masquer-aders over all masquerader blocks in the test set

119877119890119888119886119897119897 = 119879119875119879119875 + 119865119873 (9)

(iv) F1-Score gives information about the accuracy of aclassifier regarding both Precision (P) and Recall (R)metrics

1198651 119878119888119900119903119890 = 21119875 + 1119877 (10)

(v) Hit Rate shows the rate of correctly classified mas-querader blocks over all masquerader blocks pre-sented in the test set It is also called Hits TruePositive Rate or Detection Rate

119867119894119905 119877119886119905119890 = 119879119875119879119875 + 119865119873 (11)

(vi) Miss Rate is the complement of Hit Rate (Miss=100-Hit) ie it shows the rate of masquerade blocksthat are misclassified as a normal user from allmasquerade blocks in the test set It is also calledMisses or False Negative Rate

119872119894119904119904 119877119886119905119890 = 119865119873119865119873 + 119879119875 (12)

Security and Communication Networks 15

(vii) False Alarm Rate (FAR) gives information about therate of normal user blocks that are misclassified as amasquerader over all normal user blocks presented inthe test set It is also called False Positive Rate

119865119886119897119904119890 119860119897119886119903119898 119877119886119905119890 = 119865119875119865119875 + 119879119873 (13)

(viii) Cost is a metric that was proposed in [9] to evaluatethe efficiency of a classifier concerning bothMiss Rate(MR) and False Alarm Rate (FAR) metrics

119862119900119904119905 = 119872119877 + 6 times 119865119860119877 (14)

(ix) Bayesian Detection Rate (BDR) is a metric basedon Base-Rate Fallacy problem which is addressedby S Axelsson in 1999 [65] Base-Rate Fallacy is abasis of Bayesian statistics and occurs when peo-ple do not take the basic rate of incidence (Base-Rate) into their account when solving problems inprobabilities Unlike Hit Rate metric BDR shows therate of correctly classified masquerader blocks overall test set taking into consideration the base-rate ofmasqueraders Let I and Ilowast denote a masquerade anda normal behavior respectively Moreover let A andAlowast denote the predicated masquerade and normalbehavior respectivelyThen BDR can be computed asthe probability P(I | A) according to (15) [65]119861119886119910119890119904119894119886119899 119863119890119905119890119888119905119894119900119899 119877119886119905119890 = 119875 (119868 | 119860)

= 119875 (119868) times 119875 (119860 | 119868)119875 (119868) times 119875 (119860 | 119868) + 119875 (119868lowast) times 119875 (119860 | 119868lowast)(15)

P(I) is the rate of the masquerader blocks in the testset P(A | I) is the Hit Rate P(Ilowast) is the rate of thenormal blocks in the test set and P(A | Ilowast) is the FAR

(x) Bayesian True Negative Rate (BTNR) is also basedon Base-Rate Fallacy and shows the rate of trulyclassified normal blocks over all test set in which thepredicted normal behavior indicates really a normaluser [65] Let I and Ilowast denote a masquerade and anormal behavior respectively Moreover let A andAlowast denote the predicated masquerade and normalbehavior respectively Then BTNR can be computedas the probability P(Ilowast | Alowast) according to (16) [65]

119861119886119910119890119904119894119886119899 119879119903119906119890 119873119890119892119886119905119894V119890 119877119886119905119890 = 119875 (119868lowast | 119860lowast)= 119875 (119868lowast) times 119875 (119860lowast | 119868lowast)

119875 (119868lowast) times 119875 (119860lowast | 119868lowast) + 119875 (119868) times 119875 (119860lowast | 119868)(16)

P(Ilowast) is the rate of the normal blocks in the test setP(Alowast | Ilowast) is the True Negative Rate which is easilyobtained by calculating (1-FAR) P(I) is the rate of themasquerader blocks in the test set and P(Alowast | I) isthe Miss Rate

(xi) Geometric Mean (g-mean) is a performance metricthat combines true negative rate and true positive

rate at one specific threshold where both the errorsare considered equal This metric has been usedby several researchers for evaluating classifiers onimbalance dataset [66] It can be computed accordingto (17) [67]

119892 119898119890119886119899 = radic 119879119875 times 119879119873(119879119875 + 119865119873) times (119879119873 + 119865119875) (17)

(xii) Matthews Correlation Coefficient (MCC) is a perfor-mance metric that takes into account true and falsepositives and negatives and is generally regarded asa balanced measure which can be used even if theclasses are of very different sizes (imbalance dataset)[68] MCC has a range of minus1 to 1 where minus1 indicates acompletely wrong binary classifier while 1 indicates acompletely correct binary classifier Unlike the othermetrics discussed aboveMCC takes all the cells of theConfusion Matrix into consideration in its formulawhich can be computed according to (18) [69]

119872119862119862= (119879119875 times 119879119873) minus (119865119875 times 119865119873)radic(119879119875 + 119865119873) times (119879119875 + 119865119875) times (119879119873 + 119865119875) times (119879119873 + 119865119873)

(18)

In the following two subsections we will present our experi-mental results and explain them using two kinds of analysesperformance analysis and ROC curves analysis

61 Performance Analysis The effectiveness of any modelto detect masqueraders depends on its values of evaluationmetrics The higher values of Accuracy Precision RecallF1-Score Hit Rate Bayesian Detection Rate Bayesian TrueNegative Rate Geometric Mean and Matthews CorrelationCoefficient as well as the lower values of Miss Rate FalseAlarm Rate and Cost indicate an efficient classifierThe idealclassifier hasAccuracy andHit Rate values that reach 1 as wellasMiss Rate and False AlarmRate values that reach 0 Table 7presents the percentages of the used evaluation metricsfor DNN-experiments LSTM-RNN-experiments and CNN-experiments Actually the rows labeled by DNN and LSTM-RNN in Table 7 show results of the static masquerade detec-tion by using DNN and LSTM-RNN models respectivelywhereas the rows labeled by CNN in Table 7 show resultsof the dynamic masquerade detection by using CNN modelFurthermore the bold rows represent the best results amongthe same data configuration whereas the underlined valuesare the best for all data configurations

First of all the impact of using our PSO-based algorithmcan be seen in the obtained results of both DNN and LSTM-RNN models The PSO-based algorithm is used to optimizethe selection of DNN hyperparameters that maximized theaccuracy which means that the sum of TP and TN outcomeswill be increased significantly Thus according to (11) and(13) increasing the sum of TP and TN will lead definitelyto the increase of the value of Hit as well as to the decreaseof the value of FAR Although the accuracy values of SEA1v49 data configuration for all models are slightly lower than

16 Security and Communication Networks

Table 7 The results of our experiments

Dataset DataConfiguration Model Evaluation Metrics ()

Accuracy Precision Recall F1-Score Hit Miss FAR Cost BDR BTNR g-mean MCC

SEA Dataset

SEADNN 9808 7626 8485 8033 8485 1515 128 2283 7625 9926 9152 7945

LSTM-RNN 9852 8230 8658 8439 8658 1342 090 1883 8233 9934 9263 8364CNN 9884 8777 8701 8739 8701 1299 059 1651 8772 9937 93 8678

SEA 1v49DNN 9654 9998 9643 9817 9643 357 048 647 9998 5204 9796 7064

LSTM-RNN 9786 9998 9779 9887 9779 221 038 448 9998 6370 987 7874CNN 9878 9999 9874 9936 9874 126 019 240 9999 7551 9927 8622

GreenbergDataset

GreenbergTruncated

DNN 9397 9223 8067 8606 8067 1933 204 3157 9222 9441 8889 8253LSTM-RNN 9472 9488 8153 8770 8153 1847 132 2639 9487 9468 897 8476

CNN 9543 9616 8353 8940 8353 1647 10 2247 9616 9524 9094 8686

GreenbergEnriched

DNN 9757 9692 9240 9461 9240 760 088 1288 9692 9775 957 9308LSTM-RNN 9798 9757 9360 9554 9360 640 070 1060 9756 9810 9641 9428

CNN 9860 9855 9533 9692 9533 467 042 719 9855 9861 9743 9603

PU Dataset

PU TruncatedDNN 810 9959 7861 8786 7861 2139 225 3489 9959 3949 8766 5463

LSTM-RNN 8219 9969 7989 8870 7989 2011 175 3061 9968 4110 886 5646CNN 8375 9974 8164 8979 8164 1836 150 2736 9973 4338 8968 5879

PU EnrichedDNN 9044 9984 8921 9423 8921 1079 10 1679 9984 5672 9398 7064

LSTM-RNN 9131 9988 9018 9478 9018 982 075 1432 9988 5908 9461 7261CNN 9375 9992 9293 9630 9293 707 050 1007 9992 6678 9616 7852

the corresponding values of SEA data configuration also Hitvalues are dramatically increased in SEA 1v49 for all modelsby 10-14 from those that are in the SEA data configurationThis is due to the structure of SEA 1v49 data configurationwhere there are 122500 masquerader blocks in the test setof SEA 1v49 comparing to only 231 blocks in the SEA dataconfiguration Moreover the FAR values of SEA 1v49 for allmodels are significantly lower than the corresponding valuesof SEA data configuration Hence regarding SEA datasetSEA 1v49 is better to use in masquerade detection than SEAdata configuration

On the other hand, as we expected, Greenberg Enriched noticeably enhanced the performance of all models, in terms of all used evaluation metrics, over the corresponding values of the Greenberg Truncated data configuration. This can be explained by the fact that the Greenberg Enriched data configuration carries more information about user behavior, including command name, parameters, aliases, and flags, compared to only the command name in Greenberg Truncated. Therefore, regarding the Greenberg dataset, the Greenberg Enriched data configuration is better to use in masquerade detection than Greenberg Truncated. The same happened with the PU dataset, where the PU Enriched data configuration yields better results, for all models, than PU Truncated. Thus, regarding the PU dataset, PU Enriched is better to use in masquerade detection than the PU Truncated data configuration.

Actually, the PU Truncated and Greenberg Truncated data configurations resemble the SEA and SEA 1v49 data configurations, in that only the command name is considered. Despite that, for all used models, SEA 1v49 recorded the best results among the truncated data configurations.

On the other hand, PU Enriched and Greenberg Enriched are considered enriched data configurations, where extra information about users is taken into consideration. Thanks to this, enriched data configurations help the models build a user's behavior profile more accurately than truncated data configurations do. For all models, the results associated with Greenberg Enriched, especially the Accuracy, Hit, and FAR values, are better than the corresponding values of the PU Enriched data configuration, because the PU dataset is a very small masquerade detection dataset with a relatively low number of users (only 8 users). This reason can also explain why few previous works used the PU dataset in masquerade detection. Overall, the data configurations can be sorted, for all used models, from best to worst according to the obtained results as follows: SEA 1v49, Greenberg Enriched, PU Enriched, SEA, Greenberg Truncated, and PU Truncated.

For the sake of brevity and space limitations, we selected a subset of the used performance metrics in Table 7 to be shown visually in Figures 9 and 10. Figures 9(a)-9(h) show the Accuracy, Hit, Miss, FAR, Cost, BDR, F1-Score, and MCC percentages of the used models in each data configuration, respectively. Figures 10(a)-10(f) show the Accuracy, Hit, FAR, BDR, F1-Score, and MCC percentages for the average performance of the used models on the datasets, respectively. Figures 9 and 10 give a visual comparison of the performance of the used deep learning models for each data configuration and dataset, as well as over all datasets.

Taking a closer look at Figures 9 and 10, we can notice the stability of the deep learning models, in the sense that they enhance masquerade detection from one data configuration to another in a consistent pattern. To explain this, we discuss the obtained results from the perspective of the static and dynamic masquerade detection techniques.

Figure 9: Evaluation metrics comparison between models on data configurations: (a) Accuracy, (b) Hit Rate, (c) Miss Rate, (d) False Alarm Rate, (e) Cost, (f) Bayesian Detection Rate, (g) F1-Score, (h) Matthews Correlation Coefficient. Each panel compares DNN, LSTM-RNN, and CNN over the six data configurations (SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, PU Enriched).

We used the DNN and LSTM-RNN models to perform the static masquerade detection task on the data configurations with static numeric features. Both the DNN and the LSTM-RNN are supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of a user. Owing to this, our DNN and LSTM-RNN models output the best masquerade detection results they can reach for every user in a particular data configuration, so their performance is enhanced significantly on that data configuration. This enhancement is also affected by the structure of the data configuration, which differs from one to another. In any case, LSTM-RNN performed better than DNN in terms of all used evaluation metrics, for all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics of LSTM-RNN, which do not exist in DNN, enable LSTM-RNN to memorize previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy, and between 0.2% and 0.8% for FAR, in all cases.
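As an illustration of the static approach, a per-user LSTM-RNN binary classifier can be sketched in Keras as below. The layer width, optimizer, and input shape here are illustrative assumptions only; in our experiments such hyperparameters are selected per user by the PSO-based algorithm rather than fixed by hand.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# One block = a sequence of numerically encoded commands (shape is an assumption).
timesteps, n_features = 100, 1

model = Sequential([
    LSTM(64, input_shape=(timesteps, n_features)),  # LSTM memory cells in the hidden layer
    Dense(1, activation='sigmoid'),                 # 1 = masquerader, 0 = normal user
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, ...) with X_train of shape (blocks, timesteps, n_features)
```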

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Here, CNN is used in a text classification task where the input is the command text files of each user in a particular data configuration. The obtained results show clearly that CNN outperforms both the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations.

This is due to using a deep-structure character-level CNN model, which extracts and learns features from the input text files dynamically, in such a way that the relations between a user's individual commands can be recognized. The extracted features are then passed to its fully connected layers, which train themselves to build the user's normal profile, used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations, such as Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, dynamic masquerade detection is better than the static masquerade detection technique. This suggests that dynamic masquerade detection is the best choice for UNIX command line-based datasets, since these datasets are originally textual, and converting them to static numeric datasets may lose a lot of useful information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.
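A character-level CNN for this text classification task can be sketched as follows; the alphabet size, sequence length, and filter settings are illustrative assumptions in the spirit of [50], not our exact configuration.

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

vocab_size, max_len = 70, 1024   # characters in the alphabet; padded length of a user's text

model = Sequential([
    Embedding(vocab_size, 16, input_length=max_len),  # character representations
    Conv1D(128, 7, activation='relu'),                # local patterns across command characters
    MaxPooling1D(3),
    Conv1D(128, 3, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(64, activation='relu'),                     # fully connected profile-building layers
    Dense(1, activation='sigmoid'),                   # 1 = masquerader, 0 = normal user
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```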

Regarding the BDR and BTNR metrics, all the used models obtained high values in most cases, which means that the confidence in the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration: BDR increases as both the number of masquerader blocks in the test set of the examined data configuration and the Hit value grow. In contrast, BTNR increases as the number of normal blocks in the test set grows and the FAR value shrinks.
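The dependence of BDR and BTNR on the test-set composition can be seen directly from equations (15) and (16); the short sketch below evaluates both at a fixed Hit and FAR while varying the masquerader base rate (the numbers are illustrative, not taken from Table 7).

```python
def bdr(p_mask, hit, far):
    # Equation (15): P(I|A) with P(I) = p_mask, P(A|I) = hit, P(A|I*) = far
    return p_mask * hit / (p_mask * hit + (1 - p_mask) * far)

def btnr(p_mask, hit, far):
    # Equation (16): P(I*|A*) with P(A*|I*) = 1 - far and P(A*|I) = miss = 1 - hit
    p_norm = 1 - p_mask
    return p_norm * (1 - far) / (p_norm * (1 - far) + p_mask * (1 - hit))

hit, far = 0.90, 0.02
for p_mask in (0.01, 0.10, 0.50, 0.90):  # share of masquerader blocks in the test set
    print(f"P(I)={p_mask:.2f}  BDR={bdr(p_mask, hit, far):.4f}  "
          f"BTNR={btnr(p_mask, hit, far):.4f}")
```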

Figure 10: Evaluation metrics comparison for the average performance of the models on the datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient. Each panel compares DNN, LSTM-RNN, and CNN on the SEA, Greenberg, and PU datasets, as well as on all datasets together.


Table 8: The results of the statistical tests (p1 = (DNN, LSTM-RNN), p2 = (DNN, CNN), p3 = (LSTM-RNN, CNN)).

Measurement   Friedman Test    Wilcoxon Test: p1    Wilcoxon Test: p2    Wilcoxon Test: p3
              FS     FC        W      P-value       W      P-value       W      P-value
TP            12     7         0      0.0025        0      0.0025        0      0.0025
FP            12     7         0      0.0025        0      0.0025        0      0.0025
TN            12     7         0      0.0025        0      0.0025        0      0.0025
FN            12     7         0      0.0025        0      0.0025        0      0.0025

Although all the used data configurations are imbalanced, all the used deep learning models achieved high g-mean percentages for all data configurations. The same holds for the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume the data comes from a particular distribution. In our case, we have three repeated treatments (k=3), one for each of the used deep learning models, and six subjects (N=6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k=2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N=6) in every treatment and three paired groups, namely, p1=(DNN, LSTM-RNN), p2=(DNN, CNN), and p3=(LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is that the median difference is zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α=0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
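Both tests are available in SciPy; a minimal sketch of this test setup is shown below, with placeholder arrays standing in for one measurement (e.g., TP) of each model over the six data configurations, not the actual experimental scores.

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Placeholder scores, one value per data configuration (N=6), for each model (k=3).
dnn  = [70, 80, 75, 90, 60, 65]
lstm = [72, 83, 78, 92, 63, 68]
cnn  = [75, 85, 80, 94, 66, 70]

fs, p = friedmanchisquare(dnn, lstm, cnn)   # Friedman test over k=3 treatments
print(f"Friedman: FS={fs:.3f}, p={p:.4f}")

pairs = {"p1=(DNN, LSTM-RNN)": (dnn, lstm),
         "p2=(DNN, CNN)":      (dnn, cnn),
         "p3=(LSTM-RNN, CNN)": (lstm, cnn)}
for name, (a, b) in pairs.items():
    w, p = wilcoxon(a, b)                   # paired signed-rank test per group
    print(f"{name}: W={w}, p={p:.4f}")
```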

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS>FC. This means that the scores of the used deep learning models differ for each measurement. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, the Critical Difference (CD) value equals 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025<0.05) in all cases. Thus, we have statistically significant evidence that the medians of every paired group are different. Finally, the reason all measurements give the same results is that the models, in the order CNN, LSTM-RNN, DNN, have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations (average ranks on a 1-3 scale, with CNN ranked first, LSTM-RNN second, and DNN third; CD = 1.3533).
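For reference, the Nemenyi critical difference underlying Figure 11 follows Demsar [72] as CD = q_alpha * sqrt(k(k+1)/(6N)); with k = 3 models, N = 6 data configurations, and the tabulated q_0.05 of about 2.343, this gives roughly the 1.3533 reported above (the exact digits depend on the precision of q_0.05).

```python
import math

k, n = 3, 6          # models (treatments) and data configurations (subjects)
q_alpha = 2.343      # Nemenyi critical value for k=3 at alpha=0.05 (Demsar's table)

cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))
print(f"CD = {cd:.4f}")   # ~1.353
```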

Figures 12(a)-12(e) show a comparison between the performance of the traditional machine learning models and the used deep learning models, in terms of Hit and FAR percentages, for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models is obvious. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, thanks to the PSO-based algorithm for hyperparameter selection used with DNN and LSTM-RNN, as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. In most cases, the used deep learning models considerably increased the Hit percentages, by 2-10%, and decreased the FAR percentages, by 1-10%, relative to the traditional machine learning models.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC is the reference line, which means that 50% of the performance is achieved. The top-left corner of the ROC means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.
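A ROC curve and its AUC for one model can be produced with scikit-learn as sketched below; y_true and y_score are placeholders for the test labels and the predicted masquerader scores.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                    # 1 = masquerader block
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]   # model's predicted probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)
print(f"AUC = {auc(fpr, tpr):.4f}")

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="reference line (50%)")  # diagonal reference
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```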

Figure 12: Models performance comparison (Hit and FAR percentages) for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched. Each panel contrasts the best traditional machine learning models from the literature (Naive Bayes, Conditional Naive Bayes, SVM, HMM, and tree-based models) with DNN, LSTM-RNN, and CNN.

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, achieve the most effective masquerade detection performance over all data configurations. Nevertheless, all three deep learning models still achieve a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing various ROC curves quantitatively [73]. The AUC value of a ROC curve lies between 0 and 1; the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13. We can clearly notice that all these models have very high AUC values, almost reaching 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.

Table 9: AUC values of the ROC curves of the used models.

Model       AUC
DNN         0.9246
LSTM-RNN    0.9385
CNN         0.9617

Figure 13: ROC curves of the average performance of the used models over all data configurations (True Positive Rate against False Positive Rate for DNN, LSTM-RNN, and CNN).

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Although various research studies have focused on masquerade detection for more than a decade, deep studies in that field utilizing deep learning models are seldom.


In this paper, we presented an extensive empirical study of masquerade detection using the DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most widely used in the literature, and implemented six different data configurations from them. Masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, while the dynamic approach is performed using the CNN model, which extracts features from a user's command text files dynamically. In order to solve the problem of hyperparameter selection, as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis.

Our results show that the used models performed well in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static. The results analyses proved the effectiveness of all used models in masquerade detection, in that they increased the Accuracy and Hit percentages and decreased the FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html, the Greenberg dataset upon request from its owner at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme, and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159-164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58-74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488-495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219-228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10-19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48-62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329-340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37-53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625-628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160-168, 2005.

[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5-14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676-3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366-380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388-397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13-22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824-827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546-2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281-305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951-2959, USA, December 2012.

[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42-48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141-146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69-74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659-1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481-488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864-1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284-295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39-43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405-440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591-600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945-1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm: explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.

[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267-2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org.

[58] "NumPy," http://www.numpy.org.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io.

[61] M. Abadi, A. Agarwal, P. Barham et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] "TensorFlow," https://www.tensorflow.org.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "The base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.

[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.



61 Performance Analysis The effectiveness of any modelto detect masqueraders depends on its values of evaluationmetrics The higher values of Accuracy Precision RecallF1-Score Hit Rate Bayesian Detection Rate Bayesian TrueNegative Rate Geometric Mean and Matthews CorrelationCoefficient as well as the lower values of Miss Rate FalseAlarm Rate and Cost indicate an efficient classifierThe idealclassifier hasAccuracy andHit Rate values that reach 1 as wellasMiss Rate and False AlarmRate values that reach 0 Table 7presents the percentages of the used evaluation metricsfor DNN-experiments LSTM-RNN-experiments and CNN-experiments Actually the rows labeled by DNN and LSTM-RNN in Table 7 show results of the static masquerade detec-tion by using DNN and LSTM-RNN models respectivelywhereas the rows labeled by CNN in Table 7 show resultsof the dynamic masquerade detection by using CNN modelFurthermore the bold rows represent the best results amongthe same data configuration whereas the underlined valuesare the best for all data configurations

First of all the impact of using our PSO-based algorithmcan be seen in the obtained results of both DNN and LSTM-RNN models The PSO-based algorithm is used to optimizethe selection of DNN hyperparameters that maximized theaccuracy which means that the sum of TP and TN outcomeswill be increased significantly Thus according to (11) and(13) increasing the sum of TP and TN will lead definitelyto the increase of the value of Hit as well as to the decreaseof the value of FAR Although the accuracy values of SEA1v49 data configuration for all models are slightly lower than

16 Security and Communication Networks

Table 7 The results of our experiments

Dataset DataConfiguration Model Evaluation Metrics ()

Accuracy Precision Recall F1-Score Hit Miss FAR Cost BDR BTNR g-mean MCC

SEA Dataset

SEADNN 9808 7626 8485 8033 8485 1515 128 2283 7625 9926 9152 7945

LSTM-RNN 9852 8230 8658 8439 8658 1342 090 1883 8233 9934 9263 8364CNN 9884 8777 8701 8739 8701 1299 059 1651 8772 9937 93 8678

SEA 1v49DNN 9654 9998 9643 9817 9643 357 048 647 9998 5204 9796 7064

LSTM-RNN 9786 9998 9779 9887 9779 221 038 448 9998 6370 987 7874CNN 9878 9999 9874 9936 9874 126 019 240 9999 7551 9927 8622

GreenbergDataset

GreenbergTruncated

DNN 9397 9223 8067 8606 8067 1933 204 3157 9222 9441 8889 8253LSTM-RNN 9472 9488 8153 8770 8153 1847 132 2639 9487 9468 897 8476

CNN 9543 9616 8353 8940 8353 1647 10 2247 9616 9524 9094 8686

GreenbergEnriched

DNN 9757 9692 9240 9461 9240 760 088 1288 9692 9775 957 9308LSTM-RNN 9798 9757 9360 9554 9360 640 070 1060 9756 9810 9641 9428

CNN 9860 9855 9533 9692 9533 467 042 719 9855 9861 9743 9603

PU Dataset

PU TruncatedDNN 810 9959 7861 8786 7861 2139 225 3489 9959 3949 8766 5463

LSTM-RNN 8219 9969 7989 8870 7989 2011 175 3061 9968 4110 886 5646CNN 8375 9974 8164 8979 8164 1836 150 2736 9973 4338 8968 5879

PU EnrichedDNN 9044 9984 8921 9423 8921 1079 10 1679 9984 5672 9398 7064

LSTM-RNN 9131 9988 9018 9478 9018 982 075 1432 9988 5908 9461 7261CNN 9375 9992 9293 9630 9293 707 050 1007 9992 6678 9616 7852

the corresponding values of SEA data configuration also Hitvalues are dramatically increased in SEA 1v49 for all modelsby 10-14 from those that are in the SEA data configurationThis is due to the structure of SEA 1v49 data configurationwhere there are 122500 masquerader blocks in the test setof SEA 1v49 comparing to only 231 blocks in the SEA dataconfiguration Moreover the FAR values of SEA 1v49 for allmodels are significantly lower than the corresponding valuesof SEA data configuration Hence regarding SEA datasetSEA 1v49 is better to use in masquerade detection than SEAdata configuration

On the other hand as we expected Greenberg Enrichedenhanced noticeably the performance of all models in termsof all used evaluation metrics from the corresponding val-ues of Greenberg Truncated data configuration This canbe explained by the fact that Greenberg Enriched dataconfiguration has more information about user behaviorincluding command name parameters aliases and flagscomparing to only command name in Greenberg TruncatedTherefore regarding Greenberg dataset Greenberg Enricheddata configuration is better to use in masquerade detectionthan Greenberg Truncated The same thing happened inPU dataset where its PU Enriched data configuration hasbetter results regarding all models than PU Truncated Thusregarding PU dataset PU Enriched is better to use inmasquerade detection than PUTruncated data configuration

Actually PU Truncated and Greenberg Truncated dataconfigurations simulate SEA and SEA 1v49 data configu-rations where only command name is considered Despitethat regarding all used models SEA 1v49 recorded thebest results among the other truncated data configurationsOn the other hand PU Enriched and Greenberg Enriched

are considered as enriched data configurations where extrainformation about users is taken into consideration Due tothat enriched data configurations help models to build userrsquosbehavior profile more accurately than with truncated dataconfigurations Regarding all models the results associatedwithGreenberg Enriched especially in terms ofAccuracyHitand FAR values are better than of the corresponding valuesof PU Enriched data configuration because PU dataset isvery small masquerade detection dataset with a relatively lownumber of users (only 8 users) Also this reason can explainwhy a few previous works used PU dataset in masqueradedetection However data configurations can be sort for allused models from the upper to lower according to theobtained results as follows SEA 1v49 Greenberg EnrichedPU Enriched SEA Greenberg Truncated and PUTruncated

For the sake of brevity and space limitation we selected asubset of the used performancemetrics inTable 7 to be shownvisually in Figures 9 and 10 Figures 9(a) 9(b) 9(c) 9(d)9(e) 9(f) 9(g) and 9(h) showAccuracy HitMiss FAR CostBDR F1-Score and MCC percentages of the used modelsin each data configuration respectively Figures 10(a) 10(b)10(c) 10(d) 10(e) and 10(f) show Accuracy Hit FAR BDRF1-Score and MCC percentages for the average performanceof the used models on datasets respectively Figures 9 and10 can give us a visual comparison of the performance of theused deep learning models for each data configuration anddataset as well as in all datasets

By taking an inspective look to Figures 9 and 10 we cannotice the stability of deep learning models in such a waythat they are enhancing masquerade detection from a dataconfiguration to another in a consistent pattern To explainthat we will discuss the obtained results from the perspective

Security and Communication Networks 17

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnriched

PU EnrichedPU Truncated

0102030405060708090

100

Accura

cy (

)

(a)

0102030405060708090

100

Hit

()

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(b)

0

5

10

15

20

25

Miss

()

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(c)

002040608

112141618

22224

FAR

()

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(d)

0

5

10

15

20

25

30

35

Cos

t (

)

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(e)

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU Truncated

0102030405060708090

100

BDR

()

PU Enriched

(f)

Figure 9 Continued

18 Security and Communication Networks

0102030405060708090

100

F1-S

core

()

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(g)

0102030405060708090

100

MC

C (

)

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(h)

Figure 9 Evaluation metrics comparison between models on data configurations (a) Accuracy (b) Hit Rate (c) Miss Rate (d) False AlarmRate (e) Cost (f) Bayesian Detection Rate (g) F1-Score (h) Matthews Correlation Coefficient

of static and dynamic masquerade detection techniques Weused DNN and LSTM-RNN models to perform a staticmasquerade detection task on data configurations with staticnumeric features The DNN as well as LSTM-RNN issupported with a PSO-based algorithm that optimized theirhyperparameters to maximize accuracy on the given trainingand test sets of a user Giving the importance to the formerfact our DNN and LSTM-RNN models output masqueradedetection outcomes as better as they can reach for everyuser in the particular data configuration Accordingly at theresult their performance will be enhanced significantly onthat particular data configuration Also this enhancement oftheir performance will be affected by the structure of dataconfiguration which differs from one to another AnywayLSTM-RNN performed better than DNN in terms of allused evaluationmetrics regarding all data configurations anddatasets This is due to the fact that LSTM-RNN model usesLSTMmemory cells instead of artificial neurons in all hiddenlayers Furthermore LSTM-RNN model has self-recurrentconnections as well as connections between memory cells inthe same hidden layer These characteristics of LSTM-RNNwhich do not exist in DNN enable LSTM-RNN to memorizethe previous states explore the dependencies between themand finally use them along with current inputs to predictthe output However the difference between the performanceof LSTM-RNN and DNN models on all data configurationsis relatively small which is between 1 and 3 for Hit andAccuracy and between 02 and 08 for FAR in all cases

Besides static masquerade detection technique we alsoused CNN model to perform a dynamic masquerade detec-tion task on data configurations Indeed CNN is used intext classification task where the input is command textfiles for each user in the particular data configuration Theobtained results show clearly that CNN outperforms both

DNN and LSTM-RNNmodels in terms of all used evaluationmetrics on all data configurations This is due to using adeep structure character-level CNN model which extractedand learned features from the input text files dynamicallyin such a way that the relation between userrsquos individualcommands can be recognized Then the extracted featuresare represented to its fully connected layers to train itself tobuild the userrsquos normal profile which will be used later todetect masquerade attacks efficiently This dynamic processand self-learning capabilities form the major objectives andstrengths of such deep learningmodelsTheusedCNNmodelrecorded very good results on all data configurations suchas Accuracy between 8375 and 9884 Hit between 8164and 9874 and FAR between 019 and 15 Therefore inour study dynamicmasquerade detection is better than staticmasquerade detection technique This gives the impressionthat dynamic masquerade detection technique is the bestchoice for masquerade detection regarding UNIX commandline-based datasets due to the fact that these datasets are orig-inally textual datasets and converting them to static numericdatasetsmay lose them a lot of sufficient information Despitethat DNN and LSTM-RNN also performed very well inmasquerade detection on data configurations

Regarding BDR and BTNR metrics all the used mod-els got high values in most cases which means that theconfidence of the predicated behaviors of these models isvery high Indeed this depends on the structure of theexamined data configuration that is BDR will increase asmuch as both the number of masquerader blocks in thetest set of the examined data configuration and Hit valuesare larger In contrast BTNR will increase as much as thenumber of normal blocks in the test set of the examined dataconfiguration is larger and FAR value is smaller Althoughall the used data configurations are imbalanced all the used

Security and Communication Networks 19

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

0102030405060708090

100

Accura

cy (

)

(a)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

0102030405060708090

100

Hit

()

PU DatasetAll Datasets

(b)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

0

02

04

06

08

1

12

14

16

18

FAR

()

(c)

0102030405060708090

100

BDR

()

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

(d)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

0102030405060708090

100

F1-S

core

()

(e)

0102030405060708090

100

MC

C (

)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

(f)

Figure 10 Evaluation metrics comparison for the average performance of the models on datasets (a) Accuracy (b) Hit Rate (c) False AlarmRate (d) Bayesian Detection Rate (e) F1-Score (f) Matthews Correlation Coefficient

20 Security and Communication Networks

Table 8 The results of statistical tests

MeasurementsFriedman Test Wilcoxon Test

p1 p2 p3FS FC W P-value W P-value W P-value

TP 12 7 0 00025 0 00025 0 00025FP 12 7 0 00025 0 00025 0 00025TN 12 7 0 00025 0 00025 0 00025FN 12 7 0 00025 0 00025 0 00025

deep learning models got high g-mean percentages for alldata configurations The same thing happened with MCCmetric where all the used deep learningmodels recorded highpercentages for all data configurations except PU Truncated

In order to give a further inspection of the results inTable 7 we also performed two well-known statistical testsnamely Friedman and Wilcoxon tests The Friedman testis a nonparametric test for finding the differences betweenthree or more repeated samples (or treatments) [70] Non-parametric test means that the test does not assume yourdata comes from a particular distribution In our casewe have three repeated treatments (k=3) each for one ofthe used deep learning models and six subjects (N=6) inevery treatment that each subject of them is related toone of the used data configurations The null hypothesis ofFriedman test is that the treatments all have identical effectsMathematically we can reject the null hypothesis if and onlyif the calculated Friedman test statistic (FS) is larger thanthe critical Friedman test value (FC) On the other handWilcoxon test which refers to either the Rank Sum test orthe Signed Rank test is a nonparametric test that comparestwo paired groups (k=2) [71] The test essentially calculatesthe difference between each set of pairs and analyzes thesedifferences In our case we have six subjects (N=6) in everytreatment and three paired groups namely p1=(DNNLSTM-RNN) p2=(DNNCNN) and p3=(LSTM-RNNCNN) Thenull hypothesis of Wilcoxon test is the median differenceof zero Mathematically we can reject the null hypothesisif and only if the probability (P value) which is computedusing Wilcoxon test statistic (W) is smaller than a particularsignificance level (120572) We selected 120572=005 because it isfairly common Table 8 presents the results of Friedman andWilcoxon tests for TP FP TN and FN measurements

It can be noticed from Table 8 that we can reject thenull hypothesis of the Friedman test in all cases becauseFSgtFC This means that the scores of the used deep learningmodels for each measurement are different One way tointerpret the results of Friedman test visually is to plot theCritical Difference Diagram [72] Figure 11 shows the CriticalDifference Diagram of the used deep learning models Inour study we got the Critical Difference (CD) value equal to13533 Also from Table 8 we can reject the null hypothesisof the Wilcoxon test because P value is smaller than alphalevel (00025lt005) in all casesThus we can say that we havestatically significant evidence that medians of every pairedgroup are different Finally the reason of the same results ofall measurements is thatmodels in order (CNN LSTM-RNN

CD

1

2

3DNN CNN

LSTM-RNN

3 2 1

Figure 11TheCriticalDifferenceDiagramof the used deep learningmodels on all data configurations

DNN) have higher scores in TP and TN as well as smallerscores in FP and FN on all data configurations

Figures 12(a) 12(b) 12(c) 12(d) and 12(e) show com-parison between the performance of traditional machinelearning models and the used deep learning models in termsof Hit and FAR percentages for SEA SEA 1v49 GreenbergTruncated Greenberg Enriched and PU Enriched respec-tively We obtained Hit and FAR percentages for traditionalmachine learning models from Table 1 as the best resultsin the literature The difference between the performanceof traditional machine learning and the used deep learningmodels can be perceived obviously DNN LSTM-RNN andCNN outperformed all traditional machine learning modelsdue to a PSO-based algorithm for hyperparameters selectionused with DNN and LSTM-RNN as well as the featurelearning mechanism used with CNN In addition to thatdeep learning models have deeper structures than traditionalmachine learning models The used deep learning modelsincreased considerably Hit percentages by 2-10 as well asdecreased FAR percentages by 1-10 from those in traditionalmachine learning models in most cases

62 ROC Curves Analysis Receiver operating characteristic(ROC) curve is a plot of values of the True Positive Rate (orHit) on Y-axis against the False Positive Rate (or FAR) onX-axis It is widely used for evaluating the performance ofdifferent machine learning algorithms and to show the trade-off between them in order to choose the optimal classifierThe diagonal line of ROC is the reference line which meansthat 50 of performance is achieved The top-left cornerof ROC means the best performance with 100 Figure 13depicts ROC curves of the average performance of each of theused deep learning models over all data configurations ROC

Figure 12: Models performance comparison (Hit and FAR percentages) for each data configuration: (a) SEA; (b) SEA 1v49; (c) Greenberg Truncated; (d) Greenberg Enriched; (e) PU Enriched. Each panel compares the best traditional machine learning models from the literature (Naive Bayes, Conditional Naive Bayes, HMM, tree-based, and SVM, as applicable) with DNN, LSTM-RNN, and CNN.

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, achieve the most effective masquerade detection performance over all data configurations. Nevertheless, all three deep learning models still have a pretty good fit.
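The sketch below, assuming scikit-learn and matplotlib, illustrates how such a ROC curve and its area under the curve (the AUC measure discussed next) can be obtained from one model's scored test blocks; y_true and y_score are synthetic placeholders, not our experimental outputs.

```python
# A minimal sketch of plotting a ROC curve and computing its AUC for one
# model. Labels and scores are synthetic placeholders for illustration.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)          # 1 = masquerader block
y_score = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=1000), 0, 1)

fpr, tpr, _ = roc_curve(y_true, y_score)        # FAR and Hit per threshold
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.4f})")
plt.plot([0, 1], [0, 1], "k--", label="reference line (50%)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```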

The area under the curve (AUC) is also considered a well-known measure for comparing various ROC curves quantitatively [73]. The AUC value of a ROC curve lies between 0 and 1, and the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13. We can notice clearly that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.

Table 9: AUC values of ROC curves of the used models.

Model       AUC
DNN         0.9246
LSTM-RNN    0.9385
CNN         0.9617

Figure 13: ROC curves (True Positive Rate against False Positive Rate) of the average performance of the used models (CNN, LSTM-RNN, and DNN) over all data configurations.

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Even though various research studies have focused on masquerade detection for more than one decade, deep studies in that field utilizing deep learning models are seldom. In this paper, we presented an extensive empirical study for masquerade detection using DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most commonly used in the literature. In addition, we implemented six different data configurations from these datasets. The masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed by using the DNN and LSTM-RNN models, which are applied on data configurations with static numeric features, while the dynamic approach is performed by using the CNN model, which extracts features from the user's command text files dynamically. In order to solve the problem of hyperparameters selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models performed well in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static masquerade detection. Overall, the results analyses proved the effectiveness of all used models in masquerade detection in such a way that they increased Accuracy and Hit percentages as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models seem to be highly promising tools that can be used in the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon a request to its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.
[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.
[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion detection: detecting masquerade attacks using UNIX command lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.
[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.
[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159–164, Wuhan, Hubei Province, China, November 2016.
[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long short term memory recurrent neural network classifier for intrusion detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.
[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58–74, 2001.
[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488–495, Springer, Berlin, Germany, 2003.
[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219–228, USA, June 2002.
[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10–19, Melbourne, FL, USA, 2003.
[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48–62, Springer, Berlin, Germany, 2003.
[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329–340, Berlin, Germany, 2004.
[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37–53, Berlin, Germany, 2006.
[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625–628, China, October 2006.
[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160–168, 2005.
[16] S. Greenberg, "Using Unix: collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.
[17] R. A. Maxion, "Masquerade detection using enriched command lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5–14, USA, June 2003.
[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676–3679, China, September 2007.
[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366–380, Baltimore, USA, 1997.
[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388–397, Berlin, Germany, 2005.
[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13–22, 2014.
[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based acceleration of deep neural networks using high level method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824–827, Poland, November 2015.
[23] J. S. Bergstra, R. Bardenet, Y. Bengio et al., "Algorithms for hyper-parameter optimization," Advances in Neural Information Processing Systems, pp. 2546–2554, 2011.
[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012.
[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951–2959, USA, December 2012.
[26] O. Ahmed Abdalla, A. Osman Elfaki, and Y. Mohammed AlMurtadha, "Optimizing the multilayer feed-forward artificial neural networks architecture and training parameters using genetic algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42–48, 2014.
[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep multi-task learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141–146, Belgium, April 2016.
[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: a new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69–74, China, August 2016.
[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.
[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving deep neural networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659–1666, Spain, June 2017.
[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481–488, New York, NY, USA, July 2017.
[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864–1871, New York, NY, USA, July 2017.
[33] J. Nalepa and P. R. Lorenzo, "Convergence analysis of PSO for hyper-parameter selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284–295, Springer, 2017.
[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.
[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39–43, Nagoya, Japan, October 1995.
[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405–440, 2009.
[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591–600, Springer, Berlin, Germany, 1998.
[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 IEEE Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945–1950, 1999.
[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671–1676, Honolulu, HI, USA, May 2002.
[40] M. Clerc and J. Kennedy, "The particle swarm: explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58–73, 2002.
[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954–21961, 2017.
[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157–166, 1994.
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[45] X. Zhang and Y. LeCun, "Text understanding from scratch," https://arxiv.org/abs/1502.01710v5.
[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163–222, Springer, Boston, MA, USA, 2012.
[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.
[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.
[49] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103–112, Denver, Colorado, 2015.
[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," Advances in Neural Information Processing Systems, pp. 649–657, 2015.
[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: hierarchical deep learning for text classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364–371, Cancun, Mexico, December 2017.
[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent convolutional neural networks for text classification," AAAI, vol. 333, pp. 2267–2273, 2015.
[53] P. Liu, X. Qiu, and X. Huang, "Recurrent neural network for text classification with multi-task learning," https://arxiv.org/abs/1605.05101v1.
[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489, June 2016.
[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.
[56] S. Albelwi and A. Mahmood, "A framework for designing the architectures of deep convolutional neural networks," Entropy, vol. 19, no. 6, p. 242, 2017.
[57] "Python," https://www.python.org.
[58] "NumPy," http://www.numpy.org.
[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.
[60] "Keras," https://keras.io.
[61] M. Abadi, A. Agarwal, P. Barham et al., "TensorFlow: large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.
[62] "TensorFlow," https://www.tensorflow.org.
[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.
[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.
[65] S. Axelsson, "The base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1–7, November 1999.
[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389–398, Springer, 2009.
[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179–186, Nashville, USA, 1997.
[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.
[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262–274, PWS-Kent, Boston, 1990.
[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313–320, 2004.


Page 16: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

16 Security and Communication Networks

Table 7 The results of our experiments

Dataset DataConfiguration Model Evaluation Metrics ()

Accuracy Precision Recall F1-Score Hit Miss FAR Cost BDR BTNR g-mean MCC

SEA Dataset

SEADNN 9808 7626 8485 8033 8485 1515 128 2283 7625 9926 9152 7945

LSTM-RNN 9852 8230 8658 8439 8658 1342 090 1883 8233 9934 9263 8364CNN 9884 8777 8701 8739 8701 1299 059 1651 8772 9937 93 8678

SEA 1v49DNN 9654 9998 9643 9817 9643 357 048 647 9998 5204 9796 7064

LSTM-RNN 9786 9998 9779 9887 9779 221 038 448 9998 6370 987 7874CNN 9878 9999 9874 9936 9874 126 019 240 9999 7551 9927 8622

GreenbergDataset

GreenbergTruncated

DNN 9397 9223 8067 8606 8067 1933 204 3157 9222 9441 8889 8253LSTM-RNN 9472 9488 8153 8770 8153 1847 132 2639 9487 9468 897 8476

CNN 9543 9616 8353 8940 8353 1647 10 2247 9616 9524 9094 8686

GreenbergEnriched

DNN 9757 9692 9240 9461 9240 760 088 1288 9692 9775 957 9308LSTM-RNN 9798 9757 9360 9554 9360 640 070 1060 9756 9810 9641 9428

CNN 9860 9855 9533 9692 9533 467 042 719 9855 9861 9743 9603

PU Dataset

PU TruncatedDNN 810 9959 7861 8786 7861 2139 225 3489 9959 3949 8766 5463

LSTM-RNN 8219 9969 7989 8870 7989 2011 175 3061 9968 4110 886 5646CNN 8375 9974 8164 8979 8164 1836 150 2736 9973 4338 8968 5879

PU EnrichedDNN 9044 9984 8921 9423 8921 1079 10 1679 9984 5672 9398 7064

LSTM-RNN 9131 9988 9018 9478 9018 982 075 1432 9988 5908 9461 7261CNN 9375 9992 9293 9630 9293 707 050 1007 9992 6678 9616 7852

the corresponding values of SEA data configuration also Hitvalues are dramatically increased in SEA 1v49 for all modelsby 10-14 from those that are in the SEA data configurationThis is due to the structure of SEA 1v49 data configurationwhere there are 122500 masquerader blocks in the test setof SEA 1v49 comparing to only 231 blocks in the SEA dataconfiguration Moreover the FAR values of SEA 1v49 for allmodels are significantly lower than the corresponding valuesof SEA data configuration Hence regarding SEA datasetSEA 1v49 is better to use in masquerade detection than SEAdata configuration

On the other hand as we expected Greenberg Enrichedenhanced noticeably the performance of all models in termsof all used evaluation metrics from the corresponding val-ues of Greenberg Truncated data configuration This canbe explained by the fact that Greenberg Enriched dataconfiguration has more information about user behaviorincluding command name parameters aliases and flagscomparing to only command name in Greenberg TruncatedTherefore regarding Greenberg dataset Greenberg Enricheddata configuration is better to use in masquerade detectionthan Greenberg Truncated The same thing happened inPU dataset where its PU Enriched data configuration hasbetter results regarding all models than PU Truncated Thusregarding PU dataset PU Enriched is better to use inmasquerade detection than PUTruncated data configuration

Actually PU Truncated and Greenberg Truncated dataconfigurations simulate SEA and SEA 1v49 data configu-rations where only command name is considered Despitethat regarding all used models SEA 1v49 recorded thebest results among the other truncated data configurationsOn the other hand PU Enriched and Greenberg Enriched

are considered as enriched data configurations where extrainformation about users is taken into consideration Due tothat enriched data configurations help models to build userrsquosbehavior profile more accurately than with truncated dataconfigurations Regarding all models the results associatedwithGreenberg Enriched especially in terms ofAccuracyHitand FAR values are better than of the corresponding valuesof PU Enriched data configuration because PU dataset isvery small masquerade detection dataset with a relatively lownumber of users (only 8 users) Also this reason can explainwhy a few previous works used PU dataset in masqueradedetection However data configurations can be sort for allused models from the upper to lower according to theobtained results as follows SEA 1v49 Greenberg EnrichedPU Enriched SEA Greenberg Truncated and PUTruncated

For the sake of brevity and space limitation we selected asubset of the used performancemetrics inTable 7 to be shownvisually in Figures 9 and 10 Figures 9(a) 9(b) 9(c) 9(d)9(e) 9(f) 9(g) and 9(h) showAccuracy HitMiss FAR CostBDR F1-Score and MCC percentages of the used modelsin each data configuration respectively Figures 10(a) 10(b)10(c) 10(d) 10(e) and 10(f) show Accuracy Hit FAR BDRF1-Score and MCC percentages for the average performanceof the used models on datasets respectively Figures 9 and10 can give us a visual comparison of the performance of theused deep learning models for each data configuration anddataset as well as in all datasets

By taking an inspective look to Figures 9 and 10 we cannotice the stability of deep learning models in such a waythat they are enhancing masquerade detection from a dataconfiguration to another in a consistent pattern To explainthat we will discuss the obtained results from the perspective

Security and Communication Networks 17

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnriched

PU EnrichedPU Truncated

0102030405060708090

100

Accura

cy (

)

(a)

0102030405060708090

100

Hit

()

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(b)

0

5

10

15

20

25

Miss

()

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(c)

002040608

112141618

22224

FAR

()

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(d)

0

5

10

15

20

25

30

35

Cos

t (

)

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(e)

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU Truncated

0102030405060708090

100

BDR

()

PU Enriched

(f)

Figure 9 Continued

18 Security and Communication Networks

0102030405060708090

100

F1-S

core

()

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(g)

0102030405060708090

100

MC

C (

)

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(h)

Figure 9 Evaluation metrics comparison between models on data configurations (a) Accuracy (b) Hit Rate (c) Miss Rate (d) False AlarmRate (e) Cost (f) Bayesian Detection Rate (g) F1-Score (h) Matthews Correlation Coefficient

of static and dynamic masquerade detection techniques Weused DNN and LSTM-RNN models to perform a staticmasquerade detection task on data configurations with staticnumeric features The DNN as well as LSTM-RNN issupported with a PSO-based algorithm that optimized theirhyperparameters to maximize accuracy on the given trainingand test sets of a user Giving the importance to the formerfact our DNN and LSTM-RNN models output masqueradedetection outcomes as better as they can reach for everyuser in the particular data configuration Accordingly at theresult their performance will be enhanced significantly onthat particular data configuration Also this enhancement oftheir performance will be affected by the structure of dataconfiguration which differs from one to another AnywayLSTM-RNN performed better than DNN in terms of allused evaluationmetrics regarding all data configurations anddatasets This is due to the fact that LSTM-RNN model usesLSTMmemory cells instead of artificial neurons in all hiddenlayers Furthermore LSTM-RNN model has self-recurrentconnections as well as connections between memory cells inthe same hidden layer These characteristics of LSTM-RNNwhich do not exist in DNN enable LSTM-RNN to memorizethe previous states explore the dependencies between themand finally use them along with current inputs to predictthe output However the difference between the performanceof LSTM-RNN and DNN models on all data configurationsis relatively small which is between 1 and 3 for Hit andAccuracy and between 02 and 08 for FAR in all cases

Besides static masquerade detection technique we alsoused CNN model to perform a dynamic masquerade detec-tion task on data configurations Indeed CNN is used intext classification task where the input is command textfiles for each user in the particular data configuration Theobtained results show clearly that CNN outperforms both

DNN and LSTM-RNNmodels in terms of all used evaluationmetrics on all data configurations This is due to using adeep structure character-level CNN model which extractedand learned features from the input text files dynamicallyin such a way that the relation between userrsquos individualcommands can be recognized Then the extracted featuresare represented to its fully connected layers to train itself tobuild the userrsquos normal profile which will be used later todetect masquerade attacks efficiently This dynamic processand self-learning capabilities form the major objectives andstrengths of such deep learningmodelsTheusedCNNmodelrecorded very good results on all data configurations suchas Accuracy between 8375 and 9884 Hit between 8164and 9874 and FAR between 019 and 15 Therefore inour study dynamicmasquerade detection is better than staticmasquerade detection technique This gives the impressionthat dynamic masquerade detection technique is the bestchoice for masquerade detection regarding UNIX commandline-based datasets due to the fact that these datasets are orig-inally textual datasets and converting them to static numericdatasetsmay lose them a lot of sufficient information Despitethat DNN and LSTM-RNN also performed very well inmasquerade detection on data configurations

Regarding BDR and BTNR metrics all the used mod-els got high values in most cases which means that theconfidence of the predicated behaviors of these models isvery high Indeed this depends on the structure of theexamined data configuration that is BDR will increase asmuch as both the number of masquerader blocks in thetest set of the examined data configuration and Hit valuesare larger In contrast BTNR will increase as much as thenumber of normal blocks in the test set of the examined dataconfiguration is larger and FAR value is smaller Althoughall the used data configurations are imbalanced all the used

Security and Communication Networks 19

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

0102030405060708090

100

Accura

cy (

)

(a)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

0102030405060708090

100

Hit

()

PU DatasetAll Datasets

(b)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

0

02

04

06

08

1

12

14

16

18

FAR

()

(c)

0102030405060708090

100

BDR

()

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

(d)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

0102030405060708090

100

F1-S

core

()

(e)

0102030405060708090

100

MC

C (

)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

(f)

Figure 10 Evaluation metrics comparison for the average performance of the models on datasets (a) Accuracy (b) Hit Rate (c) False AlarmRate (d) Bayesian Detection Rate (e) F1-Score (f) Matthews Correlation Coefficient

20 Security and Communication Networks

Table 8 The results of statistical tests

MeasurementsFriedman Test Wilcoxon Test

p1 p2 p3FS FC W P-value W P-value W P-value

TP 12 7 0 00025 0 00025 0 00025FP 12 7 0 00025 0 00025 0 00025TN 12 7 0 00025 0 00025 0 00025FN 12 7 0 00025 0 00025 0 00025

deep learning models got high g-mean percentages for alldata configurations The same thing happened with MCCmetric where all the used deep learningmodels recorded highpercentages for all data configurations except PU Truncated

In order to give a further inspection of the results inTable 7 we also performed two well-known statistical testsnamely Friedman and Wilcoxon tests The Friedman testis a nonparametric test for finding the differences betweenthree or more repeated samples (or treatments) [70] Non-parametric test means that the test does not assume yourdata comes from a particular distribution In our casewe have three repeated treatments (k=3) each for one ofthe used deep learning models and six subjects (N=6) inevery treatment that each subject of them is related toone of the used data configurations The null hypothesis ofFriedman test is that the treatments all have identical effectsMathematically we can reject the null hypothesis if and onlyif the calculated Friedman test statistic (FS) is larger thanthe critical Friedman test value (FC) On the other handWilcoxon test which refers to either the Rank Sum test orthe Signed Rank test is a nonparametric test that comparestwo paired groups (k=2) [71] The test essentially calculatesthe difference between each set of pairs and analyzes thesedifferences In our case we have six subjects (N=6) in everytreatment and three paired groups namely p1=(DNNLSTM-RNN) p2=(DNNCNN) and p3=(LSTM-RNNCNN) Thenull hypothesis of Wilcoxon test is the median differenceof zero Mathematically we can reject the null hypothesisif and only if the probability (P value) which is computedusing Wilcoxon test statistic (W) is smaller than a particularsignificance level (120572) We selected 120572=005 because it isfairly common Table 8 presents the results of Friedman andWilcoxon tests for TP FP TN and FN measurements

It can be noticed from Table 8 that we can reject thenull hypothesis of the Friedman test in all cases becauseFSgtFC This means that the scores of the used deep learningmodels for each measurement are different One way tointerpret the results of Friedman test visually is to plot theCritical Difference Diagram [72] Figure 11 shows the CriticalDifference Diagram of the used deep learning models Inour study we got the Critical Difference (CD) value equal to13533 Also from Table 8 we can reject the null hypothesisof the Wilcoxon test because P value is smaller than alphalevel (00025lt005) in all casesThus we can say that we havestatically significant evidence that medians of every pairedgroup are different Finally the reason of the same results ofall measurements is thatmodels in order (CNN LSTM-RNN

CD

1

2

3DNN CNN

LSTM-RNN

3 2 1

Figure 11TheCriticalDifferenceDiagramof the used deep learningmodels on all data configurations

DNN) have higher scores in TP and TN as well as smallerscores in FP and FN on all data configurations

Figures 12(a) 12(b) 12(c) 12(d) and 12(e) show com-parison between the performance of traditional machinelearning models and the used deep learning models in termsof Hit and FAR percentages for SEA SEA 1v49 GreenbergTruncated Greenberg Enriched and PU Enriched respec-tively We obtained Hit and FAR percentages for traditionalmachine learning models from Table 1 as the best resultsin the literature The difference between the performanceof traditional machine learning and the used deep learningmodels can be perceived obviously DNN LSTM-RNN andCNN outperformed all traditional machine learning modelsdue to a PSO-based algorithm for hyperparameters selectionused with DNN and LSTM-RNN as well as the featurelearning mechanism used with CNN In addition to thatdeep learning models have deeper structures than traditionalmachine learning models The used deep learning modelsincreased considerably Hit percentages by 2-10 as well asdecreased FAR percentages by 1-10 from those in traditionalmachine learning models in most cases

62 ROC Curves Analysis Receiver operating characteristic(ROC) curve is a plot of values of the True Positive Rate (orHit) on Y-axis against the False Positive Rate (or FAR) onX-axis It is widely used for evaluating the performance ofdifferent machine learning algorithms and to show the trade-off between them in order to choose the optimal classifierThe diagonal line of ROC is the reference line which meansthat 50 of performance is achieved The top-left cornerof ROC means the best performance with 100 Figure 13depicts ROC curves of the average performance of each of theused deep learning models over all data configurations ROC

Security and Communication Networks 21

0102030405060708090

100(

)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

ModelsHitFAR

HMM

(a)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

HitFAR

0102030405060708090

100

()

(b)

Naive Bayes SVM DNN LSTM-RNN CNNModels

HitFAR

0102030405060708090

100

()

(c)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

0102030405060708090

100

()

HitFAR

(d)

Tree-based ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

0102030405060708090

100

()

HitFAR

(e)

Figure 12 Models performance comparison for each data configuration (a) SEA (b) SEA 1v49 (c) Greenberg Truncated (d) GreenbergEnriched (e) PU Enriched

curves show that models in the order CNN LSTM-RNN andDNN have the effective masquerade detection performanceover all data configurations However all these three deeplearning models still have a pretty good fit

The area under curve (AUC) is also considered as a well-known measure to compare quantitatively between variousROC curves [73] AUC value of a ROC curve should bebetween 0 and 1The ideal classifierwill haveAUCvalue equalto 1 Table 9 presents AUC values of ROC curves of the usedthree deep learning models which are plotted in Figure 13

We can notice clearly that all these models have very highAUC values that almost reach 1 which means that theireffectiveness to detect masqueraders on UNIX commandline-based datasets is highly acceptable

7 Conclusions

Masquerade detection is one of the most important issues incomputer security field Even various research studies havebeen focused on masquerade detection for more than one

22 Security and Communication Networks

Table 9 AUC values of ROC curves of the used models

Model AUCDNN 09246LSTM-RNN 09385CNN 09617

CNNLSTM-RNNDNN

0

01

02

03

04

05

06

07

08

09

1

True

Pos

itive

Rat

e

01 02 03 04 05 06 07 08 09 10False Positive Rate

Figure 13 ROC curves of the average performance of the usedmodels over all data configurations

decade but the existence of a deep study in that field utilizingdeep learning models is seldom In this paper we presentedan extensive empirical study for masquerade detection usingDNN LSTM-RNN and CNN models We utilized threeUNIX command line datasets which are the mostly used inthe literature In addition to that we implemented six differ-ent data configurations from these datasets The masqueradedetection on these data configurations is carried out usingtwo approaches the first is static and the second is dynamicMeanwhile the static approach is performed by using DNNand LSTM-RNN models which are applied on data con-figurations with static numeric features and the dynamicapproach is performed by using CNN model that extractedfeatures from userrsquos command text files dynamically In orderto solve the problem of hyperparameters selection as well asto gain high performance we also proposed a PSO-basedalgorithm for optimizing hyperparameters of DNN Theproposed PSO-based algorithm seeks to maximize accuracyand is used in the experiments of bothDNN and LSTM-RNNmodels Moreover we employed twelve well-known evalu-ation metrics and statistical tests to assess the performanceof the used models and analyzed the experimental resultsusing performance analysis and ROC curves analysis Ourresults show that the used models performed achievement

in masquerade detection regarding the used datasets andoutperformed the performance of all traditional machinelearning methods in terms of all evaluation metrics Fur-thermore CNN model is superior to both DNN and LSTM-RNN models on all data configurations which means thatthe dynamic masquerade detection is better than the staticone However the results analyses proved the effectiveness ofall used models in masquerade detection in such a way thatthey increased Accuracy and Hit as well as decreased FARpercentages by 1-10 Finally according to the results we canargue that deep learning models seem to be highly promisingtools that can be used in the cyber security field For futurework we recommended extending this work by studying theeffectiveness of deep learning models in intrusion detectionfor both network and cloud environments

Data Availability

Thedata used to support the findings of this study are free andpublicly available on Internet UNIX command line-baseddatasets which are used in this study can be downloaded fromthe following websites SEA dataset at httpwwwschonlaunetintrusionhtml Greenberg dataset upon a request fromits owner at httpsaulcpscucalgarycapmwikiphpHCIRe-sourcesUnixDataReadme and PU dataset at httpkddicsuciedu

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] L Huang A study on masquerade detection 2010 A study onmasquerade detection

[2] M Bertacchini and P Fierens ldquoA survey on masqueraderdetection approachesrdquo in Proceedings of V Congreso Iberoamer-icano de Seguridad Informatica Universidad de la Republica deUruguay 2008

[3] R F Erbacher S Prakash C L Claar and J Couraud ldquoIntru-sion Detection Detecting Masquerade Attacks Using UNIXCommand Linesrdquo in Proceedings of the 6th Annual SecurityConference Las Vegas NV USA April 2007

[4] L Deng ldquoA tutorial survey of architectures algorithms andapplications for deep learningrdquo in APSIPA Transactions onSignal and Information Processing vol 3 Cambridge UniversityPress 2014

[5] X Du Y Cai S Wang and L Zhang ldquoOverview of deeplearningrdquo in Proceedings of the 2016 31st Youth Academic AnnualConference of Chinese Association of Automation (YAC) pp 159ndash164 Wuhan Hubei Province China November 2016

[6] J Kim J Kim H L T Thu and H Kim ldquoLong Short TermMemory Recurrent Neural Network Classifier for IntrusionDetectionrdquo in Proceedings of the 3rd International Conferenceon Platform Technology and Service PlatCon 2016 Republic ofKorea February 2016

[7] M Schonlau W DuMouchel W-H Ju A F Karr M Theusand Y Vardi ldquoComputer intrusion detecting masqueradesrdquoStatistical Science vol 16 no 1 pp 58ndash74 2001

Security and Communication Networks 23

[8] T Okamoto T Watanabe and Y Ishida ldquoTowards an immu-nity-based system for detecting masqueradersrdquo in Proceed-ings of the International Conference on Knowledge-Based andIntelligent Information and Engineering Systems pp 488ndash495Springer Berlin Germany 2003

[9] R A Maxion and T N Townsend ldquoMasquerade detectionusing truncated command linesrdquo in Proceedings of the 2002International Conference on Dependable Systems and NetworksDNS 2002 pp 219ndash228 USA June 2002

[10] K Wang and S J Stolfo ldquoOne-class training for masqueradedetectionrdquo in Proceedings of the Workshop on Data Mining forComputer Security pp 10ndash19 Melbourne FL USA 2003

[11] K H Yung ldquoUsing feedback to improve masquerade detec-tionrdquo in Proceedings of the International Conference on AppliedCryptography andNetwork Security pp 48ndash62 Springer BerlinGermany 2003

[12] K H Yung ldquoUsing self-consistent naive-bayes to detect mas-queradesrdquo in Proceedings of the Pacific-Asia Conference onKnowledge Discovery and Data Mining pp 329ndash340 BerlinGermany 2004

[13] L Chen andM Aritsugi ldquoAn svm-based masquerade detectionmethod with online update using co-occurrence matrixrdquo inProceedings of the International Conference on Detection ofIntrusions and Malware and Vulnerability pp 37ndash53 BerlinGermany 2006

[14] Z Li L Zhitang and L Bin ldquoMasquerade detection systembased on correlation eigenmatrix and support vector machinerdquoin Proceedings of the 2006 International Conference on Com-putational Intelligence and Security ICCIAS 2006 pp 625ndash628China October 2006

[15] H-S Kim and S-D Cha ldquoEmpirical evaluation of SVM-basedmasquerade detection using UNIX commandsrdquo Computers ampSecurity vol 24 no 2 pp 160ndash168 2005

[16] S Greenberg ldquoUsing Unix Collected traces of 168 usersrdquo8833345 Department of Computer Science University ofCalgary Calgary Canada 1988

[17] R A Maxion ldquoMasquerade Detection Using Enriched Com-mand Linesrdquo in Proceedings of the 2003 International Conferenceon Dependable Systems and Networks pp 5ndash14 USA June 2003

[18] M Yang H Zhang and H J Cai ldquoMasquerade detection usingstring kernelsrdquo in Proceedings of the 2007 International Con-ference on Wireless Communications Networking and MobileComputing WiCOM 2007 pp 3676ndash3679 China September2007

[19] T Lane and C E Brodley ldquoAn application of machine learningto anomaly detectionrdquo in Proceedings of the 20th NationalInformation Systems Security Conference vol 377 pp 366ndash380Baltimore USA 1997

[20] M Gebski and R K Wong ldquoIntrusion detection via analy-sis and modelling of user commandsrdquo in Proceedings of theInternational Conference on Data Warehousing and KnowledgeDiscovery pp 388ndash397 Berlin Germany 2005

[21] K V Reddy and N Pushpalatha ldquoConditional naive-bayes todetect masqueradesrdquo International Journal of Computer Scienceand Engineering (IJCSE) vol 3 no 3 pp 13ndash22 2014

[22] L Liu J Luo X Deng and S Li ldquoFPGA-based Accelerationof Deep Neural Networks Using High Level Methodrdquo inProceedings of the 10th International Conference on P2P ParallelGrid Cloud and Internet Computing 3PGCIC 2015 pp 824ndash827Poland November 2015

[23] J S Bergstra R Bardenet Y Bengio et al ldquoAlgorithms forHyper-Parameter optimizationrdquo Advances in Neural Informa-tion Processing Systems pp 2546ndash2554 2011

[24] J Bergstra and Y Bengio ldquoRandom search for hyper-parameteroptimizationrdquo Journal of Machine Learning Research vol 13 pp281ndash305 2012

[25] J Snoek H Larochelle and R P Adams ldquoPractical Bayesianoptimization of machine learning algorithmsrdquo in Proceedings ofthe 26th Annual Conference on Neural Information ProcessingSystems 2012 NIPS 2012 pp 2951ndash2959 USA December 2012

[26] O AhmedAbdalla A Osman Elfaki and Y MohammedAlMurtadha ldquoOptimizing the Multilayer Feed-Forward Arti-ficial Neural Networks Architecture and Training Parametersusing Genetic Algorithmrdquo International Journal of ComputerApplications vol 96 no 10 pp 42ndash48 2014

[27] S Belharbi R Herault C Chatelain and S Adam ldquoDeepMulti-Task Learning with evolving weightsrdquo in Proceedings ofthe 24th European Symposium on Artificial Neural NetworksComputational Intelligence andMachine Learning ESANN 2016pp 141ndash146 Belgium April 2016

[28] S S Tirumala S Ali and C P Ramesh ldquoEvolving deep neuralnetworks A new prospectrdquo in Proceedings of the 12th Inter-national Conference on Natural Computation Fuzzy Systemsand Knowledge Discovery ICNC-FSKD 2016 pp 69ndash74 ChinaAugust 2016

[29] O E David and I Greental ldquoGenetic algorithms for evolvingdeep neural networksrdquo in Proceedings of the 16th Genetic andEvolutionary Computation Conference GECCO 2014 pp 1451-1452 Canada July 2014

[30] A Martin F Fuentes-Hurtado V Naranjo and D CamacholdquoEvolving Deep Neural Networks architectures for Androidmalware classificationrdquo in Proceedings of the 2017 IEEE Congresson Evolutionary Computation CEC 2017 pp 1659ndash1666 SpainJune 2017

[31] P R Lorenzo J Nalepa M Kawulok L S Ramos and JR Pastor ldquoParticle swarm optimization for hyper-parameterselection in deep neural networksrdquo in Proceedings of the 2017Genetic and Evolutionary Computation Conference GECCO2017 pp 481ndash488 New York NY USA July 2017

[32] P R Lorenzo J Nalepa L S Ramos and J R Pastor ldquoHyper-parameter selection in deep neural networks using parallelparticle swarm optimizationrdquo in Proceedings of the 2017 Geneticand Evolutionary Computation Conference Companion GECCO2017 pp 1864ndash1871 New York NY USA July 2017

[33] J Nalepa and P R Lorenzo ldquoConvergence Analysis of PSO forHyper-Parameter Selectionrdquo in Proceedings of the InternationalConference on P2P Parallel Grid Cloud and Internet Comput-ing pp 284ndash295 Springer 2017

[34] F Ye andW Du ldquoParticle swarm optimization-based automaticparameter selection for deep neural networks and its applica-tions in large-scale and high-dimensional datardquo PLoS ONE vol12 no 12 p e0188746 2017

[35] R C Eberhart and J Kennedy ldquoA new optimizer using particleswarm theoryrdquo in Proceedings of the 6th International Sympo-sium on Micro Machine and Human Science (MHS rsquo95) pp 39ndash43 Nagoya Japan October 1995

[36] H J Escalante M Montes and L E Sucar ldquoParticle swarmmodel selectionrdquo Journal of Machine Learning Research vol 10pp 405ndash440 2009

24 Security and Communication Networks

[37] Y Shi and R C Eberhart ldquoParameter selection in particleswarm optimizationrdquo in Proceedings of the International con-ference on evolutionary programming pp 591ndash600 SpringerBerlin Germany 1998

[38] Y Shi and R C Eberhart ldquoEmprirical study of particle swarmoptimizationrdquo in Proceedings of the 1999 congress on IEEEEvolutionary computation CEC 9 vol 3 pp 1945ndash1950 1999

[39] J Kennedy and R Mendes ldquoPopulation structure and particleswarm performancerdquo in Proceedings of the Congress on Evolu-tionary Computation pp 1671ndash1676 Honolulu HI USA May2002

[40] M Clerc and J Kennedy ldquoThe particle swarm-explosion sta-bility and convergence in a multidimensional complex spacerdquoIEEE Transactions on Evolutionary Computation vol 6 no 1pp 58ndash73 2002

[41] C Yin Y Zhu J Fei and X He ldquoADeep Learning Approach forIntrusion Detection Using Recurrent Neural Networksrdquo IEEEAccess vol 5 pp 21954ndash21961 2017

[42] Y Bengio P Simard and P Frasconi ldquoLearning long-termdependencies with gradient descent is difficultrdquo IEEE Transac-tions on Neural Networks and Learning Systems vol 5 no 2 pp157ndash166 1994

[43] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural Computation vol 9 no 8 pp 1735ndash1780 1997

[44] Y LeCun L Bottou Y Bengio and P Haffner ldquoGradient-basedlearning applied to document recognitionrdquo Proceedings of theIEEE vol 86 no 11 pp 2278ndash2323 1998

[45] X Zhang and Y LeCun ldquoText Understanding from scratchrdquohttpsarxivorgabs150201710v5

[46] C C Aggarwal and C Zhai ldquoA survey of text classificationalgorithmsrdquo inMining Text Data pp 163ndash222 Springer BostonMA USA 2012

[47] Y Zhang and B Wallace ldquoA sensitivity analysis of (and prac-titionersrsquo guide to) convolutional neural networks for sentenceclassificationrdquo httpsarxivorgabs151003820

[48] Y Kim ldquoConvolutional neural networks for sentence classifica-tionrdquo httpsarxivorgabs14085882

[49] R Johnson and T Zhang ldquoEffective Use of Word Order forText Categorization with Convolutional Neural Networksrdquo inProceedings of the 2015 Conference of the North AmericanChapter of theAssociation for Computational LinguisticsHumanLanguage Technologies pp 103ndash112 Denver Colorado 2015

[50] X Zhang J Zhao and Y LeCun ldquoCharacter-level Convolu-tional Networks for Text Classificationrdquo Advances in NeuralInformation Processing Systems pp 649ndash657 2015

[51] K Kowsari D E Brown M Heidarysafa K Jafari MeimandiM S Gerber and L E Barnes ldquoHDLTex Hierarchical DeepLearning for Text Classificationrdquo in Proceedings of the 2017 16thIEEE International Conference on Machine Learning and Appli-cations (ICMLA) pp 364ndash371 CancunMexicoDecember 2017

[52] S Lai L Xu K Liu and J Zhao ldquoRecurrent ConvolutionalNeural Networks for Text Classificationrdquo AAAI vol 333 pp2267ndash2273 2015

[53] P Liu XQiu andXHuang ldquoRecurrentNeurlNetwork for TextClassification with Multi-Task Learningrdquo httpsarxivorgabs160505101v1

[54] Z Yang D Yang C Dyer X He A Smola and E HovyldquoHierarchical attention networks for document classificationrdquoin Proceedings of the 2016 Conference of the North AmericanChapter of the Association for Computational Linguistics pp1480ndash1489 Human Language Technologies June 2016

[55] J D Prusa and T M Khoshgoftaar ldquoImproving deep neuralnetwork design with new text data representationsrdquo Journal ofBig Data vol 4 no 1 2017

[56] S Albelwi and A Mahmood ldquoA Framework for Designingthe Architectures of Deep Convolutional Neural NetworksrdquoEntropy vol 19 no 6 p 242 2017

[57] ldquoPythonrdquo httpswwwpythonorg[58] ldquoNumPyrdquo httpwwwnumpyorg[59] F Chollet ldquoKerasrdquo 2015 httpsgithubcomfcholletkeras[60] ldquoKerasrdquo httpskerasio[61] M Abadi A Agarwal P Barham et al ldquoTensorflow Large-

scale machine learning on heterogeneous distributed systemsrdquohttpsarxivorgabs160304467v2

[62] TensorFlow httpswwwtensorfloworg[63] ldquoCUDA- Compute Unified Device Architecturerdquo httpsdevel-

opernvidiacomabout-cuda[64] ldquocuDNN- The NVIDIA CUDA Deep Neural Network libraryrdquo

httpsdevelopernvidiacomcudnn[65] S Axelsson ldquoBase-rate fallacy and its implications for the

difficulty of intrusion detectionrdquo in Proceedings of the 1999 6thACM Conference on Computer and Communications Security(ACM CCS) pp 1ndash7 November 1999

[66] Z Zeng and J Gao ldquoImproving SVM classification withimbalance data setrdquo in International Conference on NeuralInformation Processing pp 389ndash398 Springer 2009

[67] M Kubat and S Matwin ldquoAddressing the curse of imbalancedtraining sets one-sided selectionrdquo in Proceedings of the 14thInternational Conference on Machine Learning (ICML vol 97pp 179ndash186 Nashville USA 1997

[68] S Boughorbel F Jarray and M El-Anbari ldquoOptimal classifierfor imbalanced data using Matthews Correlation Coefficientmetricrdquo PLoS ONE vol 12 no 6 p e0177678 2017

[69] B W Matthews ldquoComparison of the predicted and observedsecondary structure of T4 phage lysozymerdquo Biochimica etBiophysica Acta (BBA) - Protein Structure vol 405 no 2 pp442ndash451 1975

[70] WWDaniel ldquoFriedman two-way analysis of variance by ranksrdquoin Applied Nonparametric Statistics pp 262ndash274 PWS-KentBoston 1990

[71] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics Bulletin JSTOR vol 1 no 6 pp 80ndash83 1945

[72] J Demsar ldquoStatistical comparisons of classifiers over multipledata setsrdquo Journal of Machine Learning Research vol 7 pp 1ndash302006

[73] C Cortes andM Mohri ldquoAUC optimization vs error rate min-imizationrdquo Advances in Neural Information Processing Systemspp 313ndash320 2004

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 17: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

[Figure 9: Evaluation metrics comparison between models on data configurations. (a) Accuracy; (b) Hit Rate; (c) Miss Rate; (d) False Alarm Rate; (e) Cost; (f) Bayesian Detection Rate; (g) F1-Score; (h) Matthews Correlation Coefficient. Bar charts comparing DNN, LSTM-RNN, and CNN on SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, PU Truncated, and PU Enriched; plot residue omitted.]

of static and dynamic masquerade detection techniques. We used the DNN and LSTM-RNN models to perform the static masquerade detection task on the data configurations with static numeric features. Both DNN and LSTM-RNN are supported by a PSO-based algorithm that optimizes their hyperparameters to maximize accuracy on the given training and test sets of each user. Owing to this, our DNN and LSTM-RNN models produce the best masquerade detection outcomes they can reach for every user in a particular data configuration, so their performance on that configuration is enhanced significantly. This enhancement is also affected by the structure of the data configuration, which differs from one configuration to another. Overall, LSTM-RNN performed better than DNN in terms of all used evaluation metrics on all data configurations and datasets. This is due to the fact that the LSTM-RNN model uses LSTM memory cells instead of artificial neurons in all hidden layers. Furthermore, the LSTM-RNN model has self-recurrent connections as well as connections between memory cells in the same hidden layer. These characteristics, which do not exist in DNN, enable LSTM-RNN to memorize the previous states, explore the dependencies between them, and finally use them along with the current inputs to predict the output. However, the difference between the performance of the LSTM-RNN and DNN models on all data configurations is relatively small: between 1% and 3% for Hit and Accuracy, and between 0.2% and 0.8% for FAR in all cases.
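To make the selection procedure concrete, the following is a minimal sketch of a global-best PSO loop of the kind described here. It is an illustrative sketch, not the paper's exact algorithm: the fitness callback build_and_score (which would train a DNN or LSTM-RNN with the candidate hyperparameters and return its test accuracy), the swarm size, and the inertia/acceleration constants are all assumed values.

import random

def pso_search(bounds, build_and_score, n_particles=10, n_iters=20,
               w=0.7, c1=1.5, c2=1.5):
    # bounds: list of (low, high) ranges, one per hyperparameter
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [build_and_score(p) for p in pos]
    gbest = pbest[max(range(n_particles), key=lambda i: pbest_score[i])][:]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            score = build_and_score(pos[i])   # e.g., accuracy on the user's test set
            if score > pbest_score[i]:        # maximize accuracy
                pbest[i], pbest_score[i] = pos[i][:], score
        gbest = pbest[max(range(n_particles), key=lambda i: pbest_score[i])][:]
    return gbest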

Besides the static masquerade detection technique, we also used the CNN model to perform a dynamic masquerade detection task on the data configurations. Indeed, CNN is used in a text classification task where the input is the command text file of each user in the particular data configuration. The obtained results clearly show that CNN outperforms both

the DNN and LSTM-RNN models in terms of all used evaluation metrics on all data configurations. This is due to using a deep-structure character-level CNN model, which extracted and learned features from the input text files dynamically in such a way that the relations between a user's individual commands can be recognized. The extracted features are then passed to its fully connected layers, which train themselves to build the user's normal profile, used later to detect masquerade attacks efficiently. This dynamic process and these self-learning capabilities form the major objectives and strengths of such deep learning models. The used CNN model recorded very good results on all data configurations: Accuracy between 83.75% and 98.84%, Hit between 81.64% and 98.74%, and FAR between 0.19% and 1.5%. Therefore, in our study, dynamic masquerade detection is better than the static technique. This suggests that dynamic masquerade detection is the best choice for UNIX command line-based datasets, since these datasets are originally textual, and converting them to static numeric datasets may cause a significant loss of information. Despite that, DNN and LSTM-RNN also performed very well in masquerade detection on the data configurations.
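As an illustration of this dynamic approach, below is a minimal character-level CNN text classifier sketched in Keras (the framework cited in the references). The vocabulary size, sequence length, and layer sizes are assumptions for illustration, not the paper's exact architecture.

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

vocab_size, seq_len = 70, 1024  # assumed character vocabulary and block length
model = Sequential([
    Embedding(vocab_size, 16, input_length=seq_len),  # one vector per character
    Conv1D(64, 7, activation='relu'),   # learns n-gram-like command patterns
    MaxPooling1D(3),
    Conv1D(64, 3, activation='relu'),
    MaxPooling1D(3),
    Flatten(),
    Dense(128, activation='relu'),      # fully connected "profile" layers
    Dense(1, activation='sigmoid'),     # normal (0) vs. masquerader (1) block
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])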

Regarding the BDR and BTNR metrics, all the used models got high values in most cases, which means that the confidence in the predicted behaviors of these models is very high. Indeed, this depends on the structure of the examined data configuration: BDR increases as the number of masquerader blocks in the test set of the examined data configuration and the Hit value grow larger. In contrast, BTNR increases as the number of normal blocks in the test set of the examined data configuration grows larger and the FAR value gets smaller. Although all the used data configurations are imbalanced, all the used deep learning models got high g-mean percentages for all data configurations.
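For reference, a small helper can compute the Bayesian rates discussed above from confusion-matrix counts. This is a sketch assuming the standard empirical forms BDR = TP/(TP+FP) and BTNR = TN/(TN+FN) (in the spirit of Axelsson's base-rate analysis [65]), not code from the study.

def bayesian_rates(tp, fp, tn, fn):
    # BDR: probability that a raised alarm is a real masquerade (TP / (TP + FP))
    bdr = tp / (tp + fp) if (tp + fp) else 0.0
    # BTNR: probability that a "normal" verdict is really normal (TN / (TN + FN))
    btnr = tn / (tn + fn) if (tn + fn) else 0.0
    return bdr, btnr

# With few masquerader blocks in a test set, even a small FP count drags BDR down,
# matching the imbalance effect described above.
print(bayesian_rates(tp=80, fp=20, tn=4900, fn=20))  # -> (0.8, 0.9959...)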

[Figure 10: Evaluation metrics comparison for the average performance of the models on datasets. (a) Accuracy; (b) Hit Rate; (c) False Alarm Rate; (d) Bayesian Detection Rate; (e) F1-Score; (f) Matthews Correlation Coefficient. Bar charts comparing DNN, LSTM-RNN, and CNN on the SEA, Greenberg, and PU datasets and on all datasets; plot residue omitted.]

Table 8: The results of statistical tests.

                 Friedman Test     Wilcoxon Test
                                   p1               p2               p3
Measurement      FS      FC        W    P-value     W    P-value     W    P-value
TP               12      7         0    0.0025      0    0.0025      0    0.0025
FP               12      7         0    0.0025      0    0.0025      0    0.0025
TN               12      7         0    0.0025      0    0.0025      0    0.0025
FN               12      7         0    0.0025      0    0.0025      0    0.0025

The same thing happened with the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume the data come from a particular distribution. In our case, we have three repeated treatments (k=3), one for each of the used deep learning models, and six subjects (N=6) in every treatment, where each subject is related to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k=2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N=6) in every treatment and three paired groups, namely, p1 = (DNN, LSTM-RNN), p2 = (DNN, CNN), and p3 = (LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is that the median difference is zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α = 0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
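As a concrete sketch of this testing procedure, the snippet below runs both tests with SciPy. The score arrays are hypothetical stand-ins for one measurement (e.g., TP) across the six data configurations (N=6) for each model.

from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-configuration scores, one value per data configuration (N=6).
dnn      = [90, 85, 70, 88, 60, 75]
lstm_rnn = [92, 87, 73, 90, 63, 78]
cnn      = [95, 91, 80, 94, 70, 84]
fs, p = friedmanchisquare(dnn, lstm_rnn, cnn)   # k=3 repeated treatments
print("Friedman: FS=%.3f, p=%.4f" % (fs, p))
pairs = {"p1 (DNN, LSTM-RNN)": (dnn, lstm_rnn),
         "p2 (DNN, CNN)":      (dnn, cnn),
         "p3 (LSTM-RNN, CNN)": (lstm_rnn, cnn)}
for name, (a, b) in pairs.items():
    w, pv = wilcoxon(a, b)                      # signed-rank test per paired group
    print("%s: W=%d, P-value=%.4f" % (name, w, pv))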

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS > FC. This means that the scores of the used deep learning models differ for each measurement. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025 < 0.05) in all cases. Thus, we have statistically significant evidence that the medians of every paired group are different. Finally, the reason all measurements give the same results is that the models, in the order CNN, LSTM-RNN, DNN, have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

[Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations; diagram residue omitted.]


Figures 12(a), 12(b), 12(c), 12(d), and 12(e) show a comparison between the performance of traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages of the traditional machine learning models from Table 1 as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models is readily apparent. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models, owing to the PSO-based hyperparameter selection algorithm used with DNN and LSTM-RNN as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. The used deep learning models considerably increased Hit percentages by 2-10% and decreased FAR percentages by 1-10% relative to traditional machine learning models in most cases.

6.2. ROC Curves Analysis. A receiver operating characteristic (ROC) curve is a plot of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of a ROC plot is the reference line, meaning that 50% of performance is achieved; the top-left corner of the ROC represents the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.
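As a brief illustration, a ROC curve and its AUC (used later in this subsection) can be produced as follows with scikit-learn; y_true and y_score are hypothetical stand-ins for a model's test labels and predicted scores.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                    # 1 = masquerader block
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.7]  # model scores
fpr, tpr, _ = roc_curve(y_true, y_score)              # FAR vs. Hit per threshold
print("AUC =", roc_auc_score(y_true, y_score))
plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="reference (AUC = 0.5)")  # diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()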

[Figure 12: Models performance comparison for each data configuration. (a) SEA; (b) SEA 1v49; (c) Greenberg Truncated; (d) Greenberg Enriched; (e) PU Enriched. Bars plot Hit and FAR percentages for Naive Bayes, Conditional Naive Bayes, HMM, Tree-based, and SVM baselines against DNN, LSTM-RNN, and CNN; plot residue omitted.]

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, deliver effective masquerade detection performance over all data configurations. Moreover, all three deep learning models still achieve a good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing various ROC curves quantitatively [73]. The AUC value of a ROC curve lies between 0 and 1; the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13.

We can clearly notice that all these models have very high AUC values, almost reaching 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Although various research studies have focused on masquerade detection for more than a decade, deep studies of this field utilizing deep learning models are seldom.

Table 9: AUC values of ROC curves of the used models.

Model        AUC
DNN          0.9246
LSTM-RNN     0.9385
CNN          0.9617

[Figure 13: ROC curves of the average performance of the used models over all data configurations; True Positive Rate vs. False Positive Rate for DNN, LSTM-RNN, and CNN; plot residue omitted.]

In this paper, we presented an extensive empirical study of masquerade detection using DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most used in the literature. In addition, we implemented six different data configurations from these datasets. Masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, while the dynamic approach is performed using the CNN model, which extracts features from users' command text files dynamically. In order to solve the problem of hyperparameter selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curves analysis. Our results show that the used models performed well

in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than the static approach. The results analyses proved the effectiveness of all used models in masquerade detection in such a way that they increased Accuracy and Hit as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models are highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon request from its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.

[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.

[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.

[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.

[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159–164, Wuhan, Hubei Province, China, November 2016.

[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.

[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58–74, 2001.

[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488–495, Springer, Berlin, Germany, 2003.

[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219–228, USA, June 2002.

[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10–19, Melbourne, FL, USA, 2003.

[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48–62, Springer, Berlin, Germany, 2003.

[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329–340, Berlin, Germany, 2004.

[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37–53, Berlin, Germany, 2006.

[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625–628, China, October 2006.

[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160–168, 2005.

[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.

[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5–14, USA, June 2003.

[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676–3679, China, September 2007.

[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366–380, Baltimore, USA, 1997.

[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388–397, Berlin, Germany, 2005.

[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13–22, 2014.

[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824–827, Poland, November 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio, et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546–2554, 2011.

[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951–2959, USA, December 2012.

[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42–48, 2014.

[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141–146, Belgium, April 2016.

[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69–74, China, August 2016.

[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451–1452, Canada, July 2014.

[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659–1666, Spain, June 2017.

[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481–488, New York, NY, USA, July 2017.

[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864–1871, New York, NY, USA, July 2017.

[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284–295, Springer, 2017.

[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.

[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39–43, Nagoya, Japan, October 1995.

[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405–440, 2009.

[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591–600, Springer, Berlin, Germany, 1998.

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945–1950, 1999.

[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671–1676, Honolulu, HI, USA, May 2002.

[40] M. Clerc and J. Kennedy, "The particle swarm: explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58–73, 2002.

[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954–21961, 2017.

[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157–166, 1994.

[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.

[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.

[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163–222, Springer, Boston, MA, USA, 2012.

[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.

[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.

[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103–112, Denver, Colorado, 2015.

[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649–657, 2015.

[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364–371, Cancun, Mexico, December 2017.

[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267–2273, 2015.

[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.

[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489, June 2016.

[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.

[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.

[57] "Python," https://www.python.org.

[58] "NumPy," http://www.numpy.org.

[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.

[60] "Keras," https://keras.io.

[61] M. Abadi, A. Agarwal, P. Barham, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.

[62] TensorFlow, https://www.tensorflow.org.

[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.

[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.

[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1–7, November 1999.

[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389–398, Springer, 2009.

[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179–186, Nashville, USA, 1997.

[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.

[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.

[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262–274, PWS-Kent, Boston, 1990.

[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.

[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.

[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313–320, 2004.

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 18: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

18 Security and Communication Networks

0102030405060708090

100

F1-S

core

()

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(g)

0102030405060708090

100

MC

C (

)

DNN LSTM-RNN CNNDeep Learning Models

SEASEA 1v49GreenbergTruncated

GreenbergEnrichedPU TruncatedPU Enriched

(h)

Figure 9 Evaluation metrics comparison between models on data configurations (a) Accuracy (b) Hit Rate (c) Miss Rate (d) False AlarmRate (e) Cost (f) Bayesian Detection Rate (g) F1-Score (h) Matthews Correlation Coefficient

of static and dynamic masquerade detection techniques Weused DNN and LSTM-RNN models to perform a staticmasquerade detection task on data configurations with staticnumeric features The DNN as well as LSTM-RNN issupported with a PSO-based algorithm that optimized theirhyperparameters to maximize accuracy on the given trainingand test sets of a user Giving the importance to the formerfact our DNN and LSTM-RNN models output masqueradedetection outcomes as better as they can reach for everyuser in the particular data configuration Accordingly at theresult their performance will be enhanced significantly onthat particular data configuration Also this enhancement oftheir performance will be affected by the structure of dataconfiguration which differs from one to another AnywayLSTM-RNN performed better than DNN in terms of allused evaluationmetrics regarding all data configurations anddatasets This is due to the fact that LSTM-RNN model usesLSTMmemory cells instead of artificial neurons in all hiddenlayers Furthermore LSTM-RNN model has self-recurrentconnections as well as connections between memory cells inthe same hidden layer These characteristics of LSTM-RNNwhich do not exist in DNN enable LSTM-RNN to memorizethe previous states explore the dependencies between themand finally use them along with current inputs to predictthe output However the difference between the performanceof LSTM-RNN and DNN models on all data configurationsis relatively small which is between 1 and 3 for Hit andAccuracy and between 02 and 08 for FAR in all cases

Besides static masquerade detection technique we alsoused CNN model to perform a dynamic masquerade detec-tion task on data configurations Indeed CNN is used intext classification task where the input is command textfiles for each user in the particular data configuration Theobtained results show clearly that CNN outperforms both

DNN and LSTM-RNNmodels in terms of all used evaluationmetrics on all data configurations This is due to using adeep structure character-level CNN model which extractedand learned features from the input text files dynamicallyin such a way that the relation between userrsquos individualcommands can be recognized Then the extracted featuresare represented to its fully connected layers to train itself tobuild the userrsquos normal profile which will be used later todetect masquerade attacks efficiently This dynamic processand self-learning capabilities form the major objectives andstrengths of such deep learningmodelsTheusedCNNmodelrecorded very good results on all data configurations suchas Accuracy between 8375 and 9884 Hit between 8164and 9874 and FAR between 019 and 15 Therefore inour study dynamicmasquerade detection is better than staticmasquerade detection technique This gives the impressionthat dynamic masquerade detection technique is the bestchoice for masquerade detection regarding UNIX commandline-based datasets due to the fact that these datasets are orig-inally textual datasets and converting them to static numericdatasetsmay lose them a lot of sufficient information Despitethat DNN and LSTM-RNN also performed very well inmasquerade detection on data configurations

Regarding BDR and BTNR metrics all the used mod-els got high values in most cases which means that theconfidence of the predicated behaviors of these models isvery high Indeed this depends on the structure of theexamined data configuration that is BDR will increase asmuch as both the number of masquerader blocks in thetest set of the examined data configuration and Hit valuesare larger In contrast BTNR will increase as much as thenumber of normal blocks in the test set of the examined dataconfiguration is larger and FAR value is smaller Althoughall the used data configurations are imbalanced all the used

Security and Communication Networks 19

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

0102030405060708090

100

Accura

cy (

)

(a)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

0102030405060708090

100

Hit

()

PU DatasetAll Datasets

(b)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

0

02

04

06

08

1

12

14

16

18

FAR

()

(c)

0102030405060708090

100

BDR

()

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

(d)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

0102030405060708090

100

F1-S

core

()

(e)

0102030405060708090

100

MC

C (

)

DNN LSTM-RNN CNNDeep Learning Models

SEA DatasetGreenberg Dataset

PU DatasetAll Datasets

(f)

Figure 10 Evaluation metrics comparison for the average performance of the models on datasets (a) Accuracy (b) Hit Rate (c) False AlarmRate (d) Bayesian Detection Rate (e) F1-Score (f) Matthews Correlation Coefficient

20 Security and Communication Networks

Table 8 The results of statistical tests

MeasurementsFriedman Test Wilcoxon Test

p1 p2 p3FS FC W P-value W P-value W P-value

TP 12 7 0 00025 0 00025 0 00025FP 12 7 0 00025 0 00025 0 00025TN 12 7 0 00025 0 00025 0 00025FN 12 7 0 00025 0 00025 0 00025

deep learning models got high g-mean percentages for alldata configurations The same thing happened with MCCmetric where all the used deep learningmodels recorded highpercentages for all data configurations except PU Truncated

In order to give a further inspection of the results inTable 7 we also performed two well-known statistical testsnamely Friedman and Wilcoxon tests The Friedman testis a nonparametric test for finding the differences betweenthree or more repeated samples (or treatments) [70] Non-parametric test means that the test does not assume yourdata comes from a particular distribution In our casewe have three repeated treatments (k=3) each for one ofthe used deep learning models and six subjects (N=6) inevery treatment that each subject of them is related toone of the used data configurations The null hypothesis ofFriedman test is that the treatments all have identical effectsMathematically we can reject the null hypothesis if and onlyif the calculated Friedman test statistic (FS) is larger thanthe critical Friedman test value (FC) On the other handWilcoxon test which refers to either the Rank Sum test orthe Signed Rank test is a nonparametric test that comparestwo paired groups (k=2) [71] The test essentially calculatesthe difference between each set of pairs and analyzes thesedifferences In our case we have six subjects (N=6) in everytreatment and three paired groups namely p1=(DNNLSTM-RNN) p2=(DNNCNN) and p3=(LSTM-RNNCNN) Thenull hypothesis of Wilcoxon test is the median differenceof zero Mathematically we can reject the null hypothesisif and only if the probability (P value) which is computedusing Wilcoxon test statistic (W) is smaller than a particularsignificance level (120572) We selected 120572=005 because it isfairly common Table 8 presents the results of Friedman andWilcoxon tests for TP FP TN and FN measurements

It can be noticed from Table 8 that we can reject thenull hypothesis of the Friedman test in all cases becauseFSgtFC This means that the scores of the used deep learningmodels for each measurement are different One way tointerpret the results of Friedman test visually is to plot theCritical Difference Diagram [72] Figure 11 shows the CriticalDifference Diagram of the used deep learning models Inour study we got the Critical Difference (CD) value equal to13533 Also from Table 8 we can reject the null hypothesisof the Wilcoxon test because P value is smaller than alphalevel (00025lt005) in all casesThus we can say that we havestatically significant evidence that medians of every pairedgroup are different Finally the reason of the same results ofall measurements is thatmodels in order (CNN LSTM-RNN

CD

1

2

3DNN CNN

LSTM-RNN

3 2 1

Figure 11TheCriticalDifferenceDiagramof the used deep learningmodels on all data configurations

DNN) have higher scores in TP and TN as well as smallerscores in FP and FN on all data configurations

Figures 12(a) 12(b) 12(c) 12(d) and 12(e) show com-parison between the performance of traditional machinelearning models and the used deep learning models in termsof Hit and FAR percentages for SEA SEA 1v49 GreenbergTruncated Greenberg Enriched and PU Enriched respec-tively We obtained Hit and FAR percentages for traditionalmachine learning models from Table 1 as the best resultsin the literature The difference between the performanceof traditional machine learning and the used deep learningmodels can be perceived obviously DNN LSTM-RNN andCNN outperformed all traditional machine learning modelsdue to a PSO-based algorithm for hyperparameters selectionused with DNN and LSTM-RNN as well as the featurelearning mechanism used with CNN In addition to thatdeep learning models have deeper structures than traditionalmachine learning models The used deep learning modelsincreased considerably Hit percentages by 2-10 as well asdecreased FAR percentages by 1-10 from those in traditionalmachine learning models in most cases

62 ROC Curves Analysis Receiver operating characteristic(ROC) curve is a plot of values of the True Positive Rate (orHit) on Y-axis against the False Positive Rate (or FAR) onX-axis It is widely used for evaluating the performance ofdifferent machine learning algorithms and to show the trade-off between them in order to choose the optimal classifierThe diagonal line of ROC is the reference line which meansthat 50 of performance is achieved The top-left cornerof ROC means the best performance with 100 Figure 13depicts ROC curves of the average performance of each of theused deep learning models over all data configurations ROC

Security and Communication Networks 21

0102030405060708090

100(

)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

ModelsHitFAR

HMM

(a)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

HitFAR

0102030405060708090

100

()

(b)

Naive Bayes SVM DNN LSTM-RNN CNNModels

HitFAR

0102030405060708090

100

()

(c)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

0102030405060708090

100

()

HitFAR

(d)

Tree-based ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

0102030405060708090

100

()

HitFAR

(e)

Figure 12 Models performance comparison for each data configuration (a) SEA (b) SEA 1v49 (c) Greenberg Truncated (d) GreenbergEnriched (e) PU Enriched

curves show that models in the order CNN LSTM-RNN andDNN have the effective masquerade detection performanceover all data configurations However all these three deeplearning models still have a pretty good fit

The area under curve (AUC) is also considered as a well-known measure to compare quantitatively between variousROC curves [73] AUC value of a ROC curve should bebetween 0 and 1The ideal classifierwill haveAUCvalue equalto 1 Table 9 presents AUC values of ROC curves of the usedthree deep learning models which are plotted in Figure 13

We can notice clearly that all these models have very highAUC values that almost reach 1 which means that theireffectiveness to detect masqueraders on UNIX commandline-based datasets is highly acceptable

7 Conclusions

Masquerade detection is one of the most important issues incomputer security field Even various research studies havebeen focused on masquerade detection for more than one

22 Security and Communication Networks

Table 9 AUC values of ROC curves of the used models

Model AUCDNN 09246LSTM-RNN 09385CNN 09617

CNNLSTM-RNNDNN

0

01

02

03

04

05

06

07

08

09

1

True

Pos

itive

Rat

e

01 02 03 04 05 06 07 08 09 10False Positive Rate

Figure 13 ROC curves of the average performance of the usedmodels over all data configurations

decade but the existence of a deep study in that field utilizingdeep learning models is seldom In this paper we presentedan extensive empirical study for masquerade detection usingDNN LSTM-RNN and CNN models We utilized threeUNIX command line datasets which are the mostly used inthe literature In addition to that we implemented six differ-ent data configurations from these datasets The masqueradedetection on these data configurations is carried out usingtwo approaches the first is static and the second is dynamicMeanwhile the static approach is performed by using DNNand LSTM-RNN models which are applied on data con-figurations with static numeric features and the dynamicapproach is performed by using CNN model that extractedfeatures from userrsquos command text files dynamically In orderto solve the problem of hyperparameters selection as well asto gain high performance we also proposed a PSO-basedalgorithm for optimizing hyperparameters of DNN Theproposed PSO-based algorithm seeks to maximize accuracyand is used in the experiments of bothDNN and LSTM-RNNmodels Moreover we employed twelve well-known evalu-ation metrics and statistical tests to assess the performanceof the used models and analyzed the experimental resultsusing performance analysis and ROC curves analysis Ourresults show that the used models performed achievement

in masquerade detection regarding the used datasets andoutperformed the performance of all traditional machinelearning methods in terms of all evaluation metrics Fur-thermore CNN model is superior to both DNN and LSTM-RNN models on all data configurations which means thatthe dynamic masquerade detection is better than the staticone However the results analyses proved the effectiveness ofall used models in masquerade detection in such a way thatthey increased Accuracy and Hit as well as decreased FARpercentages by 1-10 Finally according to the results we canargue that deep learning models seem to be highly promisingtools that can be used in the cyber security field For futurework we recommended extending this work by studying theeffectiveness of deep learning models in intrusion detectionfor both network and cloud environments

Data Availability

Thedata used to support the findings of this study are free andpublicly available on Internet UNIX command line-baseddatasets which are used in this study can be downloaded fromthe following websites SEA dataset at httpwwwschonlaunetintrusionhtml Greenberg dataset upon a request fromits owner at httpsaulcpscucalgarycapmwikiphpHCIRe-sourcesUnixDataReadme and PU dataset at httpkddicsuciedu

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] L Huang A study on masquerade detection 2010 A study onmasquerade detection

[2] M Bertacchini and P Fierens ldquoA survey on masqueraderdetection approachesrdquo in Proceedings of V Congreso Iberoamer-icano de Seguridad Informatica Universidad de la Republica deUruguay 2008

[3] R F Erbacher S Prakash C L Claar and J Couraud ldquoIntru-sion Detection Detecting Masquerade Attacks Using UNIXCommand Linesrdquo in Proceedings of the 6th Annual SecurityConference Las Vegas NV USA April 2007

[4] L Deng ldquoA tutorial survey of architectures algorithms andapplications for deep learningrdquo in APSIPA Transactions onSignal and Information Processing vol 3 Cambridge UniversityPress 2014

[5] X Du Y Cai S Wang and L Zhang ldquoOverview of deeplearningrdquo in Proceedings of the 2016 31st Youth Academic AnnualConference of Chinese Association of Automation (YAC) pp 159ndash164 Wuhan Hubei Province China November 2016

[6] J Kim J Kim H L T Thu and H Kim ldquoLong Short TermMemory Recurrent Neural Network Classifier for IntrusionDetectionrdquo in Proceedings of the 3rd International Conferenceon Platform Technology and Service PlatCon 2016 Republic ofKorea February 2016

[7] M Schonlau W DuMouchel W-H Ju A F Karr M Theusand Y Vardi ldquoComputer intrusion detecting masqueradesrdquoStatistical Science vol 16 no 1 pp 58ndash74 2001

Security and Communication Networks 23

[8] T Okamoto T Watanabe and Y Ishida ldquoTowards an immu-nity-based system for detecting masqueradersrdquo in Proceed-ings of the International Conference on Knowledge-Based andIntelligent Information and Engineering Systems pp 488ndash495Springer Berlin Germany 2003

[9] R A Maxion and T N Townsend ldquoMasquerade detectionusing truncated command linesrdquo in Proceedings of the 2002International Conference on Dependable Systems and NetworksDNS 2002 pp 219ndash228 USA June 2002

[10] K Wang and S J Stolfo ldquoOne-class training for masqueradedetectionrdquo in Proceedings of the Workshop on Data Mining forComputer Security pp 10ndash19 Melbourne FL USA 2003

[11] K H Yung ldquoUsing feedback to improve masquerade detec-tionrdquo in Proceedings of the International Conference on AppliedCryptography andNetwork Security pp 48ndash62 Springer BerlinGermany 2003

[12] K H Yung ldquoUsing self-consistent naive-bayes to detect mas-queradesrdquo in Proceedings of the Pacific-Asia Conference onKnowledge Discovery and Data Mining pp 329ndash340 BerlinGermany 2004

[13] L Chen andM Aritsugi ldquoAn svm-based masquerade detectionmethod with online update using co-occurrence matrixrdquo inProceedings of the International Conference on Detection ofIntrusions and Malware and Vulnerability pp 37ndash53 BerlinGermany 2006

[14] Z Li L Zhitang and L Bin ldquoMasquerade detection systembased on correlation eigenmatrix and support vector machinerdquoin Proceedings of the 2006 International Conference on Com-putational Intelligence and Security ICCIAS 2006 pp 625ndash628China October 2006

[15] H-S Kim and S-D Cha ldquoEmpirical evaluation of SVM-basedmasquerade detection using UNIX commandsrdquo Computers ampSecurity vol 24 no 2 pp 160ndash168 2005

[16] S Greenberg ldquoUsing Unix Collected traces of 168 usersrdquo8833345 Department of Computer Science University ofCalgary Calgary Canada 1988

[17] R A Maxion ldquoMasquerade Detection Using Enriched Com-mand Linesrdquo in Proceedings of the 2003 International Conferenceon Dependable Systems and Networks pp 5ndash14 USA June 2003

[18] M Yang H Zhang and H J Cai ldquoMasquerade detection usingstring kernelsrdquo in Proceedings of the 2007 International Con-ference on Wireless Communications Networking and MobileComputing WiCOM 2007 pp 3676ndash3679 China September2007

[19] T Lane and C E Brodley ldquoAn application of machine learningto anomaly detectionrdquo in Proceedings of the 20th NationalInformation Systems Security Conference vol 377 pp 366ndash380Baltimore USA 1997

[20] M Gebski and R K Wong ldquoIntrusion detection via analy-sis and modelling of user commandsrdquo in Proceedings of theInternational Conference on Data Warehousing and KnowledgeDiscovery pp 388ndash397 Berlin Germany 2005

[21] K V Reddy and N Pushpalatha ldquoConditional naive-bayes todetect masqueradesrdquo International Journal of Computer Scienceand Engineering (IJCSE) vol 3 no 3 pp 13ndash22 2014

[22] L Liu J Luo X Deng and S Li ldquoFPGA-based Accelerationof Deep Neural Networks Using High Level Methodrdquo inProceedings of the 10th International Conference on P2P ParallelGrid Cloud and Internet Computing 3PGCIC 2015 pp 824ndash827Poland November 2015

[23] J S Bergstra R Bardenet Y Bengio et al ldquoAlgorithms forHyper-Parameter optimizationrdquo Advances in Neural Informa-tion Processing Systems pp 2546ndash2554 2011

[24] J Bergstra and Y Bengio ldquoRandom search for hyper-parameteroptimizationrdquo Journal of Machine Learning Research vol 13 pp281ndash305 2012

[25] J Snoek H Larochelle and R P Adams ldquoPractical Bayesianoptimization of machine learning algorithmsrdquo in Proceedings ofthe 26th Annual Conference on Neural Information ProcessingSystems 2012 NIPS 2012 pp 2951ndash2959 USA December 2012

[26] O AhmedAbdalla A Osman Elfaki and Y MohammedAlMurtadha ldquoOptimizing the Multilayer Feed-Forward Arti-ficial Neural Networks Architecture and Training Parametersusing Genetic Algorithmrdquo International Journal of ComputerApplications vol 96 no 10 pp 42ndash48 2014

[27] S Belharbi R Herault C Chatelain and S Adam ldquoDeepMulti-Task Learning with evolving weightsrdquo in Proceedings ofthe 24th European Symposium on Artificial Neural NetworksComputational Intelligence andMachine Learning ESANN 2016pp 141ndash146 Belgium April 2016

[28] S S Tirumala S Ali and C P Ramesh ldquoEvolving deep neuralnetworks A new prospectrdquo in Proceedings of the 12th Inter-national Conference on Natural Computation Fuzzy Systemsand Knowledge Discovery ICNC-FSKD 2016 pp 69ndash74 ChinaAugust 2016

[29] O E David and I Greental ldquoGenetic algorithms for evolvingdeep neural networksrdquo in Proceedings of the 16th Genetic andEvolutionary Computation Conference GECCO 2014 pp 1451-1452 Canada July 2014

[30] A Martin F Fuentes-Hurtado V Naranjo and D CamacholdquoEvolving Deep Neural Networks architectures for Androidmalware classificationrdquo in Proceedings of the 2017 IEEE Congresson Evolutionary Computation CEC 2017 pp 1659ndash1666 SpainJune 2017

[31] P R Lorenzo J Nalepa M Kawulok L S Ramos and JR Pastor ldquoParticle swarm optimization for hyper-parameterselection in deep neural networksrdquo in Proceedings of the 2017Genetic and Evolutionary Computation Conference GECCO2017 pp 481ndash488 New York NY USA July 2017

[32] P R Lorenzo J Nalepa L S Ramos and J R Pastor ldquoHyper-parameter selection in deep neural networks using parallelparticle swarm optimizationrdquo in Proceedings of the 2017 Geneticand Evolutionary Computation Conference Companion GECCO2017 pp 1864ndash1871 New York NY USA July 2017

[33] J Nalepa and P R Lorenzo ldquoConvergence Analysis of PSO forHyper-Parameter Selectionrdquo in Proceedings of the InternationalConference on P2P Parallel Grid Cloud and Internet Comput-ing pp 284ndash295 Springer 2017

[34] F Ye andW Du ldquoParticle swarm optimization-based automaticparameter selection for deep neural networks and its applica-tions in large-scale and high-dimensional datardquo PLoS ONE vol12 no 12 p e0188746 2017

[35] R C Eberhart and J Kennedy ldquoA new optimizer using particleswarm theoryrdquo in Proceedings of the 6th International Sympo-sium on Micro Machine and Human Science (MHS rsquo95) pp 39ndash43 Nagoya Japan October 1995

[36] H J Escalante M Montes and L E Sucar ldquoParticle swarmmodel selectionrdquo Journal of Machine Learning Research vol 10pp 405ndash440 2009

24 Security and Communication Networks

[37] Y Shi and R C Eberhart ldquoParameter selection in particleswarm optimizationrdquo in Proceedings of the International con-ference on evolutionary programming pp 591ndash600 SpringerBerlin Germany 1998

[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), vol. 3, pp. 1945-1950, 1999.
[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671-1676, Honolulu, HI, USA, May 2002.
[40] M. Clerc and J. Kennedy, "The particle swarm-explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58-73, 2002.
[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.
[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157-166, 1994.
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.
[45] X. Zhang and Y. LeCun, "Text understanding from scratch," https://arxiv.org/abs/1502.01710v5.
[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163-222, Springer, Boston, MA, USA, 2012.
[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.
[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.
[49] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103-112, Denver, Colorado, 2015.
[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," Advances in Neural Information Processing Systems, pp. 649-657, 2015.
[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: hierarchical deep learning for text classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364-371, Cancun, Mexico, December 2017.
[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent convolutional neural networks for text classification," AAAI, vol. 333, pp. 2267-2273, 2015.
[53] P. Liu, X. Qiu, and X. Huang, "Recurrent neural network for text classification with multi-task learning," https://arxiv.org/abs/1605.05101v1.
[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480-1489, June 2016.
[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.
[56] S. Albelwi and A. Mahmood, "A framework for designing the architectures of deep convolutional neural networks," Entropy, vol. 19, no. 6, p. 242, 2017.
[57] "Python," https://www.python.org.
[58] "NumPy," http://www.numpy.org.
[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.
[60] "Keras," https://keras.io.
[61] M. Abadi, A. Agarwal, P. Barham et al., "TensorFlow: large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.
[62] "TensorFlow," https://www.tensorflow.org.
[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.
[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.
[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1-7, November 1999.
[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389-398, Springer, 2009.
[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179-186, Nashville, USA, 1997.
[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.
[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.
[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262-274, PWS-Kent, Boston, 1990.
[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.
[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.
[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313-320, 2004.


[Figure 10: Evaluation metrics comparison for the average performance of the models (DNN, LSTM-RNN, CNN) on the SEA, Greenberg, and PU datasets and on all datasets: (a) Accuracy, (b) Hit Rate, (c) False Alarm Rate, (d) Bayesian Detection Rate, (e) F1-Score, (f) Matthews Correlation Coefficient.]


Table 8: The results of the statistical tests.

Measurement | Friedman Test: FS, FC | Wilcoxon Test p1: W, P value | p2: W, P value | p3: W, P value
TP          | 12, 7                 | 0, 0.0025                    | 0, 0.0025      | 0, 0.0025
FP          | 12, 7                 | 0, 0.0025                    | 0, 0.0025      | 0, 0.0025
TN          | 12, 7                 | 0, 0.0025                    | 0, 0.0025      | 0, 0.0025
FN          | 12, 7                 | 0, 0.0025                    | 0, 0.0025      | 0, 0.0025

All the used deep learning models got high g-mean percentages for all data configurations. The same happened with the MCC metric, where all the used deep learning models recorded high percentages for all data configurations except PU Truncated.
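For reference, these metrics follow directly from the confusion-matrix counts. The sketch below uses hypothetical TP, FP, TN, and FN values (not numbers from our experiments) to illustrate how Hit, FAR, g-mean, and MCC are computed:

```python
# Hypothetical confusion-matrix counts, used only to illustrate the metrics.
import math

TP, FP, TN, FN = 460, 12, 9528, 40

hit   = TP / (TP + FN)                      # Hit (True Positive Rate)
far   = FP / (FP + TN)                      # False Alarm Rate
tnr   = TN / (TN + FP)                      # True Negative Rate
gmean = math.sqrt(hit * tnr)                # geometric mean of Hit and TNR
mcc   = (TP * TN - FP * FN) / math.sqrt(    # Matthews Correlation Coefficient
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"Hit={hit:.2%}  FAR={far:.2%}  g-mean={gmean:.2%}  MCC={mcc:.4f}")
```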

In order to give a further inspection of the results in Table 7, we also performed two well-known statistical tests, namely, the Friedman and Wilcoxon tests. The Friedman test is a nonparametric test for finding the differences between three or more repeated samples (or treatments) [70]. Nonparametric means that the test does not assume the data come from a particular distribution. In our case, we have three repeated treatments (k=3), one for each of the used deep learning models, and six subjects (N=6) in every treatment, each subject corresponding to one of the used data configurations. The null hypothesis of the Friedman test is that the treatments all have identical effects. Mathematically, we can reject the null hypothesis if and only if the calculated Friedman test statistic (FS) is larger than the critical Friedman test value (FC). On the other hand, the Wilcoxon test, which refers to either the Rank Sum test or the Signed Rank test, is a nonparametric test that compares two paired groups (k=2) [71]. The test essentially calculates the difference between each set of pairs and analyzes these differences. In our case, we have six subjects (N=6) in every treatment and three paired groups, namely, p1=(DNN, LSTM-RNN), p2=(DNN, CNN), and p3=(LSTM-RNN, CNN). The null hypothesis of the Wilcoxon test is that the median difference is zero. Mathematically, we can reject the null hypothesis if and only if the probability (P value), which is computed using the Wilcoxon test statistic (W), is smaller than a particular significance level (α). We selected α=0.05 because it is fairly common. Table 8 presents the results of the Friedman and Wilcoxon tests for the TP, FP, TN, and FN measurements.
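Both tests are available in standard scientific Python libraries. The following sketch, with hypothetical scores rather than our measurements, reproduces the FS=12 and W=0 pattern of Table 8 for the case where the model ordering is consistent across all six data configurations:

```python
# Minimal sketch of the two tests via SciPy; the scores are placeholders
# chosen so that CNN > LSTM-RNN > DNN on every configuration.
from scipy.stats import friedmanchisquare, wilcoxon

dnn  = [85.1, 88.3, 83.7, 86.2, 90.0, 82.4]   # hypothetical scores,
lstm = [86.5, 89.2, 84.9, 87.8, 91.1, 83.0]   # one per data configuration
cnn  = [88.0, 91.4, 86.2, 89.5, 92.8, 85.1]   # (N=6) for each model (k=3)

fs, p = friedmanchisquare(dnn, lstm, cnn)     # FS = 12.0 for a consistent
print(f"Friedman: FS={fs:.1f}, p={p:.4f}")    # ranking, as in Table 8

w, p = wilcoxon(dnn, cnn)                     # pair p2=(DNN, CNN); W = 0 when
print(f"Wilcoxon: W={w:.1f}, p={p:.4f}")      # all differences share one sign
```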

It can be noticed from Table 8 that we can reject the null hypothesis of the Friedman test in all cases because FS>FC. This means that the scores of the used deep learning models differ for each measurement. One way to interpret the results of the Friedman test visually is to plot the Critical Difference Diagram [72]. Figure 11 shows the Critical Difference Diagram of the used deep learning models; in our study, we got a Critical Difference (CD) value equal to 1.3533. Also, from Table 8, we can reject the null hypothesis of the Wilcoxon test because the P value is smaller than the alpha level (0.0025<0.05) in all cases. Thus, we have statistically significant evidence that the medians of every paired group are different. Finally, the reason all measurements give the same results is that the models, in the order CNN, LSTM-RNN, DNN, have higher scores in TP and TN as well as smaller scores in FP and FN on all data configurations.

[Figure 11: The Critical Difference Diagram of the used deep learning models on all data configurations.]
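As a quick sanity check on that value, the CD of the Nemenyi post-hoc test follows from CD = q_α · sqrt(k(k+1)/(6N)) (see Demsar [72]). The sketch below assumes the standard critical value q_0.05 ≈ 2.343 for k=3; it is illustrative rather than a reproduction of our exact computation:

```python
# Nemenyi critical difference for k=3 models over N=6 data configurations;
# q_alpha ~ 2.343 is the tabulated critical value for k=3 at alpha=0.05.
import math

k, N, q_alpha = 3, 6, 2.343
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(f"CD = {cd:.4f}")   # ~1.35, in line with the 1.3533 reported above
```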

Figures 12(a), 12(b), 12(c), 12(d), and 12(e) show a comparison between the performance of the traditional machine learning models and the used deep learning models in terms of Hit and FAR percentages for SEA, SEA 1v49, Greenberg Truncated, Greenberg Enriched, and PU Enriched, respectively. We obtained the Hit and FAR percentages for the traditional machine learning models from Table 1, as the best results in the literature. The difference between the performance of the traditional machine learning models and the used deep learning models can be perceived clearly. DNN, LSTM-RNN, and CNN outperformed all traditional machine learning models due to the PSO-based algorithm for hyperparameter selection used with DNN and LSTM-RNN, as well as the feature learning mechanism used with CNN. In addition, deep learning models have deeper structures than traditional machine learning models. In most cases, the used deep learning models considerably increased Hit percentages by 2-10% and decreased FAR percentages by 1-10% relative to the traditional machine learning models.

6.2. ROC Curves Analysis. The receiver operating characteristic (ROC) curve is a plot of the True Positive Rate (or Hit) on the Y-axis against the False Positive Rate (or FAR) on the X-axis. It is widely used for evaluating the performance of different machine learning algorithms and for showing the trade-off between them in order to choose the optimal classifier. The diagonal line of the ROC plot is the reference line, meaning that 50% of performance is achieved; the top-left corner of the ROC plot means the best performance, with 100%. Figure 13 depicts the ROC curves of the average performance of each of the used deep learning models over all data configurations.


[Figure 12: Models performance comparison for each data configuration: (a) SEA, (b) SEA 1v49, (c) Greenberg Truncated, (d) Greenberg Enriched, (e) PU Enriched. Each panel reports Hit and FAR percentages for the traditional machine learning models (e.g., Naive Bayes, Conditional Naive Bayes, SVM, HMM, tree-based) and for DNN, LSTM-RNN, and CNN.]

The ROC curves show that the models, in the order CNN, LSTM-RNN, and DNN, achieve effective masquerade detection performance over all data configurations. Nevertheless, all three deep learning models still have a pretty good fit.

The area under the curve (AUC) is also considered a well-known measure for comparing various ROC curves quantitatively [73]. The AUC value of a ROC curve lies between 0 and 1, and the ideal classifier has an AUC value equal to 1. Table 9 presents the AUC values of the ROC curves of the three used deep learning models, which are plotted in Figure 13.

We can clearly notice that all these models have very high AUC values that almost reach 1, which means that their effectiveness in detecting masqueraders on UNIX command line-based datasets is highly acceptable.
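As an illustration of how such curves and AUC values can be computed, the following sketch uses scikit-learn's roc_curve and auc functions; y_true and y_score are hypothetical stand-ins, not outputs of our models:

```python
# Illustrative ROC/AUC computation with scikit-learn; the labels and
# scores below are placeholders (1 = masquerader session).
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])

fpr, tpr, _ = roc_curve(y_true, y_score)  # FAR and Hit at each threshold
print(f"AUC = {auc(fpr, tpr):.4f}")       # the ideal classifier gives 1.0
```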

Table 9: AUC values of the ROC curves of the used models.

Model     | AUC
DNN       | 0.9246
LSTM-RNN  | 0.9385
CNN       | 0.9617

[Figure 13: ROC curves (True Positive Rate versus False Positive Rate) of the average performance of the used models over all data configurations.]

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Even though various research studies have focused on masquerade detection for more than one decade, deep studies in that field utilizing deep learning models are seldom. In this paper, we presented an extensive empirical study for masquerade detection using DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most widely used in the literature. In addition, we implemented six different data configurations from these datasets. Masquerade detection on these data configurations is carried out using two approaches: the first is static and the second is dynamic. The static approach is performed using the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, while the dynamic approach is performed using the CNN model, which extracts features from users' command text files dynamically. In order to solve the problem of hyperparameter selection as well as to gain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curve analysis. Our results show that the used models performed well in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than static detection. Overall, the results analyses proved the effectiveness of all the used models in masquerade detection, in that they increased Accuracy and Hit percentages as well as decreased FAR percentages by 1-10%. Finally, according to the results, we can argue that deep learning models are highly promising tools for the cyber security field. For future work, we recommend extending this work by studying the effectiveness of deep learning models in intrusion detection for both network and cloud environments.

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html, the Greenberg dataset upon request from its owner at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme, and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L Huang A study on masquerade detection 2010 A study onmasquerade detection

[2] M Bertacchini and P Fierens ldquoA survey on masqueraderdetection approachesrdquo in Proceedings of V Congreso Iberoamer-icano de Seguridad Informatica Universidad de la Republica deUruguay 2008

[3] R F Erbacher S Prakash C L Claar and J Couraud ldquoIntru-sion Detection Detecting Masquerade Attacks Using UNIXCommand Linesrdquo in Proceedings of the 6th Annual SecurityConference Las Vegas NV USA April 2007

[4] L Deng ldquoA tutorial survey of architectures algorithms andapplications for deep learningrdquo in APSIPA Transactions onSignal and Information Processing vol 3 Cambridge UniversityPress 2014

[5] X Du Y Cai S Wang and L Zhang ldquoOverview of deeplearningrdquo in Proceedings of the 2016 31st Youth Academic AnnualConference of Chinese Association of Automation (YAC) pp 159ndash164 Wuhan Hubei Province China November 2016

[6] J Kim J Kim H L T Thu and H Kim ldquoLong Short TermMemory Recurrent Neural Network Classifier for IntrusionDetectionrdquo in Proceedings of the 3rd International Conferenceon Platform Technology and Service PlatCon 2016 Republic ofKorea February 2016

[7] M Schonlau W DuMouchel W-H Ju A F Karr M Theusand Y Vardi ldquoComputer intrusion detecting masqueradesrdquoStatistical Science vol 16 no 1 pp 58ndash74 2001

Security and Communication Networks 23

[8] T Okamoto T Watanabe and Y Ishida ldquoTowards an immu-nity-based system for detecting masqueradersrdquo in Proceed-ings of the International Conference on Knowledge-Based andIntelligent Information and Engineering Systems pp 488ndash495Springer Berlin Germany 2003

[9] R A Maxion and T N Townsend ldquoMasquerade detectionusing truncated command linesrdquo in Proceedings of the 2002International Conference on Dependable Systems and NetworksDNS 2002 pp 219ndash228 USA June 2002

[10] K Wang and S J Stolfo ldquoOne-class training for masqueradedetectionrdquo in Proceedings of the Workshop on Data Mining forComputer Security pp 10ndash19 Melbourne FL USA 2003

[11] K H Yung ldquoUsing feedback to improve masquerade detec-tionrdquo in Proceedings of the International Conference on AppliedCryptography andNetwork Security pp 48ndash62 Springer BerlinGermany 2003

[12] K H Yung ldquoUsing self-consistent naive-bayes to detect mas-queradesrdquo in Proceedings of the Pacific-Asia Conference onKnowledge Discovery and Data Mining pp 329ndash340 BerlinGermany 2004

[13] L Chen andM Aritsugi ldquoAn svm-based masquerade detectionmethod with online update using co-occurrence matrixrdquo inProceedings of the International Conference on Detection ofIntrusions and Malware and Vulnerability pp 37ndash53 BerlinGermany 2006

[14] Z Li L Zhitang and L Bin ldquoMasquerade detection systembased on correlation eigenmatrix and support vector machinerdquoin Proceedings of the 2006 International Conference on Com-putational Intelligence and Security ICCIAS 2006 pp 625ndash628China October 2006

[15] H-S Kim and S-D Cha ldquoEmpirical evaluation of SVM-basedmasquerade detection using UNIX commandsrdquo Computers ampSecurity vol 24 no 2 pp 160ndash168 2005

[16] S Greenberg ldquoUsing Unix Collected traces of 168 usersrdquo8833345 Department of Computer Science University ofCalgary Calgary Canada 1988

[17] R A Maxion ldquoMasquerade Detection Using Enriched Com-mand Linesrdquo in Proceedings of the 2003 International Conferenceon Dependable Systems and Networks pp 5ndash14 USA June 2003

[18] M Yang H Zhang and H J Cai ldquoMasquerade detection usingstring kernelsrdquo in Proceedings of the 2007 International Con-ference on Wireless Communications Networking and MobileComputing WiCOM 2007 pp 3676ndash3679 China September2007

[19] T Lane and C E Brodley ldquoAn application of machine learningto anomaly detectionrdquo in Proceedings of the 20th NationalInformation Systems Security Conference vol 377 pp 366ndash380Baltimore USA 1997

[20] M Gebski and R K Wong ldquoIntrusion detection via analy-sis and modelling of user commandsrdquo in Proceedings of theInternational Conference on Data Warehousing and KnowledgeDiscovery pp 388ndash397 Berlin Germany 2005

[21] K V Reddy and N Pushpalatha ldquoConditional naive-bayes todetect masqueradesrdquo International Journal of Computer Scienceand Engineering (IJCSE) vol 3 no 3 pp 13ndash22 2014

[22] L Liu J Luo X Deng and S Li ldquoFPGA-based Accelerationof Deep Neural Networks Using High Level Methodrdquo inProceedings of the 10th International Conference on P2P ParallelGrid Cloud and Internet Computing 3PGCIC 2015 pp 824ndash827Poland November 2015

[23] J S Bergstra R Bardenet Y Bengio et al ldquoAlgorithms forHyper-Parameter optimizationrdquo Advances in Neural Informa-tion Processing Systems pp 2546ndash2554 2011

[24] J Bergstra and Y Bengio ldquoRandom search for hyper-parameteroptimizationrdquo Journal of Machine Learning Research vol 13 pp281ndash305 2012

[25] J Snoek H Larochelle and R P Adams ldquoPractical Bayesianoptimization of machine learning algorithmsrdquo in Proceedings ofthe 26th Annual Conference on Neural Information ProcessingSystems 2012 NIPS 2012 pp 2951ndash2959 USA December 2012

[26] O AhmedAbdalla A Osman Elfaki and Y MohammedAlMurtadha ldquoOptimizing the Multilayer Feed-Forward Arti-ficial Neural Networks Architecture and Training Parametersusing Genetic Algorithmrdquo International Journal of ComputerApplications vol 96 no 10 pp 42ndash48 2014

[27] S Belharbi R Herault C Chatelain and S Adam ldquoDeepMulti-Task Learning with evolving weightsrdquo in Proceedings ofthe 24th European Symposium on Artificial Neural NetworksComputational Intelligence andMachine Learning ESANN 2016pp 141ndash146 Belgium April 2016

[28] S S Tirumala S Ali and C P Ramesh ldquoEvolving deep neuralnetworks A new prospectrdquo in Proceedings of the 12th Inter-national Conference on Natural Computation Fuzzy Systemsand Knowledge Discovery ICNC-FSKD 2016 pp 69ndash74 ChinaAugust 2016

[29] O E David and I Greental ldquoGenetic algorithms for evolvingdeep neural networksrdquo in Proceedings of the 16th Genetic andEvolutionary Computation Conference GECCO 2014 pp 1451-1452 Canada July 2014

[30] A Martin F Fuentes-Hurtado V Naranjo and D CamacholdquoEvolving Deep Neural Networks architectures for Androidmalware classificationrdquo in Proceedings of the 2017 IEEE Congresson Evolutionary Computation CEC 2017 pp 1659ndash1666 SpainJune 2017

[31] P R Lorenzo J Nalepa M Kawulok L S Ramos and JR Pastor ldquoParticle swarm optimization for hyper-parameterselection in deep neural networksrdquo in Proceedings of the 2017Genetic and Evolutionary Computation Conference GECCO2017 pp 481ndash488 New York NY USA July 2017

[32] P R Lorenzo J Nalepa L S Ramos and J R Pastor ldquoHyper-parameter selection in deep neural networks using parallelparticle swarm optimizationrdquo in Proceedings of the 2017 Geneticand Evolutionary Computation Conference Companion GECCO2017 pp 1864ndash1871 New York NY USA July 2017

[33] J Nalepa and P R Lorenzo ldquoConvergence Analysis of PSO forHyper-Parameter Selectionrdquo in Proceedings of the InternationalConference on P2P Parallel Grid Cloud and Internet Comput-ing pp 284ndash295 Springer 2017

[34] F Ye andW Du ldquoParticle swarm optimization-based automaticparameter selection for deep neural networks and its applica-tions in large-scale and high-dimensional datardquo PLoS ONE vol12 no 12 p e0188746 2017

[35] R C Eberhart and J Kennedy ldquoA new optimizer using particleswarm theoryrdquo in Proceedings of the 6th International Sympo-sium on Micro Machine and Human Science (MHS rsquo95) pp 39ndash43 Nagoya Japan October 1995

[36] H J Escalante M Montes and L E Sucar ldquoParticle swarmmodel selectionrdquo Journal of Machine Learning Research vol 10pp 405ndash440 2009

24 Security and Communication Networks

[37] Y Shi and R C Eberhart ldquoParameter selection in particleswarm optimizationrdquo in Proceedings of the International con-ference on evolutionary programming pp 591ndash600 SpringerBerlin Germany 1998

[38] Y Shi and R C Eberhart ldquoEmprirical study of particle swarmoptimizationrdquo in Proceedings of the 1999 congress on IEEEEvolutionary computation CEC 9 vol 3 pp 1945ndash1950 1999

[39] J Kennedy and R Mendes ldquoPopulation structure and particleswarm performancerdquo in Proceedings of the Congress on Evolu-tionary Computation pp 1671ndash1676 Honolulu HI USA May2002

[40] M Clerc and J Kennedy ldquoThe particle swarm-explosion sta-bility and convergence in a multidimensional complex spacerdquoIEEE Transactions on Evolutionary Computation vol 6 no 1pp 58ndash73 2002

[41] C Yin Y Zhu J Fei and X He ldquoADeep Learning Approach forIntrusion Detection Using Recurrent Neural Networksrdquo IEEEAccess vol 5 pp 21954ndash21961 2017

[42] Y Bengio P Simard and P Frasconi ldquoLearning long-termdependencies with gradient descent is difficultrdquo IEEE Transac-tions on Neural Networks and Learning Systems vol 5 no 2 pp157ndash166 1994

[43] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural Computation vol 9 no 8 pp 1735ndash1780 1997

[44] Y LeCun L Bottou Y Bengio and P Haffner ldquoGradient-basedlearning applied to document recognitionrdquo Proceedings of theIEEE vol 86 no 11 pp 2278ndash2323 1998

[45] X Zhang and Y LeCun ldquoText Understanding from scratchrdquohttpsarxivorgabs150201710v5

[46] C C Aggarwal and C Zhai ldquoA survey of text classificationalgorithmsrdquo inMining Text Data pp 163ndash222 Springer BostonMA USA 2012

[47] Y Zhang and B Wallace ldquoA sensitivity analysis of (and prac-titionersrsquo guide to) convolutional neural networks for sentenceclassificationrdquo httpsarxivorgabs151003820

[48] Y Kim ldquoConvolutional neural networks for sentence classifica-tionrdquo httpsarxivorgabs14085882

[49] R Johnson and T Zhang ldquoEffective Use of Word Order forText Categorization with Convolutional Neural Networksrdquo inProceedings of the 2015 Conference of the North AmericanChapter of theAssociation for Computational LinguisticsHumanLanguage Technologies pp 103ndash112 Denver Colorado 2015

[50] X Zhang J Zhao and Y LeCun ldquoCharacter-level Convolu-tional Networks for Text Classificationrdquo Advances in NeuralInformation Processing Systems pp 649ndash657 2015

[51] K Kowsari D E Brown M Heidarysafa K Jafari MeimandiM S Gerber and L E Barnes ldquoHDLTex Hierarchical DeepLearning for Text Classificationrdquo in Proceedings of the 2017 16thIEEE International Conference on Machine Learning and Appli-cations (ICMLA) pp 364ndash371 CancunMexicoDecember 2017

[52] S Lai L Xu K Liu and J Zhao ldquoRecurrent ConvolutionalNeural Networks for Text Classificationrdquo AAAI vol 333 pp2267ndash2273 2015

[53] P Liu XQiu andXHuang ldquoRecurrentNeurlNetwork for TextClassification with Multi-Task Learningrdquo httpsarxivorgabs160505101v1

[54] Z Yang D Yang C Dyer X He A Smola and E HovyldquoHierarchical attention networks for document classificationrdquoin Proceedings of the 2016 Conference of the North AmericanChapter of the Association for Computational Linguistics pp1480ndash1489 Human Language Technologies June 2016

[55] J D Prusa and T M Khoshgoftaar ldquoImproving deep neuralnetwork design with new text data representationsrdquo Journal ofBig Data vol 4 no 1 2017

[56] S Albelwi and A Mahmood ldquoA Framework for Designingthe Architectures of Deep Convolutional Neural NetworksrdquoEntropy vol 19 no 6 p 242 2017

[57] ldquoPythonrdquo httpswwwpythonorg[58] ldquoNumPyrdquo httpwwwnumpyorg[59] F Chollet ldquoKerasrdquo 2015 httpsgithubcomfcholletkeras[60] ldquoKerasrdquo httpskerasio[61] M Abadi A Agarwal P Barham et al ldquoTensorflow Large-

scale machine learning on heterogeneous distributed systemsrdquohttpsarxivorgabs160304467v2

[62] TensorFlow httpswwwtensorfloworg[63] ldquoCUDA- Compute Unified Device Architecturerdquo httpsdevel-

opernvidiacomabout-cuda[64] ldquocuDNN- The NVIDIA CUDA Deep Neural Network libraryrdquo

httpsdevelopernvidiacomcudnn[65] S Axelsson ldquoBase-rate fallacy and its implications for the

difficulty of intrusion detectionrdquo in Proceedings of the 1999 6thACM Conference on Computer and Communications Security(ACM CCS) pp 1ndash7 November 1999

[66] Z Zeng and J Gao ldquoImproving SVM classification withimbalance data setrdquo in International Conference on NeuralInformation Processing pp 389ndash398 Springer 2009

[67] M Kubat and S Matwin ldquoAddressing the curse of imbalancedtraining sets one-sided selectionrdquo in Proceedings of the 14thInternational Conference on Machine Learning (ICML vol 97pp 179ndash186 Nashville USA 1997

[68] S Boughorbel F Jarray and M El-Anbari ldquoOptimal classifierfor imbalanced data using Matthews Correlation Coefficientmetricrdquo PLoS ONE vol 12 no 6 p e0177678 2017

[69] B W Matthews ldquoComparison of the predicted and observedsecondary structure of T4 phage lysozymerdquo Biochimica etBiophysica Acta (BBA) - Protein Structure vol 405 no 2 pp442ndash451 1975

[70] WWDaniel ldquoFriedman two-way analysis of variance by ranksrdquoin Applied Nonparametric Statistics pp 262ndash274 PWS-KentBoston 1990

[71] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics Bulletin JSTOR vol 1 no 6 pp 80ndash83 1945

[72] J Demsar ldquoStatistical comparisons of classifiers over multipledata setsrdquo Journal of Machine Learning Research vol 7 pp 1ndash302006

[73] C Cortes andM Mohri ldquoAUC optimization vs error rate min-imizationrdquo Advances in Neural Information Processing Systemspp 313ndash320 2004

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 20: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

20 Security and Communication Networks

Table 8 The results of statistical tests

MeasurementsFriedman Test Wilcoxon Test

p1 p2 p3FS FC W P-value W P-value W P-value

TP 12 7 0 00025 0 00025 0 00025FP 12 7 0 00025 0 00025 0 00025TN 12 7 0 00025 0 00025 0 00025FN 12 7 0 00025 0 00025 0 00025

deep learning models got high g-mean percentages for alldata configurations The same thing happened with MCCmetric where all the used deep learningmodels recorded highpercentages for all data configurations except PU Truncated

In order to give a further inspection of the results inTable 7 we also performed two well-known statistical testsnamely Friedman and Wilcoxon tests The Friedman testis a nonparametric test for finding the differences betweenthree or more repeated samples (or treatments) [70] Non-parametric test means that the test does not assume yourdata comes from a particular distribution In our casewe have three repeated treatments (k=3) each for one ofthe used deep learning models and six subjects (N=6) inevery treatment that each subject of them is related toone of the used data configurations The null hypothesis ofFriedman test is that the treatments all have identical effectsMathematically we can reject the null hypothesis if and onlyif the calculated Friedman test statistic (FS) is larger thanthe critical Friedman test value (FC) On the other handWilcoxon test which refers to either the Rank Sum test orthe Signed Rank test is a nonparametric test that comparestwo paired groups (k=2) [71] The test essentially calculatesthe difference between each set of pairs and analyzes thesedifferences In our case we have six subjects (N=6) in everytreatment and three paired groups namely p1=(DNNLSTM-RNN) p2=(DNNCNN) and p3=(LSTM-RNNCNN) Thenull hypothesis of Wilcoxon test is the median differenceof zero Mathematically we can reject the null hypothesisif and only if the probability (P value) which is computedusing Wilcoxon test statistic (W) is smaller than a particularsignificance level (120572) We selected 120572=005 because it isfairly common Table 8 presents the results of Friedman andWilcoxon tests for TP FP TN and FN measurements

It can be noticed from Table 8 that we can reject thenull hypothesis of the Friedman test in all cases becauseFSgtFC This means that the scores of the used deep learningmodels for each measurement are different One way tointerpret the results of Friedman test visually is to plot theCritical Difference Diagram [72] Figure 11 shows the CriticalDifference Diagram of the used deep learning models Inour study we got the Critical Difference (CD) value equal to13533 Also from Table 8 we can reject the null hypothesisof the Wilcoxon test because P value is smaller than alphalevel (00025lt005) in all casesThus we can say that we havestatically significant evidence that medians of every pairedgroup are different Finally the reason of the same results ofall measurements is thatmodels in order (CNN LSTM-RNN

CD

1

2

3DNN CNN

LSTM-RNN

3 2 1

Figure 11TheCriticalDifferenceDiagramof the used deep learningmodels on all data configurations

DNN) have higher scores in TP and TN as well as smallerscores in FP and FN on all data configurations

Figures 12(a) 12(b) 12(c) 12(d) and 12(e) show com-parison between the performance of traditional machinelearning models and the used deep learning models in termsof Hit and FAR percentages for SEA SEA 1v49 GreenbergTruncated Greenberg Enriched and PU Enriched respec-tively We obtained Hit and FAR percentages for traditionalmachine learning models from Table 1 as the best resultsin the literature The difference between the performanceof traditional machine learning and the used deep learningmodels can be perceived obviously DNN LSTM-RNN andCNN outperformed all traditional machine learning modelsdue to a PSO-based algorithm for hyperparameters selectionused with DNN and LSTM-RNN as well as the featurelearning mechanism used with CNN In addition to thatdeep learning models have deeper structures than traditionalmachine learning models The used deep learning modelsincreased considerably Hit percentages by 2-10 as well asdecreased FAR percentages by 1-10 from those in traditionalmachine learning models in most cases

62 ROC Curves Analysis Receiver operating characteristic(ROC) curve is a plot of values of the True Positive Rate (orHit) on Y-axis against the False Positive Rate (or FAR) onX-axis It is widely used for evaluating the performance ofdifferent machine learning algorithms and to show the trade-off between them in order to choose the optimal classifierThe diagonal line of ROC is the reference line which meansthat 50 of performance is achieved The top-left cornerof ROC means the best performance with 100 Figure 13depicts ROC curves of the average performance of each of theused deep learning models over all data configurations ROC

Security and Communication Networks 21

0102030405060708090

100(

)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

ModelsHitFAR

HMM

(a)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

HitFAR

0102030405060708090

100

()

(b)

Naive Bayes SVM DNN LSTM-RNN CNNModels

HitFAR

0102030405060708090

100

()

(c)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

0102030405060708090

100

()

HitFAR

(d)

Tree-based ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

0102030405060708090

100

()

HitFAR

(e)

Figure 12 Models performance comparison for each data configuration (a) SEA (b) SEA 1v49 (c) Greenberg Truncated (d) GreenbergEnriched (e) PU Enriched

curves show that models in the order CNN LSTM-RNN andDNN have the effective masquerade detection performanceover all data configurations However all these three deeplearning models still have a pretty good fit

The area under curve (AUC) is also considered as a well-known measure to compare quantitatively between variousROC curves [73] AUC value of a ROC curve should bebetween 0 and 1The ideal classifierwill haveAUCvalue equalto 1 Table 9 presents AUC values of ROC curves of the usedthree deep learning models which are plotted in Figure 13

We can notice clearly that all these models have very highAUC values that almost reach 1 which means that theireffectiveness to detect masqueraders on UNIX commandline-based datasets is highly acceptable

7 Conclusions

Masquerade detection is one of the most important issues incomputer security field Even various research studies havebeen focused on masquerade detection for more than one

22 Security and Communication Networks

Table 9 AUC values of ROC curves of the used models

Model AUCDNN 09246LSTM-RNN 09385CNN 09617

CNNLSTM-RNNDNN

0

01

02

03

04

05

06

07

08

09

1

True

Pos

itive

Rat

e

01 02 03 04 05 06 07 08 09 10False Positive Rate

Figure 13 ROC curves of the average performance of the usedmodels over all data configurations

decade but the existence of a deep study in that field utilizingdeep learning models is seldom In this paper we presentedan extensive empirical study for masquerade detection usingDNN LSTM-RNN and CNN models We utilized threeUNIX command line datasets which are the mostly used inthe literature In addition to that we implemented six differ-ent data configurations from these datasets The masqueradedetection on these data configurations is carried out usingtwo approaches the first is static and the second is dynamicMeanwhile the static approach is performed by using DNNand LSTM-RNN models which are applied on data con-figurations with static numeric features and the dynamicapproach is performed by using CNN model that extractedfeatures from userrsquos command text files dynamically In orderto solve the problem of hyperparameters selection as well asto gain high performance we also proposed a PSO-basedalgorithm for optimizing hyperparameters of DNN Theproposed PSO-based algorithm seeks to maximize accuracyand is used in the experiments of bothDNN and LSTM-RNNmodels Moreover we employed twelve well-known evalu-ation metrics and statistical tests to assess the performanceof the used models and analyzed the experimental resultsusing performance analysis and ROC curves analysis Ourresults show that the used models performed achievement

in masquerade detection regarding the used datasets andoutperformed the performance of all traditional machinelearning methods in terms of all evaluation metrics Fur-thermore CNN model is superior to both DNN and LSTM-RNN models on all data configurations which means thatthe dynamic masquerade detection is better than the staticone However the results analyses proved the effectiveness ofall used models in masquerade detection in such a way thatthey increased Accuracy and Hit as well as decreased FARpercentages by 1-10 Finally according to the results we canargue that deep learning models seem to be highly promisingtools that can be used in the cyber security field For futurework we recommended extending this work by studying theeffectiveness of deep learning models in intrusion detectionfor both network and cloud environments

Data Availability

Thedata used to support the findings of this study are free andpublicly available on Internet UNIX command line-baseddatasets which are used in this study can be downloaded fromthe following websites SEA dataset at httpwwwschonlaunetintrusionhtml Greenberg dataset upon a request fromits owner at httpsaulcpscucalgarycapmwikiphpHCIRe-sourcesUnixDataReadme and PU dataset at httpkddicsuciedu

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] L Huang A study on masquerade detection 2010 A study onmasquerade detection

[2] M Bertacchini and P Fierens ldquoA survey on masqueraderdetection approachesrdquo in Proceedings of V Congreso Iberoamer-icano de Seguridad Informatica Universidad de la Republica deUruguay 2008

[3] R F Erbacher S Prakash C L Claar and J Couraud ldquoIntru-sion Detection Detecting Masquerade Attacks Using UNIXCommand Linesrdquo in Proceedings of the 6th Annual SecurityConference Las Vegas NV USA April 2007

[4] L Deng ldquoA tutorial survey of architectures algorithms andapplications for deep learningrdquo in APSIPA Transactions onSignal and Information Processing vol 3 Cambridge UniversityPress 2014

[5] X Du Y Cai S Wang and L Zhang ldquoOverview of deeplearningrdquo in Proceedings of the 2016 31st Youth Academic AnnualConference of Chinese Association of Automation (YAC) pp 159ndash164 Wuhan Hubei Province China November 2016

[6] J Kim J Kim H L T Thu and H Kim ldquoLong Short TermMemory Recurrent Neural Network Classifier for IntrusionDetectionrdquo in Proceedings of the 3rd International Conferenceon Platform Technology and Service PlatCon 2016 Republic ofKorea February 2016

[7] M Schonlau W DuMouchel W-H Ju A F Karr M Theusand Y Vardi ldquoComputer intrusion detecting masqueradesrdquoStatistical Science vol 16 no 1 pp 58ndash74 2001

Security and Communication Networks 23

[8] T Okamoto T Watanabe and Y Ishida ldquoTowards an immu-nity-based system for detecting masqueradersrdquo in Proceed-ings of the International Conference on Knowledge-Based andIntelligent Information and Engineering Systems pp 488ndash495Springer Berlin Germany 2003

[9] R A Maxion and T N Townsend ldquoMasquerade detectionusing truncated command linesrdquo in Proceedings of the 2002International Conference on Dependable Systems and NetworksDNS 2002 pp 219ndash228 USA June 2002

[10] K Wang and S J Stolfo ldquoOne-class training for masqueradedetectionrdquo in Proceedings of the Workshop on Data Mining forComputer Security pp 10ndash19 Melbourne FL USA 2003

[11] K H Yung ldquoUsing feedback to improve masquerade detec-tionrdquo in Proceedings of the International Conference on AppliedCryptography andNetwork Security pp 48ndash62 Springer BerlinGermany 2003

[12] K H Yung ldquoUsing self-consistent naive-bayes to detect mas-queradesrdquo in Proceedings of the Pacific-Asia Conference onKnowledge Discovery and Data Mining pp 329ndash340 BerlinGermany 2004

[13] L Chen andM Aritsugi ldquoAn svm-based masquerade detectionmethod with online update using co-occurrence matrixrdquo inProceedings of the International Conference on Detection ofIntrusions and Malware and Vulnerability pp 37ndash53 BerlinGermany 2006

[14] Z Li L Zhitang and L Bin ldquoMasquerade detection systembased on correlation eigenmatrix and support vector machinerdquoin Proceedings of the 2006 International Conference on Com-putational Intelligence and Security ICCIAS 2006 pp 625ndash628China October 2006

[15] H-S Kim and S-D Cha ldquoEmpirical evaluation of SVM-basedmasquerade detection using UNIX commandsrdquo Computers ampSecurity vol 24 no 2 pp 160ndash168 2005

[16] S Greenberg ldquoUsing Unix Collected traces of 168 usersrdquo8833345 Department of Computer Science University ofCalgary Calgary Canada 1988

[17] R A Maxion ldquoMasquerade Detection Using Enriched Com-mand Linesrdquo in Proceedings of the 2003 International Conferenceon Dependable Systems and Networks pp 5ndash14 USA June 2003

[18] M Yang H Zhang and H J Cai ldquoMasquerade detection usingstring kernelsrdquo in Proceedings of the 2007 International Con-ference on Wireless Communications Networking and MobileComputing WiCOM 2007 pp 3676ndash3679 China September2007

[19] T Lane and C E Brodley ldquoAn application of machine learningto anomaly detectionrdquo in Proceedings of the 20th NationalInformation Systems Security Conference vol 377 pp 366ndash380Baltimore USA 1997

[20] M Gebski and R K Wong ldquoIntrusion detection via analy-sis and modelling of user commandsrdquo in Proceedings of theInternational Conference on Data Warehousing and KnowledgeDiscovery pp 388ndash397 Berlin Germany 2005

[21] K V Reddy and N Pushpalatha ldquoConditional naive-bayes todetect masqueradesrdquo International Journal of Computer Scienceand Engineering (IJCSE) vol 3 no 3 pp 13ndash22 2014

[22] L Liu J Luo X Deng and S Li ldquoFPGA-based Accelerationof Deep Neural Networks Using High Level Methodrdquo inProceedings of the 10th International Conference on P2P ParallelGrid Cloud and Internet Computing 3PGCIC 2015 pp 824ndash827Poland November 2015

[23] J S Bergstra R Bardenet Y Bengio et al ldquoAlgorithms forHyper-Parameter optimizationrdquo Advances in Neural Informa-tion Processing Systems pp 2546ndash2554 2011

[24] J Bergstra and Y Bengio ldquoRandom search for hyper-parameteroptimizationrdquo Journal of Machine Learning Research vol 13 pp281ndash305 2012

[25] J Snoek H Larochelle and R P Adams ldquoPractical Bayesianoptimization of machine learning algorithmsrdquo in Proceedings ofthe 26th Annual Conference on Neural Information ProcessingSystems 2012 NIPS 2012 pp 2951ndash2959 USA December 2012

[26] O AhmedAbdalla A Osman Elfaki and Y MohammedAlMurtadha ldquoOptimizing the Multilayer Feed-Forward Arti-ficial Neural Networks Architecture and Training Parametersusing Genetic Algorithmrdquo International Journal of ComputerApplications vol 96 no 10 pp 42ndash48 2014

[27] S Belharbi R Herault C Chatelain and S Adam ldquoDeepMulti-Task Learning with evolving weightsrdquo in Proceedings ofthe 24th European Symposium on Artificial Neural NetworksComputational Intelligence andMachine Learning ESANN 2016pp 141ndash146 Belgium April 2016

[28] S S Tirumala S Ali and C P Ramesh ldquoEvolving deep neuralnetworks A new prospectrdquo in Proceedings of the 12th Inter-national Conference on Natural Computation Fuzzy Systemsand Knowledge Discovery ICNC-FSKD 2016 pp 69ndash74 ChinaAugust 2016

[29] O E David and I Greental ldquoGenetic algorithms for evolvingdeep neural networksrdquo in Proceedings of the 16th Genetic andEvolutionary Computation Conference GECCO 2014 pp 1451-1452 Canada July 2014

[30] A Martin F Fuentes-Hurtado V Naranjo and D CamacholdquoEvolving Deep Neural Networks architectures for Androidmalware classificationrdquo in Proceedings of the 2017 IEEE Congresson Evolutionary Computation CEC 2017 pp 1659ndash1666 SpainJune 2017

[31] P R Lorenzo J Nalepa M Kawulok L S Ramos and JR Pastor ldquoParticle swarm optimization for hyper-parameterselection in deep neural networksrdquo in Proceedings of the 2017Genetic and Evolutionary Computation Conference GECCO2017 pp 481ndash488 New York NY USA July 2017

[32] P R Lorenzo J Nalepa L S Ramos and J R Pastor ldquoHyper-parameter selection in deep neural networks using parallelparticle swarm optimizationrdquo in Proceedings of the 2017 Geneticand Evolutionary Computation Conference Companion GECCO2017 pp 1864ndash1871 New York NY USA July 2017

[33] J Nalepa and P R Lorenzo ldquoConvergence Analysis of PSO forHyper-Parameter Selectionrdquo in Proceedings of the InternationalConference on P2P Parallel Grid Cloud and Internet Comput-ing pp 284ndash295 Springer 2017

[34] F Ye andW Du ldquoParticle swarm optimization-based automaticparameter selection for deep neural networks and its applica-tions in large-scale and high-dimensional datardquo PLoS ONE vol12 no 12 p e0188746 2017

[35] R C Eberhart and J Kennedy ldquoA new optimizer using particleswarm theoryrdquo in Proceedings of the 6th International Sympo-sium on Micro Machine and Human Science (MHS rsquo95) pp 39ndash43 Nagoya Japan October 1995

[36] H J Escalante M Montes and L E Sucar ldquoParticle swarmmodel selectionrdquo Journal of Machine Learning Research vol 10pp 405ndash440 2009

24 Security and Communication Networks

[37] Y Shi and R C Eberhart ldquoParameter selection in particleswarm optimizationrdquo in Proceedings of the International con-ference on evolutionary programming pp 591ndash600 SpringerBerlin Germany 1998

[38] Y Shi and R C Eberhart ldquoEmprirical study of particle swarmoptimizationrdquo in Proceedings of the 1999 congress on IEEEEvolutionary computation CEC 9 vol 3 pp 1945ndash1950 1999

[39] J Kennedy and R Mendes ldquoPopulation structure and particleswarm performancerdquo in Proceedings of the Congress on Evolu-tionary Computation pp 1671ndash1676 Honolulu HI USA May2002

[40] M Clerc and J Kennedy ldquoThe particle swarm-explosion sta-bility and convergence in a multidimensional complex spacerdquoIEEE Transactions on Evolutionary Computation vol 6 no 1pp 58ndash73 2002

[41] C Yin Y Zhu J Fei and X He ldquoADeep Learning Approach forIntrusion Detection Using Recurrent Neural Networksrdquo IEEEAccess vol 5 pp 21954ndash21961 2017

[42] Y Bengio P Simard and P Frasconi ldquoLearning long-termdependencies with gradient descent is difficultrdquo IEEE Transac-tions on Neural Networks and Learning Systems vol 5 no 2 pp157ndash166 1994

[43] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural Computation vol 9 no 8 pp 1735ndash1780 1997

[44] Y LeCun L Bottou Y Bengio and P Haffner ldquoGradient-basedlearning applied to document recognitionrdquo Proceedings of theIEEE vol 86 no 11 pp 2278ndash2323 1998

[45] X Zhang and Y LeCun ldquoText Understanding from scratchrdquohttpsarxivorgabs150201710v5

[46] C C Aggarwal and C Zhai ldquoA survey of text classificationalgorithmsrdquo inMining Text Data pp 163ndash222 Springer BostonMA USA 2012

[47] Y Zhang and B Wallace ldquoA sensitivity analysis of (and prac-titionersrsquo guide to) convolutional neural networks for sentenceclassificationrdquo httpsarxivorgabs151003820

[48] Y Kim ldquoConvolutional neural networks for sentence classifica-tionrdquo httpsarxivorgabs14085882

[49] R Johnson and T Zhang ldquoEffective Use of Word Order forText Categorization with Convolutional Neural Networksrdquo inProceedings of the 2015 Conference of the North AmericanChapter of theAssociation for Computational LinguisticsHumanLanguage Technologies pp 103ndash112 Denver Colorado 2015

[50] X Zhang J Zhao and Y LeCun ldquoCharacter-level Convolu-tional Networks for Text Classificationrdquo Advances in NeuralInformation Processing Systems pp 649ndash657 2015

[51] K Kowsari D E Brown M Heidarysafa K Jafari MeimandiM S Gerber and L E Barnes ldquoHDLTex Hierarchical DeepLearning for Text Classificationrdquo in Proceedings of the 2017 16thIEEE International Conference on Machine Learning and Appli-cations (ICMLA) pp 364ndash371 CancunMexicoDecember 2017

[52] S Lai L Xu K Liu and J Zhao ldquoRecurrent ConvolutionalNeural Networks for Text Classificationrdquo AAAI vol 333 pp2267ndash2273 2015

[53] P Liu XQiu andXHuang ldquoRecurrentNeurlNetwork for TextClassification with Multi-Task Learningrdquo httpsarxivorgabs160505101v1

[54] Z Yang D Yang C Dyer X He A Smola and E HovyldquoHierarchical attention networks for document classificationrdquoin Proceedings of the 2016 Conference of the North AmericanChapter of the Association for Computational Linguistics pp1480ndash1489 Human Language Technologies June 2016

[55] J D Prusa and T M Khoshgoftaar ldquoImproving deep neuralnetwork design with new text data representationsrdquo Journal ofBig Data vol 4 no 1 2017

[56] S Albelwi and A Mahmood ldquoA Framework for Designingthe Architectures of Deep Convolutional Neural NetworksrdquoEntropy vol 19 no 6 p 242 2017

[57] ldquoPythonrdquo httpswwwpythonorg[58] ldquoNumPyrdquo httpwwwnumpyorg[59] F Chollet ldquoKerasrdquo 2015 httpsgithubcomfcholletkeras[60] ldquoKerasrdquo httpskerasio[61] M Abadi A Agarwal P Barham et al ldquoTensorflow Large-

scale machine learning on heterogeneous distributed systemsrdquohttpsarxivorgabs160304467v2

[62] TensorFlow httpswwwtensorfloworg[63] ldquoCUDA- Compute Unified Device Architecturerdquo httpsdevel-

opernvidiacomabout-cuda[64] ldquocuDNN- The NVIDIA CUDA Deep Neural Network libraryrdquo

httpsdevelopernvidiacomcudnn[65] S Axelsson ldquoBase-rate fallacy and its implications for the

difficulty of intrusion detectionrdquo in Proceedings of the 1999 6thACM Conference on Computer and Communications Security(ACM CCS) pp 1ndash7 November 1999

[66] Z Zeng and J Gao ldquoImproving SVM classification withimbalance data setrdquo in International Conference on NeuralInformation Processing pp 389ndash398 Springer 2009

[67] M Kubat and S Matwin ldquoAddressing the curse of imbalancedtraining sets one-sided selectionrdquo in Proceedings of the 14thInternational Conference on Machine Learning (ICML vol 97pp 179ndash186 Nashville USA 1997

[68] S Boughorbel F Jarray and M El-Anbari ldquoOptimal classifierfor imbalanced data using Matthews Correlation Coefficientmetricrdquo PLoS ONE vol 12 no 6 p e0177678 2017

[69] B W Matthews ldquoComparison of the predicted and observedsecondary structure of T4 phage lysozymerdquo Biochimica etBiophysica Acta (BBA) - Protein Structure vol 405 no 2 pp442ndash451 1975

[70] WWDaniel ldquoFriedman two-way analysis of variance by ranksrdquoin Applied Nonparametric Statistics pp 262ndash274 PWS-KentBoston 1990

[71] F Wilcoxon ldquoIndividual comparisons by ranking methodsrdquoBiometrics Bulletin JSTOR vol 1 no 6 pp 80ndash83 1945

[72] J Demsar ldquoStatistical comparisons of classifiers over multipledata setsrdquo Journal of Machine Learning Research vol 7 pp 1ndash302006

[73] C Cortes andM Mohri ldquoAUC optimization vs error rate min-imizationrdquo Advances in Neural Information Processing Systemspp 313ndash320 2004

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 21: Deep Learning Approaches for Predictive Masquerade Detectiondownloads.hindawi.com/journals/scn/2018/9327215.pdf · called misuse detection is valuable to use when the mas-querade

Security and Communication Networks 21

0102030405060708090

100(

)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

ModelsHitFAR

HMM

(a)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

HitFAR

0102030405060708090

100

()

(b)

Naive Bayes SVM DNN LSTM-RNN CNNModels

HitFAR

0102030405060708090

100

()

(c)

Naive Bayes ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

0102030405060708090

100

()

HitFAR

(d)

Tree-based ConditionalNaive Bayes

SVM DNN LSTM-RNN CNN

Models

0102030405060708090

100

()

HitFAR

(e)

Figure 12 Models performance comparison for each data configuration (a) SEA (b) SEA 1v49 (c) Greenberg Truncated (d) GreenbergEnriched (e) PU Enriched

curves show that models in the order CNN LSTM-RNN andDNN have the effective masquerade detection performanceover all data configurations However all these three deeplearning models still have a pretty good fit

The area under curve (AUC) is also considered as a well-known measure to compare quantitatively between variousROC curves [73] AUC value of a ROC curve should bebetween 0 and 1The ideal classifierwill haveAUCvalue equalto 1 Table 9 presents AUC values of ROC curves of the usedthree deep learning models which are plotted in Figure 13

We can notice clearly that all these models have very highAUC values that almost reach 1 which means that theireffectiveness to detect masqueraders on UNIX commandline-based datasets is highly acceptable

7 Conclusions

Masquerade detection is one of the most important issues incomputer security field Even various research studies havebeen focused on masquerade detection for more than one

22 Security and Communication Networks

Table 9 AUC values of ROC curves of the used models

Model AUCDNN 09246LSTM-RNN 09385CNN 09617

CNNLSTM-RNNDNN

0

01

02

03

04

05

06

07

08

09

1

True

Pos

itive

Rat

e

01 02 03 04 05 06 07 08 09 10False Positive Rate

Figure 13 ROC curves of the average performance of the usedmodels over all data configurations

7. Conclusions

Masquerade detection is one of the most important issues in the computer security field. Although various research studies have focused on masquerade detection for more than a decade, in-depth studies in this field that utilize deep learning models remain rare. In this paper, we presented an extensive empirical study of masquerade detection using DNN, LSTM-RNN, and CNN models. We utilized three UNIX command line datasets, which are the most commonly used in the literature, and implemented six different data configurations from them. Masquerade detection on these data configurations is carried out using two approaches, one static and one dynamic: the static approach employs the DNN and LSTM-RNN models, which are applied to data configurations with static numeric features, whereas the dynamic approach employs the CNN model, which extracts features from users' command text files dynamically. In order to solve the problem of hyperparameter selection as well as to attain high performance, we also proposed a PSO-based algorithm for optimizing the hyperparameters of the DNN. The proposed PSO-based algorithm seeks to maximize accuracy and is used in the experiments of both the DNN and LSTM-RNN models. Moreover, we employed twelve well-known evaluation metrics and statistical tests to assess the performance of the used models, and we analyzed the experimental results using performance analysis and ROC curve analysis. Our results show that the used models performed strongly in masquerade detection on the used datasets and outperformed all traditional machine learning methods in terms of all evaluation metrics. Furthermore, the CNN model is superior to both the DNN and LSTM-RNN models on all data configurations, which means that dynamic masquerade detection is better than the static approach. Overall, the results analyses proved the effectiveness of all used models in masquerade detection, in that they increased Accuracy and Hit and decreased FAR percentages by 1–10%. Finally, according to the results, we can argue that deep learning models are highly promising tools for the cyber security field. For future work, we recommend extending this study to the effectiveness of deep learning models in intrusion detection for both network and cloud environments.
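To make the summarized PSO-based hyperparameter selection more tangible, the following is a minimal, schematic sketch of a PSO search loop. It is not the exact algorithm proposed in this paper: the fitness function is a stand-in for training a model and returning its validation accuracy, and all constants, bounds, and dimensions are illustrative assumptions.

```python
# Schematic PSO hyperparameter search; fitness() is a placeholder for
# "train the network with these hyperparameters, return validation accuracy".
import numpy as np

rng = np.random.default_rng(0)

def fitness(params):
    # Stand-in objective with a known optimum near lr=1e-3, units=128.
    lr, units = params
    return -((np.log10(lr) + 3) ** 2) - ((units - 128) / 64) ** 2

low  = np.array([1e-5, 16.0])    # search bounds: [learning rate, hidden units]
high = np.array([1e-1, 512.0])

n, iters, w, c1, c2 = 10, 30, 0.7, 1.5, 1.5   # illustrative PSO constants
x = rng.uniform(low, high, size=(n, 2))        # particle positions
v = np.zeros_like(x)                           # particle velocities
pbest = x.copy()                               # per-particle best positions
pbest_f = np.array([fitness(p) for p in x])
gbest = pbest[np.argmax(pbest_f)]              # global best position

for _ in range(iters):
    r1, r2 = rng.random((n, 2)), rng.random((n, 2))
    # Standard velocity update: inertia + cognitive + social components.
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = np.clip(x + v, low, high)              # keep particles within bounds
    f = np.array([fitness(p) for p in x])
    improved = f > pbest_f
    pbest[improved], pbest_f[improved] = x[improved], f[improved]
    gbest = pbest[np.argmax(pbest_f)]

print("best (learning rate, hidden units):", gbest)
```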

Data Availability

The data used to support the findings of this study are free and publicly available on the Internet. The UNIX command line-based datasets used in this study can be downloaded from the following websites: the SEA dataset at http://www.schonlau.net/intrusion.html; the Greenberg dataset, upon a request to its owner, at http://saul.cpsc.ucalgary.ca/pmwiki.php/HCIResources/UnixDataReadme; and the PU dataset at http://kdd.ics.uci.edu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] L. Huang, A Study on Masquerade Detection, 2010.
[2] M. Bertacchini and P. Fierens, "A survey on masquerader detection approaches," in Proceedings of V Congreso Iberoamericano de Seguridad Informatica, Universidad de la Republica de Uruguay, 2008.
[3] R. F. Erbacher, S. Prakash, C. L. Claar, and J. Couraud, "Intrusion Detection: Detecting Masquerade Attacks Using UNIX Command Lines," in Proceedings of the 6th Annual Security Conference, Las Vegas, NV, USA, April 2007.
[4] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," in APSIPA Transactions on Signal and Information Processing, vol. 3, Cambridge University Press, 2014.
[5] X. Du, Y. Cai, S. Wang, and L. Zhang, "Overview of deep learning," in Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 159–164, Wuhan, Hubei Province, China, November 2016.
[6] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection," in Proceedings of the 3rd International Conference on Platform Technology and Service, PlatCon 2016, Republic of Korea, February 2016.
[7] M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus, and Y. Vardi, "Computer intrusion: detecting masquerades," Statistical Science, vol. 16, no. 1, pp. 58–74, 2001.
[8] T. Okamoto, T. Watanabe, and Y. Ishida, "Towards an immunity-based system for detecting masqueraders," in Proceedings of the International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 488–495, Springer, Berlin, Germany, 2003.
[9] R. A. Maxion and T. N. Townsend, "Masquerade detection using truncated command lines," in Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN 2002, pp. 219–228, USA, June 2002.
[10] K. Wang and S. J. Stolfo, "One-class training for masquerade detection," in Proceedings of the Workshop on Data Mining for Computer Security, pp. 10–19, Melbourne, FL, USA, 2003.
[11] K. H. Yung, "Using feedback to improve masquerade detection," in Proceedings of the International Conference on Applied Cryptography and Network Security, pp. 48–62, Springer, Berlin, Germany, 2003.
[12] K. H. Yung, "Using self-consistent naive-bayes to detect masquerades," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329–340, Berlin, Germany, 2004.
[13] L. Chen and M. Aritsugi, "An SVM-based masquerade detection method with online update using co-occurrence matrix," in Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 37–53, Berlin, Germany, 2006.
[14] Z. Li, L. Zhitang, and L. Bin, "Masquerade detection system based on correlation eigenmatrix and support vector machine," in Proceedings of the 2006 International Conference on Computational Intelligence and Security, ICCIAS 2006, pp. 625–628, China, October 2006.
[15] H.-S. Kim and S.-D. Cha, "Empirical evaluation of SVM-based masquerade detection using UNIX commands," Computers & Security, vol. 24, no. 2, pp. 160–168, 2005.
[16] S. Greenberg, "Using Unix: Collected traces of 168 users," Research Report 88/333/45, Department of Computer Science, University of Calgary, Calgary, Canada, 1988.
[17] R. A. Maxion, "Masquerade Detection Using Enriched Command Lines," in Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp. 5–14, USA, June 2003.
[18] M. Yang, H. Zhang, and H. J. Cai, "Masquerade detection using string kernels," in Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2007, pp. 3676–3679, China, September 2007.
[19] T. Lane and C. E. Brodley, "An application of machine learning to anomaly detection," in Proceedings of the 20th National Information Systems Security Conference, vol. 377, pp. 366–380, Baltimore, USA, 1997.
[20] M. Gebski and R. K. Wong, "Intrusion detection via analysis and modelling of user commands," in Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, pp. 388–397, Berlin, Germany, 2005.
[21] K. V. Reddy and N. Pushpalatha, "Conditional naive-bayes to detect masquerades," International Journal of Computer Science and Engineering (IJCSE), vol. 3, no. 3, pp. 13–22, 2014.
[22] L. Liu, J. Luo, X. Deng, and S. Li, "FPGA-based Acceleration of Deep Neural Networks Using High Level Method," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, 3PGCIC 2015, pp. 824–827, Poland, November 2015.
[23] J. S. Bergstra, R. Bardenet, Y. Bengio et al., "Algorithms for Hyper-Parameter optimization," Advances in Neural Information Processing Systems, pp. 2546–2554, 2011.
[24] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012.
[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, pp. 2951–2959, USA, December 2012.
[26] O. AhmedAbdalla, A. Osman Elfaki, and Y. MohammedAlMurtadha, "Optimizing the Multilayer Feed-Forward Artificial Neural Networks Architecture and Training Parameters using Genetic Algorithm," International Journal of Computer Applications, vol. 96, no. 10, pp. 42–48, 2014.
[27] S. Belharbi, R. Herault, C. Chatelain, and S. Adam, "Deep Multi-Task Learning with evolving weights," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2016, pp. 141–146, Belgium, April 2016.
[28] S. S. Tirumala, S. Ali, and C. P. Ramesh, "Evolving deep neural networks: A new prospect," in Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, ICNC-FSKD 2016, pp. 69–74, China, August 2016.
[29] O. E. David and I. Greental, "Genetic algorithms for evolving deep neural networks," in Proceedings of the 16th Genetic and Evolutionary Computation Conference, GECCO 2014, pp. 1451-1452, Canada, July 2014.
[30] A. Martin, F. Fuentes-Hurtado, V. Naranjo, and D. Camacho, "Evolving Deep Neural Networks architectures for Android malware classification," in Proceedings of the 2017 IEEE Congress on Evolutionary Computation, CEC 2017, pp. 1659–1666, Spain, June 2017.
[31] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Pastor, "Particle swarm optimization for hyper-parameter selection in deep neural networks," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference, GECCO 2017, pp. 481–488, New York, NY, USA, July 2017.
[32] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, "Hyper-parameter selection in deep neural networks using parallel particle swarm optimization," in Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO 2017, pp. 1864–1871, New York, NY, USA, July 2017.
[33] J. Nalepa and P. R. Lorenzo, "Convergence Analysis of PSO for Hyper-Parameter Selection," in Proceedings of the International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pp. 284–295, Springer, 2017.
[34] F. Ye and W. Du, "Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data," PLoS ONE, vol. 12, no. 12, p. e0188746, 2017.
[35] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proceedings of the 6th International Symposium on Micro Machine and Human Science (MHS '95), pp. 39–43, Nagoya, Japan, October 1995.
[36] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," Journal of Machine Learning Research, vol. 10, pp. 405–440, 2009.
[37] Y. Shi and R. C. Eberhart, "Parameter selection in particle swarm optimization," in Proceedings of the International Conference on Evolutionary Programming, pp. 591–600, Springer, Berlin, Germany, 1998.
[38] Y. Shi and R. C. Eberhart, "Empirical study of particle swarm optimization," in Proceedings of the 1999 IEEE Congress on Evolutionary Computation, CEC 99, vol. 3, pp. 1945–1950, 1999.
[39] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the Congress on Evolutionary Computation, pp. 1671–1676, Honolulu, HI, USA, May 2002.
[40] M. Clerc and J. Kennedy, "The particle swarm: explosion, stability, and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, no. 1, pp. 58–73, 2002.
[41] C. Yin, Y. Zhu, J. Fei, and X. He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks," IEEE Access, vol. 5, pp. 21954–21961, 2017.
[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks and Learning Systems, vol. 5, no. 2, pp. 157–166, 1994.
[43] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[45] X. Zhang and Y. LeCun, "Text Understanding from scratch," https://arxiv.org/abs/1502.01710v5.
[46] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163–222, Springer, Boston, MA, USA, 2012.
[47] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," https://arxiv.org/abs/1510.03820.
[48] Y. Kim, "Convolutional neural networks for sentence classification," https://arxiv.org/abs/1408.5882.
[49] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 103–112, Denver, Colorado, 2015.
[50] X. Zhang, J. Zhao, and Y. LeCun, "Character-level Convolutional Networks for Text Classification," Advances in Neural Information Processing Systems, pp. 649–657, 2015.
[51] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364–371, Cancun, Mexico, December 2017.
[52] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," AAAI, vol. 333, pp. 2267–2273, 2015.
[53] P. Liu, X. Qiu, and X. Huang, "Recurrent Neural Network for Text Classification with Multi-Task Learning," https://arxiv.org/abs/1605.05101v1.
[54] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489, June 2016.
[55] J. D. Prusa and T. M. Khoshgoftaar, "Improving deep neural network design with new text data representations," Journal of Big Data, vol. 4, no. 1, 2017.
[56] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, 2017.
[57] "Python," https://www.python.org.
[58] "NumPy," http://www.numpy.org.
[59] F. Chollet, "Keras," 2015, https://github.com/fchollet/keras.
[60] "Keras," https://keras.io.
[61] M. Abadi, A. Agarwal, P. Barham et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," https://arxiv.org/abs/1603.04467v2.
[62] "TensorFlow," https://www.tensorflow.org.
[63] "CUDA - Compute Unified Device Architecture," https://developer.nvidia.com/about-cuda.
[64] "cuDNN - The NVIDIA CUDA Deep Neural Network library," https://developer.nvidia.com/cudnn.
[65] S. Axelsson, "Base-rate fallacy and its implications for the difficulty of intrusion detection," in Proceedings of the 1999 6th ACM Conference on Computer and Communications Security (ACM CCS), pp. 1–7, November 1999.
[66] Z. Zeng and J. Gao, "Improving SVM classification with imbalance data set," in International Conference on Neural Information Processing, pp. 389–398, Springer, 2009.
[67] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," in Proceedings of the 14th International Conference on Machine Learning (ICML), vol. 97, pp. 179–186, Nashville, USA, 1997.
[68] S. Boughorbel, F. Jarray, and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS ONE, vol. 12, no. 6, p. e0177678, 2017.
[69] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
[70] W. W. Daniel, "Friedman two-way analysis of variance by ranks," in Applied Nonparametric Statistics, pp. 262–274, PWS-Kent, Boston, 1990.
[71] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, JSTOR, vol. 1, no. 6, pp. 80–83, 1945.
[72] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
[73] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," Advances in Neural Information Processing Systems, pp. 313–320, 2004.
