

UPTEC IT 19004
Examensarbete 30 hp, Juni 2019

Botnet detection on flow data using the reconstruction error from Autoencoders trained on Word2Vec network embeddings

Kasper Ramström

Institutionen för informationsteknologi
Department of Information Technology



Abstract

Botnet detection on flow data using the reconstruction error from Autoencoders trained on Word2Vec network embeddings

Kasper Ramström

Botnet network attacks are a growing issue in network security. These attacks are carried out by networks of compromised devices which are used for malicious activities. Many traditional systems use pre-defined pattern matching methods for detecting network intrusions based on the characteristics of previously seen attacks. This means that previously unseen attacks often go unnoticed, as they do not exhibit the patterns that the traditional systems are looking for.

This paper proposes an anomaly detection approach that does not rely on the characteristics of known attacks to detect new ones; instead, it looks for anomalous events which deviate from the normal. The approach uses Word2Vec, a neural network model from the field of Natural Language Processing, and applies it to NetFlow data in order to produce meaningful representations of network features. These representations, together with statistical features, are then fed into an Autoencoder model which attempts to reconstruct the NetFlow data, where poor reconstructions can indicate anomalous data.

The approach was evaluated on multiple flow-based network datasets and the results show that it has potential for botnet detection, with the reconstruction errors serving as a metric for finding botnet events. However, the results vary between datasets, and the approach performs poorly as a botnet detector on some of them, indicating that further investigation is required before real-world use.

Tryckt av (Printed by): Reprocentralen ITC
UPTEC IT 19004
Examinator (Examiner): Lars-Åke Nordén
Ämnesgranskare (Subject reader): Raazesh Sainudiin
Handledare (Supervisor): Håkan Persson


Populärvetenskaplig sammanfattning (Popular Science Summary)

Botnet attacks are a growing problem in network security. These attacks consist of devices that are unknowingly exploited to carry out malicious activities. Many traditional systems use predefined pattern-matching methods to detect network intrusions based on the characteristics of previously seen attacks. This means that attack types whose characteristics do not resemble previously recognized intrusions often pass unnoticed, since they do not exhibit the patterns the traditional systems look for.

This report presents an anomaly detection method that does not use the characteristics of previously recognized attacks to detect new ones. Instead, the method looks for anomalies that deviate from the normal. The method uses Word2Vec, a type of neural network used in natural language processing, and applies it to NetFlow data in order to produce meaningful representations of network features. These representations, together with statistical features, are then fed into an Autoencoder model that tries to reconstruct the NetFlow data, where poor reconstructions can indicate anomalies.

The method was evaluated on several different NetFlow datasets and the results show that the approach has potential for botnet detection, where the reconstructions can be used as a metric for discovering botnet activity. The results vary between datasets, however, and the method performs poorly as a botnet detector on some of them, indicating that further investigation is required before it is applied in real-world situations.


Contents

1 Introduction
  1.1 Purpose
  1.2 Delimitations
2 Background
  2.1 Network security
    2.1.1 Anomaly detection
  2.2 NetFlow
    2.2.1 Publicly available NetFlow datasets
  2.3 Neural networks
  2.4 Word2Vec
    2.4.1 Skip Gram model
  2.5 Autoencoders
3 Related work
  3.1 Network intrusion detection using classification models
  3.2 Network intrusion detection using anomaly detection models
  3.3 Flow2Vec
  3.4 IP2Vec
4 Methods
  4.1 Data selection
  4.2 Feature extraction
  4.3 System structure
    4.3.1 Word2Vec model
    4.3.2 Autoencoder model
  4.4 Experiments
  4.5 Evaluation
5 Results
  5.1 Flow reconstruction error
  5.2 Precision-Recall curves
  5.3 AUC scores
6 Discussion
  6.1 Generalizability of the system
  6.2 Model architectures
  6.3 Dataset selection
7 Conclusion
  7.1 Future work
    7.1.1 Threshold placement
    7.1.2 Advanced Autoencoder architectures
    7.1.3 Use reconstruction error as metric for other models


Glossary

AUC  Area Under Curve
FN   False Negative
FP   False Positive
FPR  False Positive Rate
IDS  Intrusion Detection System
LSTM Long short-term memory
MSE  Mean Squared Error
NLP  Natural Language Processing
PCAP Packet capture
TN   True Negative
TP   True Positive
TPR  True Positive Rate


List of Figures

1  Intrusion Detection System example
2  Neural network example
3  One-hot encoding example
4  Word embeddings example
5  Word2Vec Skip Gram model comparison
6  Autoencoder architecture example
7  System structure
8  Word2Vec and Autoencoder models used
9  Reconstruction error histogram plots part 1
10 Reconstruction error histogram plots part 2
11 Precision-Recall curve part 1
12 Precision-Recall curve part 2
13 AUC scores

List of Tables

1  Related work - supervised results
2  Related work - unsupervised results
3  CTU-13 scenarios used
4  CTU-13 evaluation splits used


1 Introduction

With rapidly increasing numbers of vulnerable, connected devices, network security becomes a more and more prevalent topic [8, 16, 24, p. 1-6]. Protecting against network-based attacks becomes increasingly important for corporations, governments and individuals as more devices are connected to the internet and attacks grow more complex and harmful [28]. This means that traditional Intrusion Detection Systems (IDSs), which rely on identifying similarities with previously seen attacks, and rule-based systems are no longer able to keep up and identify malicious activity. These systems often use manual threshold values based on statistical features, which are a bad fit for systems with a large number of measurements and poorly understood behaviour.

This project proposes a solution that applies Natural Language Processing (NLP) techniques, which can capture deep semantic relationships in language data [22, 23, 32], to network traffic logs in order to extract similar features by treating network communication as words and sentences. With this information extracted, an Autoencoder neural network can be trained on the network traffic logs to reconstruct normal data and identify botnet data as anomalies, without any predefined definitions of botnet data.

The rest of the report is outlined as follows: Section 2 provides background information regarding network security, anomaly detection, neural networks, the NLP techniques that are later applied to network data, and Autoencoders. In Section 3, similar solutions and related work are presented. Section 4 describes the proposed solution, evaluation criteria and experiments for the system, and Section 5 analyzes the results. Finally, the results are discussed in Section 6 and Section 7 provides a conclusion.

1.1 Purpose

This project aims to investigate the potential of combining advanced techniques from the field of NLP with Autoencoders for identifying network threats in the form of botnet traffic, allowing network forensics to handle and protect against them accordingly. Traditionally, this is done using pattern matching algorithms for finding known attacks [8, p. 1-6]. This traditional approach allows domain experts to set up classification systems which handle and prevent previously known attacks based on their characteristics. These systems can use rules or supervised machine learning algorithms which classify data. However, they do not cope with anomalies: data with previously unseen properties that do not fall into existing categories. Instead, unsupervised machine learning models can be trained to identify what normal data looks like and then identify abnormalities, data which deviates from what the model deems normal. The project is done in cooperation with Combient AB, an IT company specialized in providing data science solutions for industry problems. With this project, Combient wants to investigate the feasibility and usefulness of a network anomaly detection system.

1.2 Delimitations

As this project aims to investigate the potential of Autoencoders in combination with NLP techniques for botnet detection, it does not focus on comparing different techniques in order to find the best possible system for botnet detection. Similarly, it offers no solution for handling botnet events, such as preventing intrusions, and gives no instructions on how to incorporate a botnet detector into an IDS. It also does not aim to offer a solution for collecting network data or for monitoring a network and running a botnet detection system in real time. Instead, the project uses pre-existing datasets and analyzes and runs experiments on them in an offline manner, where processing time and similar metrics are not of special concern.


2 Background

This section gives a brief introduction to the concepts and techniques related to network security, including a short review of the area and an introduction to NetFlow and anomaly detection. Additionally, a brief description is given of neural networks as well as two specific neural network types: Word2Vec and Autoencoders.

2.1 Network security

Network security has gained tremendous traction in the last few years due to the large increase in data sent over the Internet [5, p. v-x]. With this large increase in data, IDSs increasingly leverage data science models in multiple stages of network security. This includes directly preventing network attacks, visualizing network behaviour and providing insights for network administrators. However, with network intrusions becoming more frequent and more sophisticated, traditional pattern recognition systems are becoming inadequate.

In order to protect a network against malicious behaviour, one must first capture recordings of the network's traffic. This is often done using sniffing programs such as WireShark, which listen to a network and record each packet sent and received [33, 2, p. 2-14, p. 3-32]. The recording can be done at multiple locations in a network, at router level and device level. A router-level sniffer records all packets sent to and from the network; however, it does not capture traffic sent between computers inside the network. At device level, the sniffer records packets sent to and from the individual computer, which allows capturing traffic between computers inside the network, but loses the capability of capturing traffic sent externally from the network.

Botnets are currently one of the most serious network threats and are growing heavily in use [34, 1]. A botnet is a group of compromised connected devices controlled by a malicious host (called the botmaster) in order to perform some sort of network attack. Because a botnet attack leverages previously benign devices which have been compromised, botnets can be difficult to detect and prevent. Beigi et al. [1] show that it is easy for these types of intrusions to circumvent traditional statistical features used in pattern recognition systems by sending junk traffic, meaningless traffic that is meant to distract systems from intrusions. Therefore, modern machine learning approaches which do not rely on pre-set statistical patterns are more suitable and are increasingly used in IDSs, which often leverage different subsystems with specific purposes [26]. This could be a combination of anomaly detection methods for detecting unknown network behaviour, such as previously unseen attacks, and a supervised classification model looking for known patterns of malicious behaviour [9]. Supervised models, for example, are often used in anti-malware software which has pre-installed intrusion detection patterns.

2.1.1 Anomaly detection

Anomaly detection is the process of finding outliers: data points that are unusual and unexpected [8, p. 7-9]. To discover these abnormalities, one must first have a sense of what is normal in order to determine what is not. Normal behaviour is expected and can therefore be predicted, even though it may be highly complex. The purpose of an anomaly detection system is to discover such anomalies. Natural extensions include alert systems that inform about discovered anomalies, or systems that handle anomalies and take appropriate action. In the context of network security, anomaly detection can be used to detect fraudulent network behaviour, targeted attacks or other types of unusual network events. It is then up to some other system, or a network administrator, to act upon the anomalous event. Figure 1 illustrates a typical IDS where an anomaly detection model is used in conjunction with a classifier and both models are fed the input data. The classifier tries to match the input data to previously known attacks whilst the anomaly detector determines whether the data is considered normal, allowing it to find previously unknown anomalies.

Figure 1: Example IDS setup where input data is fed to both an anomaly detector and a classifier, which in turn report to an alarm system. The anomaly detector is responsible for detecting new, unknown patterns while the classifier tries to find patterns similar to previously known ones.

2.2 NetFlow

NetFlow is a network traffic logging standard developed by Cisco [7, p. 30-33] which is now used by multiple network hardware manufacturers. It is generally generated by network routers and switches, or by converting Packet capture (PCAP) data to NetFlow. Originally intended for billing purposes, it has also found large traction in network analysis as it contains compact network information. Since being developed and first released during the nineties [19, p. 10-12], NetFlow has undergone many revisions, with version five being the most commonly used standard and version nine being the latest. However, as version five does not support Internet Protocol version 6 (IPv6), version nine is becoming increasingly useful. The latest version also allows for a modular set of features to be logged.

The central part of NetFlow data is a flow, which is an aggregation of a sequence of packets within a narrow timeframe that have the same source address, source port, destination address, destination port and protocol. Along with these features, a flow at its core also contains the total packets sent, total bytes sent, start time and end time. Furthermore, version nine can be modified so that more parameters are logged. Thus, flow data provides a compact view of network traffic within timeframes. A flow can come in two forms: unidirectional and bidirectional. Unidirectional flows are aggregated as described above. Bidirectional flows are further aggregated, where reverse pairs of flows, in which the source IP address and port become the destination IP address and port (and vice versa), are combined into a single flow. This effectively results in half as many flows in a bidirectional dataset compared to a unidirectional one. Using bidirectional flows circumvents the problem of differentiating between a network's client and server; however, due to the aggregation, information is also lost [10].
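
To make the five-tuple aggregation concrete, the sketch below groups packet records into unidirectional flows; the column names and values are assumptions for illustration, not part of the NetFlow standard or the thesis's code.

```python
# Illustrative sketch: aggregating packet records into unidirectional
# flows keyed by the NetFlow five-tuple, using pandas.
import pandas as pd

packets = pd.DataFrame({
    "src_ip":   ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "src_port": [5050, 5050, 80],
    "dst_ip":   ["10.0.0.2", "10.0.0.2", "10.0.0.1"],
    "dst_port": [80, 80, 5050],
    "proto":    ["tcp", "tcp", "tcp"],
    "bytes":    [120, 480, 600],
    "time":     [0.00, 0.05, 0.06],
})

flows = packets.groupby(["src_ip", "src_port", "dst_ip", "dst_port", "proto"]).agg(
    total_packets=("bytes", "size"),   # number of packets in the flow
    total_bytes=("bytes", "sum"),      # total bytes in the flow
    start_time=("time", "min"),
    end_time=("time", "max"),
).reset_index()
```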

A flow does not contain any of the packet payload normally present in PCAP. This can be advantageous, since a lot of traffic is encrypted nowadays, which renders the packet payload useless regardless [26, 1]. Similarly, logging regular PCAP data can be a privacy concern due to sensitive information in non-encrypted packet payloads, which flow data circumvents, although the privacy concern regarding IP address communication remains. For network forensics and analysis purposes, packet payloads can contain valuable information which is lost when converted to flow format. However, PCAP logs often grow too large and become difficult to parse due to overwhelming amounts of information [20]. Since a flow is a compact aggregation of PCAP data, it is much smaller and easier to store and handle. Other forms of aggregation often remove too much important traffic information in favour of smaller log files. NetFlow, on the other hand, provides a good compromise: a representation that is compact yet detailed enough for meaningful analysis.

Bilge et al. [3] found that the two (rather contradictory) major difficulties with botnet detection are the lack of big and rich datasets and the fact that these mostly appear as PCAP logs, which provide too much information to be easily processed. Instead, the authors argue that flow datasets should be used, which can be big and rich enough yet are easily processable due to their lack of packet payload.


2.2.1 Publicly available NetFlow datasets

There are many flow-based datasets publicly available which contain both regular network traffic and malicious traffic, such as traffic caused by botnets; however, many of them contain severe flaws which render them ineffective for network analysis [20]. Often, the datasets contain obsolete anomaly types, such as attacks that are no longer relevant, and the legitimate (regular) traffic is often missing or not representative of that of real networks. Similarly, the datasets are often outdated and no longer resemble modern network traffic. Another issue is that many of the datasets used in research papers are not disclosed due to privacy concerns and the difficulty of labeling them. Labeling a dataset requires expert knowledge and even then is not always reliable. To determine a good dataset, the authors define some general criteria that must be met, namely that the dataset should be recent, labeled, rich, long and real:

Recent: The value of a dataset decreases over time as network behaviour evolves and increases in complexity. Therefore, a dataset must be recent in order to be representative of real networks.

Labeled: All points in a dataset must be labeled, as detailed as possible, to provide a description for each flow.

Rich: A dataset must be rich, with many different types of network traffic, and correlated such that malware and regular traffic are captured at the same time, rather than injecting one or the other afterwards to enrich the dataset.

Long: A dataset must span enough time that meaningful analysis can be applied.

Real: To be as close to a real network environment as possible, a dataset should be formed from a real production or laboratory network. The opposite is a synthetic dataset, which contains simulated or generated traffic. The downside of using a synthetic dataset is that it is difficult to ensure the realism of the network's topology.

An example of a dataset that fulfils these criteria is the CTU-13 dataset provided by Stratosphere IPS, captured at CTU University, Czech Republic [20, 10]. The dataset is split into thirteen different scenarios, each having traffic from botnet attacks mixed with normal network traffic, in bidirectional format. The datasets were captured at router level in a laboratory environment with malware running on a subnetwork of infected hosts. A downside of these datasets is that all traffic from the botnets is labelled as hostile, even though some of it might be harmless. The CIDDS dataset is another dataset, purposefully modelled after the above criteria [31], which is similar to the CTU-13 dataset.

2.3 Neural networks

A neural network is a type of machine learning model loosely modeled after the neurons in the brain [6, p. 8-19]. A neural network consists of multiple stacked layers, and these layers contain the neurons, often called nodes. A node is essentially a function with a set of parameters, often called weights [25, p. 565-566]. When training a neural network, it is the nodes' parameters that are updated, according to some algorithm, in order to make better predictions. The layers of the model can be separated into three types: input layer, output layer and hidden layers. The input layer is the entry point of the network, essentially where data enters the model. The output layer is the final layer of the model; its output constitutes the model's predictions. Between the input and output layers are the hidden layers; a neural network with many hidden layers is often called a deep neural network. Data flows through these layers, from the input layer through the hidden layer(s) and finally to the output layer, each layer applying its own set of functions to the data before passing it to the next. Neural networks come in many forms of varying complexity, and a network with many hidden layers or neurons can often produce more accurate results at the cost of more parameters to update. This can make the model more difficult to optimize for a certain task, or even cause the network to over-optimize, essentially memorizing its input, rendering it useless for making predictions on previously unseen data. On the other hand, a shallow neural network with few layers or nodes often produces worse predictions. It can therefore be difficult to find a network architecture which makes the best possible predictions while avoiding memorizing its input. A visual illustration of a neural network can be seen in Figure 2.

2.4 Word2Vec

Word2Vec is a type of shallow neural network with one hidden layer, originally developed by Google, that produces word embeddings for text data. These embeddings can improve the quality of NLP algorithms by capturing meaningful semantic relationships in dense vector representations [22, 23, 32].


Figure 2: Neural network example illustration. Input data is passed into the input layer, through each of the hidden layers and finally the output layer to produce a prediction. Each layer contains nodes which apply functions on the data before passing it to the next layer.

For a neural network to process text data, the data must first be transformed into numeric vectors. This is typically done using tokenization, where characters, words or sentences are mapped to numerical tokens (scalar values) [6, p. 178-195]. Traditionally, one-hot encoding (illustrated in Figure 3) is a common approach for turning text tokens into vectors, where each token in the text is associated with a corresponding index in the vector. When tokenizing at word level, this means that each word-token's vector has a "1" at its corresponding index and a "0" at every other index. This approach does not retain any relation between sentences or words and produces very large and sparse vectors. Word2Vec, on the other hand, produces word embeddings, which are N-dimensional vector representations of words, where N is the number of nodes in the network's hidden layer [22]. The network is typically trained on a task such as predicting a word given a sentence, or vice versa. After training, the actual network is discarded; instead, the weights of the network's hidden layer are extracted. These weights yield the word embeddings, or word vectors, where each word has a corresponding vector. Since the network is trained to predict a surrounding sentence or a word in a sentence, with a large enough corpus similar words will likely appear in similar contexts. The hope of this approach is that similar words will have word vectors close to each other, creating compact representations of words. Compared to sparse one-hot encoded vectors, where each word adds a dimension, word embeddings have much fewer dimensions with continuous values. This results in a much denser and feature-rich representation, where the quality of the word embeddings generally increases with the number of dimensions [23, 32]. A toy example of this rich representation can be seen in Figure 4.


Figure 3: Illustration of using one-hot encoding on a dataset with one feature which has six different classes. As can be seen, this results in a new dataset with six different features.

Figure 4: Simple visualization of a three-dimensional word embedding, showing the relationship between genders and sibling type.
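
A minimal sketch of the one-hot encoding of Figure 3 is given below, for a single categorical feature with six classes; the token values are illustrative, not from the thesis.

```python
# One-hot encoding: a categorical feature with six classes becomes
# six binary features, with a single "1" per row and "0" elsewhere.
import numpy as np

tokens = ["tcp", "udp", "icmp", "gre", "igmp", "sctp"]
vocab = sorted(set(tokens))                       # six distinct classes
index = {word: i for i, word in enumerate(vocab)}

one_hot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, word in enumerate(tokens):
    one_hot[row, index[word]] = 1
```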


2.4.1 Skip Gram model

The Skip Gram model is a type of Word2Vec model where the network is trained, given a word, to predict its context: the surrounding K words, where K is set manually [22, 23, 32]. An extension of this technique is to use negative sampling, which results in much more computationally efficient weight updates for the neural network whilst still maintaining high performance. In this version, the input to the model is both a word and a context, and the model is tasked with predicting whether the word belongs to the given context. With this technique, words and contexts are sampled based on their frequency in the given dataset, where positive samples are defined such that the word belongs to the context and negative samples are words out of context, i.e. random words from the corpus. A comparison of the regular Skip Gram model and the Skip Gram model with negative sampling can be seen in Figure 5.

Figure 5: The left part of this figure illustrates the normal Skip Gram Word2Vec model, which takes as input a word w(t) and tries to predict the surrounding 2n context words. The right part illustrates the Skip Gram model with negative sampling. This model takes as input a word w(t) and a context word w(c) and tries to predict whether w(t) belongs to the same context as w(c).
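
For concreteness, such a model can be trained with the gensim library (an assumed tool, not named in the thesis); the corpus, window size and embedding dimensionality below are purely illustrative.

```python
# Hedged sketch: a Skip Gram Word2Vec model with negative sampling.
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "brown", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=16,   # dimensionality N of the word embeddings
    window=2,         # context size K
    sg=1,             # use the Skip Gram architecture
    negative=5,       # negative samples per positive pair
    min_count=1,
)
vector = model.wv["fox"]  # the 16-dimensional embedding for "fox"
```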

2.5 Autoencoders

An Autoencoder is a type of unsupervised neural network that is trained to copy its input data to its output, i.e. reconstruct its input [11, 25, p. 493-516, p. 1004-1007]. The model consists of two distinct parts, an encoding part and a decoding part, as illustrated in Figure 6. The encoding part consists of the first few layers of the network, which typically encode the input to a lower dimension. The latter part of the network is the decoding part, which decodes the previously encoded input back to its original size. In its most basic form, an Autoencoder has only one hidden layer with fewer nodes than input features. This hidden layer encodes the input to a lower dimension and then passes the encodings to an output layer which projects them back to the original input size. However, the encoding can also be done in several layers, for example reducing the number of dimensions successively. Similarly, the decoding part can be done in multiple layers, increasing the number of dimensions in each layer. These types of techniques can allow for more accurate reconstructions of complex data [11, p. 493-516]. Because Autoencoders most often have a bottleneck encoding layer, they cannot learn to completely copy the input data and provide perfect reconstructions. Instead, they have to learn distinct feature representations of the input in order to make accurate reconstructions.

Figure 6: Example Autoencoder architecture which takes as input a k-dimensional vector, encodes it through n layers, decodes it through m layers and finally produces a k-dimensional vector as output.

Similarly to Word2Vec, after training an Autoencoder the actual model can be thrown away and the encoding layers extracted in order to obtain an encoded representation of the input data for dimensionality reduction purposes [11, p. 493-516]. Another use case is to calculate the difference between the original input and the reconstructed output, resulting in a reconstruction error value. The larger the difference between the input and the output, the worse the reconstruction. This reconstruction error can be used in anomaly detection systems, where an Autoencoder is trained to reconstruct normal input data. The assumption for such a system is that anomalous data points should be much more difficult to reconstruct, resulting in a large reconstruction error [8, p. 30]. By setting a threshold value for the reconstruction error, anomalies can be defined as any data points whose reconstruction error is larger than the threshold.
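
A minimal sketch of this thresholding idea follows, using per-sample MSE as the error; the data and the stand-in reconstruction `x_hat` are synthetic placeholders, not the thesis's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_error(x, x_hat):
    # Per-sample mean squared error between input and reconstruction.
    return np.mean((x - x_hat) ** 2, axis=1)

# x_hat stands in for an Autoencoder's output: the first four flows are
# reconstructed almost perfectly, the last one deliberately poorly.
x = rng.normal(size=(5, 3))
noise_scale = np.array([[0.01], [0.01], [0.01], [0.01], [1.0]])
x_hat = x + rng.normal(scale=noise_scale, size=(5, 3))

errors = reconstruction_error(x, x_hat)
threshold = errors[:4].max()    # e.g. the largest error seen on normal data
print(errors > threshold)      # only the last (anomalous) flow is flagged
```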

Autoencoders have been found successful in areas of anomaly detection where previously supervised approaches were used, such as determining healthy behaviour of supercomputers [4]. For datasets where clean data is not guaranteed, Robust Deep Autoencoders are reportedly suitable [39]. This type of Autoencoder adds a filter layer which filters out inputs that are difficult to reconstruct, so that they do not affect the weight updates and the ability to reconstruct normal inputs.

3 Related work

There are many examples of botnet detection approaches using different techniques, ranging from pattern-based detection and supervised machine learning systems to unsupervised anomaly detection algorithms. Often these systems are part of a larger IDS which leverages several approaches [9]. A supervised system uses the patterns of malicious traffic in order to distinguish it from legitimate traffic. An anomaly detection system, on the other hand, aims to understand the normal patterns of network traffic and alerts on anything that deviates from its perception of normality.

A common difficulty in network analysis is the processing of IP addresses and port numbers. Often, aggregation based on IP addresses is performed in order to compute insightful statistical features. However, when dealing with NetFlow data, which is already an aggregation of PCAP data, valuable information can be lost. Processing IP addresses and extracting meaningful information from them requires expert knowledge, such as determining the hierarchical structure of IP addresses. Another approach is IP geolocation estimation, which converts IP addresses to longitude and latitude coordinates, leveraging techniques such as neural networks [15, 13]. With IP geolocation estimation, distance measurements can be made when comparing IP addresses; however, the problem of parsing port numbers remains. Sections 3.3 and 3.4 describe two other approaches for extracting information from IP addresses and port numbers.

Another difficulty when comparing botnet detection approaches is the discrepancy in metrics and datasets used, which makes it difficult to compare model performance. Commonly used metrics (explained in detail in Section 4.5) include True Positive Rate (TPR) (also known as Recall), False Positive Rate (FPR), Area Under Curve (AUC), Precision, Accuracy and F1 (the harmonic mean of Precision and Recall); however, information is often lost when using a single metric for evaluation. It is therefore important to use a combination of metrics which complement each other. Section 3.1 describes some network intrusion detection approaches using supervised techniques and Section 3.2 describes unsupervised techniques. Some of these approaches were evaluated on undisclosed datasets, which makes it harder to assess the quality of the results.
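
For reference, the point metrics listed above can be computed directly from confusion-matrix counts, as in the following sketch (AUC additionally requires ranking predictions across all thresholds); the counts used are illustrative.

```python
# Evaluation metrics from confusion-matrix counts (TP, FP, TN, FN).
def metrics(tp, fp, tn, fn):
    tpr = tp / (tp + fn)                 # True Positive Rate (Recall)
    fpr = fp / (fp + tn)                 # False Positive Rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * tpr / (precision + tpr)   # harmonic mean
    return {"TPR": tpr, "FPR": fpr, "Precision": precision,
            "Accuracy": accuracy, "F1": f1}

print(metrics(tp=90, fp=10, tn=880, fn=20))
```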

3.1 Network intrusion detection using classification models

Botnet detection is commonly approached in a supervised manner, using machine learning models to detect botnet flows. The results of the approaches mentioned in this section can be seen in Table 1.

Multiple neural network architectures have been used for different forms of network attack detection, ranging from shallow one-layer networks to more complex deep Long short-term memory (LSTM) architectures. An LSTM neural network is a type of network specialized in sequential data, using information from previous timesteps as inputs to the current timestep [11, 25, p. 363-408, p. 570-571]. In a sense, they have a form of memory which allows them to capture time-sensitive information over long sequences [6, 29, p. 196-223, p. 537-549]. Idhammad et al. [14] employed a neural network with a single hidden layer for classifying Denial of Service (DoS) attacks. The model was evaluated on two publicly available datasets: the NSL-KDD dataset, based on the KDD dataset from 1999, which contains four types of DoS attacks and a total of 148,517 samples, and the UNSW-NB15 dataset, generated in 2015, which has nine different attack types and 257,705 samples. Moradi et al. [24] similarly used a neural network with two hidden layers for detecting two different types of attacks on the public DARPA dataset from 1999. Also using the DARPA dataset, Tran et al. [36] used a neural network for real-time network attack detection. With a more complex architecture, an LSTM neural network, Lin et al. [18] detected unauthorized devices on undisclosed corporate datasets.

Decision tree and Random forest models have also been found useful for network attack detection. A Decision tree is a tree-like model which branches into different (often nested) outcomes based on threshold values [25, p. 546-547]. It contains conditional statements based on these values and is therefore easy to interpret, as one can tell exactly why a certain choice was made. A Random forest is an ensemble of Decision trees, which are often small and trained on separate features in order to avoid correlation between trees [25, p. 552-553]. A Random forest uses the collective prediction of its Decision trees to come up with a final prediction. Beigi et al. [1] merged three publicly available datasets into a single synthetic dataset with sixteen different types of botnets, for botnet detection using a Decision tree. Similarly, Bilge et al. [3] used a Random forest for botnet detection on an undisclosed dataset. Also using a Random forest, Stevanovic et al. [34] performed botnet detection on the publicly available ISOP dataset from 2005, which contains three different types of botnets. Li et al. [17] compared SVMs and Decision trees for determining host roles in undisclosed campus networks and found SVMs superior for the task. An SVM is a type of model which constructs a set of hyperplanes in order to separate classes as distinctly as possible by maximizing the distance between the separating hyperplane and the nearest data points [25, p. 498-504]. Najafabadi et al. [26] compared three different machine learning models for SSH brute force attack detection on an undisclosed campus network: Decision trees, K-Nearest Neighbour (K-NN) and Naive Bayes. K-NN is a memory-based model which looks at the K points in its memory dataset that are closest to an input X and classifies X as the most frequent class among those K points [25, p. 16-17]. Naive Bayes, on the other hand, applies Bayes' theorem with the naive assumption of conditional independence between every pair of features given the label, which often does not hold [25, p. 84]. In another paper, Najafabadi et al. [27] compared Decision trees and K-NN for DoS detection on the publicly available SANTA dataset from 2015.

Source                 | Model               | Score                               | Dataset
Idhammad et al. [14]   | Neural network      | 99% TPR                             | NSL-KDD
Idhammad et al. [14]   | Neural network      | 97% TPR                             | UNSW-NB15
Moradi et al. [24]     | Neural network      | 90% Accuracy                        | DARPA
Lin et al. [18]        | LSTM neural network | 0.99 AUC                            | Undisclosed
Tran et al. [36]       | Neural network      | 99.92% TPR, 5.14% FPR               | DARPA
Beigi et al. [1]       | Decision tree       | 68% TPR, 3% FPR                     | Merge of ISOT, ISCX 2012 IDS and CTU-13
Bilge et al. [3]       | Random forest       | 65% TPR, 1% FPR                     | Undisclosed
Stevanovic et al. [34] | Random forest       | 95.73% TPR, 0.9596 F1               | ISOP
Li et al. [17]         | SVM                 | 96% Accuracy                        | Undisclosed
Li et al. [17]         | Decision tree       | 99% Accuracy                        | Undisclosed
Najafabadi et al. [26] | Decision tree       | 0.9965 AUC                          | Undisclosed
Najafabadi et al. [26] | K-NN                | 0.9946 AUC                          | Undisclosed
Najafabadi et al. [26] | Naive Bayes         | 0.9966 AUC                          | Undisclosed
Najafabadi et al. [27] | Decision tree       | 0.9988 AUC, 98.73% TPR, 0.0282% FPR | Undisclosed
Najafabadi et al. [27] | K-NN                | 0.9999 AUC, 98.83% TPR, 0.0316% FPR | Undisclosed

Table 1: This table shows the results of the different supervised intrusion detection approaches described in Section 3.1.

3.2 Network intrusion detection using anomaly detection models

Similarly to supervised approaches, unsupervised models come in various forms. The results of the approaches mentioned in this section can be seen in Table 2.


Mathur et al. [21] used a hybrid approach combining anomaly and classification detection models to determine whether outgoing connections from internal IP addresses were malicious. The approach used both publicly available lists of known malicious IP addresses and an unsupervised similarity model deployed to determine whether new IP addresses were similar to the blacklisted ones. The authors used this combination of the unsupervised model and the blacklist for botnet detection on an undisclosed dataset. Radford et al. [28] utilized an LSTM neural network to predict flow sequences in an unsupervised manner, where poorly predicted sequences were classified as anomalous. This was done using NLP techniques similar to those of Word2Vec, where flows were encoded into tokens that form sentences between computers. The authors used this approach to detect four different network attacks, including botnet attacks, on the publicly available ISCX IDS dataset from 2012. Winter et al. [38] applied a one-class Support Vector Machine (SVM), an unsupervised modification of the traditional SVM that only predicts whether a data point is normal or an outlier. It is trained oppositely to other forms of anomaly detectors, which usually learn the patterns of normal behaviour; instead, the model is trained only on outliers, i.e. malicious flows. The authors evaluated the model on an undisclosed dataset with multiple different attack types. Terzi et al. [35] used K-means, a model which attempts to group data into different clusters based on proximity [25, p. 354], on the 10th scenario of the CTU-13 dataset in order to separate normal flows from botnet flows into two different clusters. Also using the CTU-13 dataset, Wang et al. [37] used a graph-based unsupervised model for botnet detection on five different scenarios of the dataset.

3.3 Flow2Vec

Flow2Vec is a model for applying Word2Vec to network flows, such as NetFlow [12]. It treats individual flows as the words of the Word2Vec algorithm and surrounding flows as their context; thus, a sequence of flows forms a sentence. The approach tokenizes each individual flow so that it is represented as a single numerical token. From these tokens, a Word2Vec Skip Gram model with negative sampling is trained to create continuous vector representations of each flow. With this approach, since each individual flow is tokenized, no relationship between IP addresses or port numbers is preserved. Even if both the IP addresses and port numbers are identical between two flows, if the protocol (or any single one of the features) differs, they will be treated as two completely different flows unless they appear within each other's context (close in time). Using Flow2Vec, one can create vector representations of flows which can then be fed into a machine learning model or other systems that take numerical inputs. However, it is debatable how useful this is, since no relationship between IP addresses or port numbers is preserved.


Source              | Model               | Score                                        | Dataset
Mathur et al. [21]  | Hybrid model        | 92% TPR                                      | Undisclosed
Radford et al. [28] | LSTM neural network | 0.84 AUC                                     | ISCX IDS
Winter et al. [38]  | One-class SVM       | 0% FPR, 98% Accuracy                         | Undisclosed
Terzi et al. [35]   | K-means             | 99.6% Accuracy, 27.6% FPR                    | 10th scenario of CTU-13
Wang et al. [37]    | Graph-model         | 1.7% FPR, 7.7% TPR, 86% Precision, 0.14 F1   | 1st scenario of CTU-13
Wang et al. [37]    | Graph-model         | 2.2% FPR, 4.6% TPR, 80% Precision, 0.088 F1  | 2nd scenario of CTU-13
Wang et al. [37]    | Graph-model         | 11% FPR, 12% TPR, 69% Precision, 0.21 F1     | 6th scenario of CTU-13
Wang et al. [37]    | Graph-model         | 4% FPR, 4.4% TPR, 60% Precision, 0.82 F1     | 8th scenario of CTU-13
Wang et al. [37]    | Graph-model         | 3.4% FPR, 8% TPR, 73% Precision, 0.14 F1     | 9th scenario of CTU-13

Table 2: This table shows the results of the different unsupervised intrusion detection approaches described in Section 3.2.


3.4 IP2Vec

IP2Vec is a model heavily inspired by Word2Vec and Flow2Vec for finding similarities between IP addresses [30]. Its main concept relies on the assumption that IP addresses are similar if they appear in similar contexts, where a context is defined as the other network features in the same flow, such as port numbers and protocol, and the words of Word2Vec are the IP addresses. Just like Word2Vec maps words to continuous vectors, IP2Vec creates continuous vector representations of IP addresses. Traditional distance measures, which have difficulties calculating distances over features with both categorical and continuous values, are thus able to operate solely on continuous values when applied to flow or packet data. Using such distance measurements, IP addresses that are close in the continuous vector space are implied to be similar. As the authors state, multiple IDSs exist which either discard IP addresses, apply domain knowledge or use aggregation methods to extract features from them. IP2Vec is able to extract meaningful representations of IP addresses without aggregation or domain knowledge.
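
As an illustration of such distance measurements, the sketch below compares two embedding vectors with cosine similarity; the 32-dimensional vectors here are random placeholders rather than real IP2Vec output.

```python
# Comparing two IP-address embeddings with cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb = {ip: rng.normal(size=32) for ip in ("147.32.84.165", "147.32.84.191")}
sim = cosine_similarity(emb["147.32.84.165"], emb["147.32.84.191"])
print(sim)  # values near 1 would indicate similar IP addresses
```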

In the paper, Ring et al. [30] train the IP2Vec model on two publicly available datasets, the 9th scenario of the CTU-13 dataset and their own CIDDS-001 dataset, both of which contain botnet attacks. The model is trained on the source and destination IP addresses, source and destination port numbers and protocol of each flow; however, the authors only extract vector representations for the IP addresses. The model was trained for ten epochs with an embedding layer of size 32, resulting in a 32-dimensional vector representation for each IP address. The authors found these continuous vector representations of IP addresses to be superior for botnet detection compared to more traditional statistical methods for detecting anomalous IP addresses. Several advantages and disadvantages of the proposed model were acknowledged, the main advantage being that continuous vector representations of categorical network data allow more data mining and machine learning models to be applied, to more data. A disadvantage, however, is that the behaviour of an IP address can change over time, which could cause problems for the model, and previously unseen IP addresses have no vector representations available. To solve the latter problem, the model could be retrained regularly to account for new IP addresses, or some default vector representation, such as the mean of all known IP addresses' vectors, could be applied.

4 Methods

This section describes the methods used for tackling the problem described in Section 1.1. This involves a process with multiple steps, including data selection and feature extraction, followed by the proposed machine learning models used for anomaly detection and the evaluation criteria for interpreting the results.

4.1 Data selection

The datasets chosen for training and evaluating the models are scenarios 2 through 13 of the CTU-13 dataset; scenario 1 is excluded because it was not available. As described in Section 2.2.1, these are publicly available NetFlow datasets from real networks, containing both normal network traffic and botnet traffic, which were found to (unlike many others) fulfill the criteria of a good network dataset by Malowidzki et al. [20]. For the purpose of training and evaluation, the botnet traffic is treated as anomalous and everything else as normal, as done by the different approaches in Section 3.2. Table 3 shows the distribution of normal and bot flows in each scenario.

Scenario | Total flows | Normal flows     | Bot flows
2        | 1808122     | 1787181 (98.84%) | 20941 (1.16%)
3        | 4710638     | 4683816 (99.43%) | 26822 (0.57%)
4        | 1121076     | 1118496 (99.77%) | 2580 (0.23%)
5        | 129832      | 128931 (99.31%)  | 901 (0.69%)
6        | 558919      | 554289 (99.17%)  | 4630 (0.83%)
7        | 114077      | 114014 (99.94%)  | 63 (0.06%)
8        | 2954230     | 2948103 (99.79%) | 6127 (0.21%)
9        | 2087508     | 1902521 (91.14%) | 184987 (8.86%)
10       | 1309791     | 1203439 (91.88%) | 106352 (8.12%)
11       | 107251      | 99087 (92.39%)   | 8164 (7.61%)
12       | 325471      | 323303 (99.33%)  | 2168 (0.67%)
13       | 1925149     | 1885146 (97.92%) | 40003 (2.08%)

Table 3: Scenarios from the CTU-13 dataset used for running the experiments.

4.2 Feature extraction

From each dataset, seven features are extracted for each flow, namely flow duration, total packets, total bytes, bytes per second, bytes per packet, packets per second and time since last flow, described below; a short code sketch after the definitions illustrates how they can be computed.

Flow duration: The total time between when the first packet within a flow was sent and when the last packet was received, measured in seconds.

Total packets: The total number of packets sent bidirectionally within a flow.

Total bytes: The total number of bytes sent bidirectionally within a flow.

Bytes per packet: The total number of bytes divided by the total number of packets.

Bytes per second: The total number of bytes divided by the flow duration.

Packets per second: The total number of packets divided by the flow duration.

Time since last flow: The duration between when the previous flow was last seen and when the current flow was first seen.
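
The sketch below illustrates how these features can be derived from basic flow fields; the column names are assumptions for illustration, not the thesis's code.

```python
# Deriving the seven per-flow features from basic flow fields.
import pandas as pd

flows = pd.DataFrame({
    "start":   [0.0, 2.5, 3.1],   # first packet seen (seconds)
    "end":     [1.2, 2.9, 4.0],   # last packet seen (seconds)
    "packets": [10, 4, 7],
    "bytes":   [1200, 240, 2100],
})

flows["duration"] = flows["end"] - flows["start"]
flows["bytes_per_packet"] = flows["bytes"] / flows["packets"]
flows["bytes_per_second"] = flows["bytes"] / flows["duration"]
flows["packets_per_second"] = flows["packets"] / flows["duration"]
# Time between when the previous flow was last seen and this flow started;
# defined here as 0 for the first flow.
flows["time_since_last_flow"] = (flows["start"] - flows["end"].shift()).fillna(0.0)
```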

The features are also standardized using

$$x_i \leftarrow \frac{x_i - u}{s}, \qquad i = 1, \ldots, n$$

where n is the number of flows, x_i is the ith flow's value of feature x, u is the mean value of feature x, and s is the standard deviation of feature x. This is done so that the features cause reconstruction differences of similar magnitude when comparing the inputs and outputs of the Autoencoder model.
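
A minimal sketch of this standardization, equivalent to the formula above:

```python
# Z-score standardization: each feature column is rescaled to zero mean
# and unit standard deviation.
import numpy as np

def standardize(X):
    # X has shape (n_flows, n_features).
    u = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - u) / s

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_std = standardize(X)   # each column now has mean 0 and std 1
```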


4.3 System structure

The proposed anomaly detection system consists of several parts, which can broadly be separated into a preprocessing part and an anomaly detector part. The first part, used for preprocessing, is a Word2Vec model trained on features similar to those of IP2Vec, as described in Section 3.4. This provides vector representations of IP addresses, port numbers and protocols to be used as input features for a second model, an approach which Ring et al. [30] found superior to approaches that instead perform some form of aggregation based on these features, or discard them completely. It also avoids the need for expert knowledge, which could be network specific, to extract meaningful information from these features.

The embeddings produced by the Word2Vec model, in conjunction with the features extracted as in Section 4.2, are then fed into an anomaly detector. Due to the complexity of the problem, with almost a hundred features for each flow, an Autoencoder model is used as the anomaly detector, which Ted Dunning [8, p. 30] claims to be suitable for such difficult tasks. The Autoencoder is fed the word vectors and extracted features as input and attempts to reconstruct each flow. The assumption is that an anomalous botnet flow should be much more difficult for the model to reconstruct than a normal flow. This allows a one-dimensional threshold to determine whether a flow is to be considered anomalous, where a large error (a big difference between input and reconstruction) is considered anomalous. A visualization of the system structure can be seen in Figure 7.

Figure 7: This figure illustrates the general system structure of the proposed solution. It takes NetFlow data as input; features are extracted and a Word2Vec model is then trained to provide vector representations of the IP addresses, ports and protocols. These features and vector representations are then fed into an Autoencoder which attempts to reconstruct its input.


Figure 8: (a) shows the Word2Vec model used, which has an embedding layer of size 16. (b) shows the Autoencoder model used, which has input and output layers of size 87; the five hidden layers have sizes 64, 32, 16, 32 and 64 respectively.

4.3.1 Word2Vec model

The chosen Word2Vec model is a Skip-Gram version with negative sampling, due to its high performance and computational efficiency as described in Section 2.4. As previously mentioned, this model takes a word and a word context as input and predicts whether the given word belongs to the given context; an illustration can be seen in Figure 8a. Before training the model, the flow data is preprocessed similarly to IP2Vec: from each flow, the source and destination IP addresses, the source and destination port numbers and the protocol are extracted. These are then tokenized into numerical tokens so that they can be fed into a machine learning model. Unlike the original Word2Vec model, where the context window can be chosen arbitrarily (often as the length of a sentence) [22, 23], each flow forms its own context and can be interpreted as a sentence. From the tokenized flow features, word-context pairs are generated for training the Word2Vec model. The word is one of the extracted flow features and the context is another feature from the dataset. Each pair has a label which indicates whether the word belongs to the pair's context or whether it is a negative sample. Thus, the word-context pair generation process involves creating true samples for each combination of features within a given flow, as well as creating twenty-five negative samples for each true pair, where a negative sample is one in which the word and context are taken from two separate flows. The full process, from flow to word embeddings, can be seen in Figure 8a.
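A simplified sketch of this pair generation is shown below. The flow values are invented for illustration, and in practice the string tokens would first be mapped to integer ids and the sampling would be vectorized; this is an illustrative sketch, not the thesis code:

```python
import random

# each flow acts as its own "sentence" of five tokens:
# [src_ip, dst_ip, src_port, dst_port, protocol]
flows = [
    ["192.168.0.2", "10.0.0.5", "51432", "443", "TCP"],
    ["192.168.0.3", "10.0.0.9", "40021", "53", "UDP"],
    ["192.168.0.2", "10.0.0.9", "51433", "443", "TCP"],
]

NEGATIVES_PER_PAIR = 25  # negative samples per true pair, as described above

def make_pairs(flows):
    pairs = []  # (word, context, label) triples
    for flow in flows:
        others = [f for f in flows if f is not flow]
        for i, word in enumerate(flow):
            for j, context in enumerate(flow):
                if i == j:
                    continue
                pairs.append((word, context, 1))  # true pair: same flow
                for _ in range(NEGATIVES_PER_PAIR):
                    # negative pair: context drawn from a different flow
                    pairs.append((word, random.choice(random.choice(others)), 0))
    return pairs

training_pairs = make_pairs(flows)
```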

Finally, after generating word-context pairs, the model can be trained. Afterwards, the model's predictions are discarded; instead, the embeddings from the model's hidden layer are extracted. An embedding size of sixteen dimensions is used, half of what was used for IP2Vec [30].


This could result in a less rich embedding representation. However, unlike in IP2Vec, the port and protocol embeddings are not discarded, which means that a richer representation of each flow can be extracted, even though each individual embedding vector is smaller. These embeddings provide a word-to-vector mapping, giving each word a sixteen-dimensional numerical vector representation. For each flow, the five extracted features can each be replaced by a sixteen-dimensional vector, and these vectors are used as numerical features for the Autoencoder model.

4.3.2 Autoencoder model

Combining the features extracted from each flow, as described in Section 4.2, with the vector representations of the IP addresses, port numbers and protocols from the Word2Vec model results in a total of 87 features, which are used as input for the Autoencoder model. The Autoencoder itself is a neural network, as described in Section 2.5, with five hidden layers. The first three hidden layers form the encoding part of the network, which encodes the input vectors into a lower-dimensional space whilst trying to maintain the same feature representation. This encoding is done successively, with the encoding layers having 64, 32 and 16 nodes respectively. The final two hidden layers form the decoding part of the network, responsible for decoding the input from its lower-dimensional transformation back to its original size. This is done similarly to the encoding part, with the decoding layers having 32 and 64 nodes respectively. The final output layer has 87 nodes, the same number as the input layer. A visualization of the model's architecture can be seen in Figure 8b. The model is trained by minimizing the Mean Squared Error (MSE), defined in Equation 1, when reconstructing its input. In this equation, $n$ is the number of input features for the model, $Y_i$ is the ith input feature and $\hat{Y}_i$ is the Autoencoder's reconstruction of the ith input feature. A small MSE value indicates a reconstruction that closely resembles the input.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 \qquad (1)$$
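A minimal sketch of this architecture is given below, written with Keras. The thesis does not state the framework, activation functions or optimizer at this point, so the ReLU activations and the Adam optimizer here are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 87  # 7 statistical features + 5 embeddings of 16 dimensions each

inputs = keras.Input(shape=(N_FEATURES,))
x = layers.Dense(64, activation="relu")(inputs)    # encoder
x = layers.Dense(32, activation="relu")(x)
encoded = layers.Dense(16, activation="relu")(x)   # bottleneck
x = layers.Dense(32, activation="relu")(encoded)   # decoder
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(N_FEATURES)(x)              # linear output, same size as input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")  # minimizes Equation 1
```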

4.4 Experiments

Each of the scenarios listed in Table 3 is run through the same experiment, described below.

1. Features are extracted from each flow as described in Section 4.2.


2. A Skip-Gram Word2Vec model with negative sampling is trained as described in Section 4.3.1.

3. For each flow, the IP addresses, port numbers and protocol are replaced by the embedding vector representations from the Word2Vec model.

4. The botnet flows are separated from the normal flows into two datasets: dataset X, containing the normal flows, and dataset Xbot, containing the botnet flows.

5. Dataset X is shuffled and split into two parts: one with 90% of the data, called Xtrain, and one with the remaining 10%, called Xtest. The combined number of flows for Xtest and Xbot can be seen in Table 4.

6. An Autoencoder model is trained on Xtrain as described in Section 4.3.2.

7. After training, the Autoencoder is used to make predictions (reconstructions) on Xtest, resulting in dataset Xtest_pred; the same is done for Xbot, resulting in Xbot_pred.

8. Finally, the reconstruction error is computed between Xtest and Xtest_pred, as well as between Xbot and Xbot_pred, for comparison (a short code sketch of these two steps is given at the end of this section).

As can be seen in step 6 above, the Autoencoders are trained only on the normal flows, not on botnet flows. As described in Section 2.5, this allows the Autoencoder to learn to reconstruct normal input data well, without being affected by outliers, which should be difficult to reconstruct. To verify that the model does not simply memorize the flows it was trained on, the normal flows are separated into a train and a test set, Xtrain and Xtest, where the training set is used for training the model and the test set is used for evaluating the model together with the botnet flows, Xbot.
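Steps 7 and 8 amount to computing a per-flow MSE. The following is a minimal sketch, assuming the Keras model sketched in Section 4.3.2 and NumPy arrays Xtest and Xbot; the variable names mirror the datasets above but are otherwise illustrative:

```python
import numpy as np

def reconstruction_error(model, X):
    """Per-flow MSE over the 87 features (Equation 1 applied row-wise)."""
    X_pred = model.predict(X)  # Xtest_pred / Xbot_pred in the list above
    return np.mean((X - X_pred) ** 2, axis=1)

err_test = reconstruction_error(autoencoder, Xtest)
err_bot = reconstruction_error(autoencoder, Xbot)
```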

4.5 Evaluation

For evaluating the system, three main methods are used: the Precision-Recall curve, the AUC score and a visualization of the reconstruction error histograms of the test and botnet data.

The Precision-Recall curve illustrates the Precision and Recall scores for different thresholds [25, pp. 184-185]. In this setting, the threshold is based on the MSE values. Precision can be thought of as the percentage of flows marked as anomalous that actually are anomalies; a high score indicates that most of the marked anomalies are indeed anomalies.


Scenario | Total flows | Normal flows    | Bot flows
---------|-------------|-----------------|----------------
2        | 199660      | 178719 (89.51%) | 20941 (10.49%)
3        | 495207      | 468385 (94.58%) | 26822 (5.42%)
4        | 114430      | 111850 (97.75%) | 2580 (2.25%)
5        | 13795       | 12894 (93.47%)  | 901 (6.53%)
6        | 60059       | 55429 (92.29%)  | 4630 (7.71%)
7        | 11465       | 11402 (99.45%)  | 63 (0.55%)
8        | 300939      | 294812 (97.96%) | 6127 (2.04%)
9        | 375242      | 190255 (50.70%) | 184987 (49.30%)
10       | 226697      | 120345 (53.09%) | 106352 (46.91%)
11       | 18073       | 9909 (54.83%)   | 8164 (45.17%)
12       | 34499       | 32331 (93.72%)  | 2168 (6.28%)
13       | 228520      | 188517 (82.49%) | 40003 (17.51%)

Table 4: Flow count for each scenario's evaluation dataset, a combination of Xtest and Xbot.

The score says nothing about anomalies that are marked as normal; anomalies can therefore go unnoticed by the model while it still produces a high Precision score. It is calculated using the True Positives (TPs) and False Positives (FPs) of the system's predictions. In this case, TP is the number of anomalous data points correctly marked as anomalous by the system, and FP is the number of normal data points wrongly marked as anomalous. How the Precision score is calculated can be seen in Equation 2. Recall, or TPR, is complementary to Precision and tells how large a percentage of the actual anomalies are marked as anomalies. As such, a high score indicates that a large part of the anomalies are correctly identified; however, it does not take into account how many of the marked anomalies are in fact normal. Both Tran et al. [36] and Wang et al. [37] note that for botnet detection, Recall is a much more important metric than Precision, since botnet flows may cause serious harm if they go unnoticed, whilst false alarms (measured by Precision) are more tolerable. Recall is calculated using TPs and False Negatives (FNs), where FN is the number of anomalous data points incorrectly marked as normal by the system. Equation 3 shows how Recall is calculated. Looking at the Precision-Recall curve gives insight into where the best placement of a threshold would have been in order to achieve certain Precision and Recall scores.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\mathrm{Recall\ (TPR)} = \frac{TP}{TP + FN} \qquad (3)$$
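Given the per-flow reconstruction errors, the Precision-Recall curve over all thresholds can be computed with scikit-learn. A sketch, assuming the err_test and err_bot arrays from the experiment steps in Section 4.4:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# label normal test flows 0 and botnet flows 1;
# the reconstruction error serves as the anomaly score
y_true = np.concatenate([np.zeros(len(err_test)), np.ones(len(err_bot))])
scores = np.concatenate([err_test, err_bot])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
```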


Similarly to Precision-Recall curves, the AUC score is based on computing scores for different thresholds of a value. In the context of machine learning, AUC most commonly refers to the Area Under the Receiver Operating Characteristic (ROC) curve [25, pp. 183-184]. The ROC curve illustrates the tradeoff between TPR and FPR for a set of thresholds, where FPR can intuitively be thought of as how large a percentage of all the normal flows are mistakenly marked as anomalous by the system. It is therefore desirable to have as low an FPR as possible; it is calculated using the FPs and True Negatives (TNs), where TN is the number of normal data points correctly marked as normal by the system.

The calculation for computing FPR can be seen in Equation 4. AUC is used to summarize the ROC curve into a single score between 0 and 1. A higher score is better and indicates a low tradeoff between the TPR and the FPR; however, the score is not well suited for imbalanced datasets, where one class label occurs much more frequently than another. In such cases, the Precision-Recall curve gives a more representative view of how well a model predicts the different classes.

$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (4)$$
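The ROC curve and its AUC can be computed the same way. A sketch using the y_true and scores arrays from the Precision-Recall sketch above:

```python
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, roc_thresholds = roc_curve(y_true, scores)  # FPR/TPR per threshold
auc_score = roc_auc_score(y_true, scores)             # area under the ROC curve
```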


5 Results

This section describes the results from the experiments described in Section 4.4 for each scenario in Table 4, based on the evaluation criteria listed in Section 4.5, namely:

1. Reconstruction error visualization using histograms.

2. Precision-Recall curves for moving thresholds for botnet detection.

3. AUC scores.

These results are based on the Xtest and Xbot datasets for each CTU-13 scenario used, as explained in Section 4.4.

5.1 Flow reconstruction error

Figures 9 and 10 visualize the reconstruction error histograms for each individual dataset. To make the histograms easier to interpret, and to show the difference in reconstruction error between the regular flows Xtest and the botnet flows Xbot, the histograms are normalized with regard to height, such that the botnet flows and regular flows reach equal height. The actual distribution of flows for each scenario can be seen in Table 4. For scenarios 9-11, the distribution is almost equal between botnet flows and normal flows, meaning that for these scenarios the histogram plots closely resemble reality; for the other scenarios, the normal flows greatly outnumber the botnet flows. As can be seen in Figure 9, botnet and normal flows seem to be easily separable in Scenarios 2, 3, 4 and 6, whereas Scenarios 5 and 7 are not as easily separable. Similarly, all Scenarios (8-13) in Figure 10 seem to be easily separable by reconstruction error. There does not seem to be a single value that can be used as a threshold for perfectly separating the botnet flows from the normal flows across all scenarios. For scenarios 3, 4, 6 and 10, a threshold between 0.15 and 0.2 seems possible as a separator; 0.1 could work for scenarios 8, 9 and 10, and possibly 0.3 for scenarios 11 and 12. Scenarios 2, 5 and 7 all have very low reconstruction errors compared to the other scenarios, and a threshold that successfully separates botnet flows from normal flows in these scenarios would, judging from the histograms, have to be placed somewhere between 0.04 and 0.075.

These visualizations can provide insight into whether there is a possibility of distinguishing between botnet and normal flows using the reconstruction error metric; however, to properly assess this possibility, it is useful to look at other metrics as well.


Figure 9: Histogram plots of the reconstruction error (MSE) for Scenarios 2 through 7 (panels (a)-(f)). The blue areas show the histogram of the regular flows and the red areas show the histogram of the botnet flows.


Figure 10: Histogram plots of the reconstruction error (MSE) for Scenarios 8 through 13 (panels (a)-(f)). The blue areas show the histogram of the regular flows and the red areas show the histogram of the botnet flows.


5.2 Precision-Recall curves

Figures 11 and 12 visualize the Precision-Recall curves for each individual scenario. These figures illustrate the Precision and Recall scores for varying thresholds of the reconstruction error. Such a threshold can be thought of as a line that separates the histograms in Figures 9 and 10. However, the scores are calculated without normalizing the scenarios, resulting in more representative visualizations. The best performance can be seen in scenarios 6 and 9-13, which are all able to maintain a Precision score of 90-100% with a similar Recall score, resulting in almost no tradeoff between finding all botnet flows and avoiding false alarms (normal flows marked as anomalous). This means that a threshold could be placed such that it separates the normal flows from the botnet flows with very few FPs and FNs and a large number of TPs. This is comparable to the scores presented by other approaches in Tables 1 and 2, and much better than what Wang et al. [37] achieved on scenarios 1-2, 6 and 8-9. However, it is important to note that the Precision-Recall curves consider multiple threshold values, where the approaches in Section 3 do not (except when presenting AUC scores). The approaches in Section 3 can be thought of as evaluating their systems at a single threshold, while the Precision-Recall curves in Figures 11 and 12 evaluate model potential across multiple thresholds, making a direct comparison unfair.

Scenarios 2-4 each produce decent results, showing no real tradeoff between Precision and Recall; however, they do not reach as high Precision scores, which would result in a lot of false alarms. Scenarios 5 and 8 produce much worse Precision scores, indicating more false alarms than true alarms. Finally, the approach produces the worst results on scenario 7, meaning that from the reconstruction error metric alone, botnet flows and normal flows are inseparable. In order to set a threshold which correctly marks the botnet flows as anomalous for this scenario, one would have to accept an FPR of nearly 100%.

5.3 AUC scores

Figure 13 shows the resulting AUC scores for each scenario. As can be seen, almost all of them have an AUC score of roughly 0.98, comparable to the AUC scores achieved by the supervised methods in Table 1. Worse scores are achieved for scenario 5, with a score of 0.91, and scenario 7, with 0.83. This indicates that for most of the scenarios there is a low tradeoff between FPR and TPR (Recall). However, as stated in Section 4.5, the score is not suited for imbalanced datasets, which means that for scenarios 2-8 and 12-13 the Precision-Recall curves are more representative of model performance.


Figure 11: Precision-Recall curve plots for Scenarios 2 through 7 (panels (a)-(f)).


Figure 12: Precision-Recall curve plots for Scenarios 8 through 13 (panels (a)-(f)).


This can especially be seen for scenario 7, which achieves a relatively high AUC score of 0.83, comparable to what Radford et al. [28] achieved with an LSTM neural network approach. However, when looking at the Precision-Recall plot in Figure 11f, the Precision and Recall scores indicate poor performance on scenario 7.

Figure 13: AUC scores for each scenario


6 Discussion

In this section, the methods used and the results achieved are discussed, with special regard to the following three points:

1. How generalizable is the system? That is, how well do the results from one experiment transfer to another, and are there general rules that can be drawn from the different experiments?

2. Possible alterations to the architectures of both the Word2Vec model and the Autoencoder model.

3. The selected datasets, specifically how useful they are for evaluating the methods.

6.1 Generalizability of the system

As can be seen in Section 5, the results of the proposed system vary between datasets. This raises the question of why the botnet flows in some datasets are so much easier to distinguish from normal flows than in others. It would, for example, make sense to look more closely at scenario 7, on which the system performed the worst in terms of both the AUC scores and the Precision-Recall curves.

Additionally, from the histogram plots of the reconstruction error in Figures 9 and 10, with MSE values ranging from 0 to 1.5, it is clear that no static threshold can be chosen that would satisfactorily separate normal flows from botnet flows in all scenarios. A threshold of 1.2 would almost completely separate botnet flows from normal flows in Scenario 10, but for every other scenario the same threshold would not be able to make any distinction between botnet flows and normal flows. A smarter way of setting a threshold that generalizes across datasets would be required to deploy the method successfully in an unsupervised manner. One possible solution is to use percentiles of the training set's reconstruction error and then evaluate the system using a percentile-based threshold on the test and botnet datasets. This idea is discussed further in Section 7.1.1.

6.2 Model architectures

The chosen model’s, Word2Vec and Autoencoder both have hundreds of different possi-ble parameters and architecture settings that could be tuned and altered. Some examplesof the possible alterations that could be made include:


1. Changing the size of the Word2Vec model's embedding layer, providing larger or smaller vector representations. The Word2Vec model used has an embedding layer size of 16, half of what was used for IP2Vec [30]. It is possible that this size is too small to sufficiently capture the similarities between IP addresses, ports and protocols. It could also be too large, making it difficult to measure how dissimilar two vector representations are due to the large number of values.

2. Increasing or decreasing the number of hidden layers in both the encoding and decoding parts of the Autoencoder, for better or worse reconstructions of the flows. Since the system only looks at how well it can distinguish botnet flows from normal flows, it does not matter in absolute terms how well the Autoencoder reconstructs its input. What matters is that it should be sufficiently worse at reconstructing botnet flows that a threshold can be set. This means that it could be interesting to look at both more complex and simpler architectures.

3. On the same note, the sizes of the Autoencoder's hidden layers could be altered, with effects similar to changing the number of hidden layers. A larger number of nodes in the hidden layers would result in less compression of the input features, which would likely allow the Autoencoder to reconstruct its input better. However, this could also result in the Autoencoder becoming too good at reconstructing inputs, producing similar reconstruction errors for both normal and botnet flows. Similarly, fewer nodes could be used for increased compression. Whether to increase or decrease complexity, by altering the number of nodes per layer or the number of layers in the model, is clearly a difficult problem.

6.3 Dataset selection

The datasets used for evaluating the proposed system were determined to be of high quality according to the criteria set up by Malowidzki et al. [20]. One of these criteria states that for a dataset to be good and representative of real networks, it should be recent. The CTU-13 datasets were created in 2014 [10], which means that they are five years old at the time of these experiments. With ever-evolving network traffic, and especially new attack patterns for traditional as well as botnet attacks, it is questionable whether the datasets still meet this criterion. Whether they are still to be considered recent and representative of real, modern networks is therefore debatable and could be investigated. One way to at least partly circumvent this issue would be to also use other datasets, such as the CIDDS dataset, briefly mentioned in Section 2.2.1, which was modeled after the CTU-13 datasets but generated at a later time [31].


The CTU-13 datasets are also very well labeled, with each flow clearly marked as either a botnet flow or a normal flow. In a real setting this would likely not be the case, or obtaining perfect labels would cost too many resources. It could therefore be interesting to evaluate the approach on another dataset with worse labeling, or to fabricate worse labeling somehow, to see how the proposed approach would perform in such a setting.


7 Conclusion

In conclusion:

1. An Autoencoder was used for reconstructing network flow data. From these reconstructions, the MSE was calculated to determine the reconstruction error, i.e. to quantify how well the Autoencoder managed to reconstruct each flow.

2. The reconstruction error was inspected to view the difference between how well the model managed to reconstruct normal network traffic and botnet traffic, with the hope of finding a clear distinction between the ability to reconstruct normal traffic compared to botnet traffic.

3. The Autoencoder used multiple features from each flow as input, including statistical features computed from flow properties and 16-dimensional continuous vector representations of IP addresses, port numbers and protocols.

4. The vector representations were computed using a Word2Vec model, trained to find similarities between flow features, similar to how NLP techniques are used to find related words in sentences.

The results, which evaluate the potential of the above approach for botnet detection, show that the reconstruction error can indeed be used as a metric for botnet detection. However, performance varies between datasets, and the reconstruction error can likely not serve as the single metric for complete botnet detection in a real setting while maintaining a low false alarm rate. The technique produces results similar to those of previously used approaches and shows potential; it could be incorporated into other systems or developed further.

7.1 Future work

This section discusses some of the possible future additions that could be implemented to improve the proposed system. Three main possible extensions are discussed:

1. Determining a good method for placing a threshold on the reconstruction error that separates normal flows from botnet flows. As described in Section 4.5, the proposed system is evaluated at a number of thresholds in order to determine the potential of using the reconstruction error as a metric for separation.


2. More advanced Autoencoder architectures, such as LSTM Autoencoders or Robust Deep Autoencoders.

3. Combining the reconstruction error from the Autoencoder with other systems, or using the reconstruction error as a feature for other models.

7.1.1 Threshold placement

The current solution is evaluated at many different thresholds, as can be seen in Section 5. To use the system in a real setting, a threshold has to be placed which classifies flows as anomalous or not. As discussed in Section 6.1, no single value could be used as a threshold that would work well for all the datasets on which experiments were run. Instead, one possible way of setting this threshold could be to use reconstruction error percentiles from the training set. For example, setting the threshold at the 99th percentile would theoretically result in 99% of all flows being classified as normal, while the top 1% (those with the largest reconstruction error) would be marked as anomalous, which could plausibly capture most of the botnet flows. Clearly, some consideration would have to go into choosing a percentile depending on how many flows are present in the network: setting the threshold at the 99th percentile for a network which produces a million flows per day would result in ten thousand alarms per day, which would likely be overwhelming.
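A sketch of such a percentile-based threshold, assuming err_train holds the per-flow reconstruction errors on the training set (computed as in Section 4.4; the variable names are illustrative):

```python
import numpy as np

# place the threshold at the 99th percentile of the training errors
threshold = np.percentile(err_train, 99)

# flows whose reconstruction error exceeds the threshold are flagged
is_anomalous = err_test > threshold
```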

Another technique could be to look at the Cosine Similarity (CS) between the reconstructions and the original flows. CS is a metric which measures the similarity between two vectors. Similarly to how MSE was used to determine how well the Autoencoder reconstructs its input, CS could be used to measure how similar the reconstructions and the original flows are. This would allow threshold placement analogous to that for MSE. An even more refined threshold could possibly be constructed by using both the MSE and CS values.
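A sketch of the row-wise Cosine Similarity between flows and their reconstructions, again assuming the model and arrays from Section 4.4 (illustrative, not thesis code):

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between flows and their reconstructions."""
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )

cs_test = cosine_similarity(Xtest, autoencoder.predict(Xtest))
# here a LOW similarity (rather than a high MSE) would flag a flow as anomalous
```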

7.1.2 Advanced Autoencoder architectures

More advanced Autoencoder architectures could be used for creating a more robust and generalizable system. Radford et al. [28] and Lin et al. [18] both used LSTM neural networks for intrusion detection. An LSTM architecture would be interesting to use since it specializes in sequential data such as NetFlow. This could allow time-dependent features in the datasets to be detected, and possibly more advanced intrusions; however, LSTMs are more computationally expensive and harder to tune.

Robust Deep Autoencoder is another advanced type of Autoencoder which Zhou et al.


[39] found useful for training on datasets where clean (perfectly labeled) data cannot be guaranteed. This could be a promising type of architecture since it is robust against outliers and less affected by anomalous points during training, which could potentially result in a model that can separate botnet and normal flows for dirty (poorly labeled) datasets. A setting where this could be highly useful is, for example, a corporate network where flow data is logged but not labeled. In such an environment, a Robust Deep Autoencoder trained on this unlabeled data, combined with smart threshold placement as discussed in Section 7.1.1, could serve as a potential botnet detector.

7.1.3 Use reconstruction error as metric for other models

The reconstruction error used in this proposed method does not have to be the single metric for detecting botnets in NetFlow data. It could be integrated as a feature for other models, similar to how IP2Vec was used to provide continuous vector representations of IP addresses [30]. For example, it could serve as a feature for the machine learning methods discussed in Sections 3.1 and 3.2.

It could also be worth investigating whether the reconstruction error metric with a threshold classifier is better or worse at detecting certain botnet flows than other models. Perhaps a combination of multiple models, each producing a probability that a flow belongs to a botnet, could result in a superior detector, similar to how random forests leverage multiple decision trees [25, pp. 552-553].


References

[1] E. B. Beigi, H. H. Jazi, N. Stakhanova, and A. A. Ghorbani, "Towards effective feature selection in machine learning-based botnet detection approaches," in 2014 IEEE Conference on Communications and Network Security. IEEE, 2014, pp. 247–255.

[2] R. Bejtlich, The practice of network security monitoring: understanding incident detection and response. No Starch Press, 2013.

[3] L. Bilge, D. Balzarotti, W. Robertson, E. Kirda, and C. Kruegel, "Disclosure: detecting botnet command and control servers through large-scale netflow analysis," in Proceedings of the 28th Annual Computer Security Applications Conference. ACM, 2012, pp. 129–138.

[4] A. Borghesi, A. Bartolini, M. Lombardi, M. Milano, and L. Benini, "Anomaly detection using autoencoders in high performance computing systems," arXiv preprint arXiv:1811.05269, 2018.

[5] I. P. Carrascosa, H. K. Kalutarage, and Y. Huang, Data Analytics and Decision Support for Cybersecurity: Trends, Methodologies and Applications. Springer, 2017.

[6] F. Chollet, Deep learning with Python. Manning Publications Co., 2017.

[7] M. Collins, Network Security Through Data Analysis: Building Situational Awareness. O'Reilly Media, Inc., 2014.

[8] T. Dunning and E. Friedman, Practical machine learning: a new look at anomaly detection. O'Reilly Media, Inc., 2014.

[9] K. Gai, M. Qiu, L. Tao, and Y. Zhu, "Intrusion detection techniques for mobile cloud computing in heterogeneous 5G," Security and Communication Networks, vol. 9, no. 16, pp. 3049–3058, 2016.

[10] S. Garcia, M. Grill, J. Stiborek, and A. Zunino, "An empirical comparison of botnet detection methods," Computers & Security, vol. 45, pp. 100–123, 2014.

[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, Cambridge, 2016, vol. 1.

[12] E. Henry. (2016) Netflow and word2vec -> flow2vec. (Date last accessed: 2019-02-22). [Online]. Available: https://web.archive.org/web/20190125072650/https://edhenry.github.io/2016/12/21/Netflow-flow2vec/


[13] Z. Hu, J. Heidemann, and Y. Pradkin, "Towards geolocation of millions of IP addresses," in Proceedings of the 2012 Internet Measurement Conference. ACM, 2012, pp. 123–130.

[14] M. Idhammad, K. Afdel, and M. Belouch, "DoS detection method based on artificial neural networks," International Journal of Advanced Computer Science and Applications, vol. 8, no. 4, pp. 465–471, 2017.

[15] H. Jiang, Y. Liu, and J. N. Matthews, "IP geolocation estimation using neural networks with stable landmarks," in 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2016, pp. 170–175.

[16] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, "A comparative study of anomaly detection schemes in network intrusion detection," in Proceedings of the 2003 SIAM International Conference on Data Mining. SIAM, 2003, pp. 25–36.

[17] B. Li, M. H. Gunes, G. Bebis, and J. Springer, "A supervised machine learning approach to classify host roles on line using sflow," in Proceedings of the First Edition Workshop on High Performance and Programmable Networking. ACM, 2013, pp. 53–60.

[18] D. Lin and B. Tang, "Detecting unmanaged and unauthorized devices on the network with long short-term memory network," in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 2980–2985.

[19] M. W. Lucas, Network flow analysis. No Starch Press, 2010.

[20] M. Małowidzki, P. Berezinski, and M. Mazur, "Network intrusion detection: Half a kingdom for a good dataset," in Proceedings of NATO STO SAS-139 Workshop, Portugal, 2015.

[21] S. Mathur, B. Coskun, and S. Balakrishnan, "Detecting hidden enemy lines in IP address space," in Proceedings of the 2013 New Security Paradigms Workshop. ACM, 2013, pp. 19–30.

[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.

[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.


[24] M. Moradi and M. Zulkernine, "A neural network based system for intrusion detection and classification of attacks," in Proceedings of the IEEE International Conference on Advances in Intelligent Systems - Theory and Applications, 2004, pp. 15–18.

[25] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, 2012.

[26] M. M. Najafabadi, T. M. Khoshgoftaar, C. Calvert, and C. Kemp, "Detection of SSH brute force attacks using aggregated netflow data," in 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA). IEEE, 2015, pp. 283–288.

[27] M. M. Najafabadi, T. M. Khoshgoftaar, A. Napolitano, and C. Wheelus, "RUDY attack: Detection at the network level and its important features," in The Twenty-Ninth International FLAIRS Conference, 2016.

[28] B. J. Radford, L. M. Apolonio, A. J. Trias, and J. A. Simpson, "Network traffic anomaly detection using recurrent neural networks," arXiv preprint arXiv:1803.10769, 2018.

[29] S. Raschka and V. Mirjalili, Python machine learning, 2nd ed. Packt Publishing Ltd, 2017.

[30] M. Ring, A. Dallmann, D. Landes, and A. Hotho, "IP2Vec: Learning similarities between IP addresses," in Data Mining Workshops (ICDMW), 2017 IEEE International Conference on. IEEE, 2017, pp. 657–666.

[31] M. Ring, S. Wunderlich, D. Grudl, D. Landes, and A. Hotho, "Flow-based benchmark data sets for intrusion detection," in Proceedings of the 16th European Conference on Cyber Warfare and Security. ACPI, 2017, pp. 361–369.

[32] X. Rong, "word2vec parameter learning explained," arXiv preprint arXiv:1411.2738, 2014.

[33] C. Sanders, Practical packet analysis: Using Wireshark to solve real-world network problems. No Starch Press, 2017.

[34] M. Stevanovic and J. M. Pedersen, "An efficient flow-based botnet detection using supervised machine learning," in 2014 International Conference on Computing, Networking and Communications (ICNC). IEEE, 2014, pp. 797–801.

[35] D. S. Terzi, R. Terzi, and S. Sagiroglu, "Big data analytics for network anomaly detection from netflow data," in 2017 International Conference on Computer Science and Engineering (UBMK). IEEE, 2017, pp. 592–597.


[36] Q. A. Tran, F. Jiang, and J. Hu, "A real-time netflow-based intrusion detection system with improved BBNN and high-frequency field programmable gate arrays," in 2012 IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, 2012, pp. 201–208.

[37] J. Wang and I. C. Paschalidis, "Botnet detection based on anomaly and community detection," IEEE Transactions on Control of Network Systems, vol. 4, no. 2, pp. 392–404, 2017.

[38] P. Winter, E. Hermann, and M. Zeilinger, "Inductive intrusion detection in flow-based network data using one-class support vector machines," in 2011 4th IFIP International Conference on New Technologies, Mobility and Security. IEEE, 2011, pp. 1–5.

[39] C. Zhou and R. C. Paffenroth, "Anomaly detection with robust deep autoencoders," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 665–674.
