Post on 20-Apr-2018
Machine Learning for Network Intrusion Detection
Luke Hsiao, Stephen Ibanez
Stanford University
ABSTRACT
Computer networks have become an increasingly valuable target of malicious attacks due to the increased amount of valuable user data they contain. In response, network intrusion detection systems (NIDSs) have been developed to detect suspicious network activity. We present a study of unsupervised machine learning-based approaches for NIDS and show that a non-stationary model can achieve over 35 higher quality than a simple stationary model for a NIDS which acts as a sniffer in a network. We reproduce the results of packet header-based anomaly detection for detecting potential attacks in network traffic and are able to detect 62 of 201 attacks with a total of 86 false positives (an average of under 9 per day) on the 1999 DARPA dataset. Our implementation is open source, available at https://github.com/lukehsiao/ml-ids.
1 INTRODUCTION
As the world becomes more and more connected through the Internet, computer networks have become an increasingly valuable target of malicious attacks. Large corporate networks are of particular value to attackers due to the sensitive information that they carry, such as private user information or internal business data. One of the tools used to address network attacks is the NIDS, which seeks to monitor network traffic and detect suspicious activity or other policy violations. A NIDS typically utilizes either (1) signature detection methods such as SNORT or BlindBox, or (2) machine-learning based techniques such as clustering, neural networks [3, 12], or statistical modeling [6, 11]. Learning-based approaches are of particular interest because they have the potential to detect novel attacks, unlike signature-based systems, which would need to be updated.
One challenge involved with building a NIDS is that they often would like to examine encrypted application-level packet data in an attempt to detect more attacks. This often means that the NIDS must intercept secure connections by positioning itself as a man-in-the-middle or, in the case of BlindBox, employ new protocols to inspect the encrypted payload without sacrificing the end user's privacy. There are a number of drawbacks to these techniques which could be avoided if the NIDS did not need to examine encrypted packet data. As a result, we explore machine learning techniques for building an NIDS that only use unencrypted packet header fields.
Figure 1: Our dataset consists of packets collected by a sniffer between a network and an Internet router.
2 METHOD
A common approach to using machine learning for NIDS is to frame the problem as an unsupervised anomaly detection task, where we desire to train a model to recognize normal, attack-free traffic and consequently recognize anomalous, potentially malicious traffic. We implement two methods for anomaly detection: (1) a stationary model using a mixture of Gaussians, which does not incorporate time as a parameter, and (2) a non-stationary model based on the Packet Header Anomaly Detection (PHAD) paper. We use 33 fields found in packet headers as features, as opposed to other systems which perform anomaly detection by using the bytes of the payload. In addition, our non-stationary model prioritizes the time since an event last occurred rather than the frequency of the occurrence.

We then train and test our models on the widely used DARPA 1999 IDS dataset¹, which is still considered a useful dataset for evaluating this task despite its age and consists of 5 weeks of tcpdump data collected by a sniffer positioned between a local network and an Internet router, as shown in Figure 1. The test data in the DARPA dataset contains 201 instances of 56 types of attacks.
2.1 Stationary Model: Mixture of Gaussians
We first establish a baseline for the quality of a stationary model that does not consider time as a parameter and instead only looks at a packet's features in isolation. Specifically, we model the data as a mixture of Gaussians. During training, we train a Gaussian mixture model using sklearn². We empirically selected to model the data with 16 components using full covariance matrices to keep training time manageable on consumer hardware. We found no significant differences in final quality when varying the number of Gaussians. Model parameters are initialized by first running the k-means clustering algorithm on the data.
While a mixture of Gaussians is typically used to label data based on which Gaussian is the most probable, we instead look at how unlikely a particular packet's features are given the model trained on attack-free data. During detection, each packet is assigned 16 probabilities $P = [p_1, \ldots, p_{16}]$. We then score a packet as $(1 - \max P)$. Packets with a low probability of coming from any of the Gaussians have the highest scores.
2.2 Non-stationary Model: PHAD
Next, we implement the PHAD learning model, which is based on computing an anomaly score for each field of a packet header. During training, we learn an estimate of the rate of anomalies for each field. If a particular packet field is observed $n$ times and consists of $r$ distinct values, then we approximate the probability that the next observation is anomalous by $r/n$. Note that $n$ is not necessarily the same as the total number of packets seen, since the packet fields that are present depend on the type of the packet. The PHAD model assumes that anomalous events occur in bursts, since they are often caused by a change of network state (e.g. a change in software or configuration). During training, only the first anomalous event in a burst is added to $r$.
During detection, PHAD uses time as a parameter to account for the dynamic behavior of real traffic and to help discount the weight of anomalous packets after the first packet in a burst. Specifically, they define a factor $t$ for each field as the time since the previous anomaly in that particular field. Thus, if a packet occurred $t$ seconds ago, PHAD approximates the probability that it will occur in the next second as $1/t$. PHAD looks at a stream of packets sequentially, and produces a score for each field that is inversely proportional to the probability of an anomaly in that field:

$$\text{score}_{\text{field}} = \frac{t \cdot n}{r}$$
Rather than assuming that each field is independent (and thus taking the product of each of the fields' probabilities as a packet score), PHAD approximates fields as occurring sequentially, and scores a packet as the probability of observing consecutive anomalies. A packet's score is the sum of the scores of its anomalous fields.
i anomalous fields
2.2.1 Implementing PHAD-C32
A core implementation challenge of the PHAD model proposed above is detecting whether or not a packet header is anomalous. In particular, packet header fields can range from 1 to 4 bytes, allowing up to $2^{32}$ values. In the original paper, the authors ultimately select a cluster-based approach for detecting anomalies. Rather than store a set of potentially millions of entries for each field, the original PHAD implementation instead stores a list of clusters (each represented as a number interval) up to a limit defined as $C$, where PHAD-C32 uses $C = 32$. When the number of clusters exceeds $C$, the clusters whose ranges are closest together are combined into a single cluster. This implementation is summarized in Algorithm 1. Clusters are created during training, where each field has a set of $C$ clusters. They are used to estimate $r$, which is incremented each time a cluster is added, and to determine whether a field value is anomalous, based on whether the value falls within any of the clusters learned from the training data. In order to best recreate the original results, we implement this same clustering structure using Python.
Algorithm 1 PHAD Clustering algorithm
    C ← 32                                ▷ in the case of PHAD-C32
    clusters ← ∅
    procedure AddValue(a)                 ▷ adding a to the clusters
        for cluster ∈ clusters do
            if a ∈ cluster then
                return                    ▷ a is non-anomalous
        clusters ← clusters ∪ [a, a]      ▷ add new cluster
        if len(clusters) > C then
            merge two nearest clusters
        return
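The clustering of Algorithm 1 combined with the per-field score t·n/r from Section 2.2, as an added Python example (it is not the authors' ml-ids code). The class and function names, the reading of "nearest" as the adjacent pair of intervals with the smallest gap, the zero score for a field never seen in training, and the constructed TTL observations are assumptions.

```python
class FieldModel:
    """One header field: at most C value intervals plus the counters n and r."""
    def __init__(self, C=32):
        self.C = C
        self.clusters = []        # sorted, disjoint [lo, hi] intervals
        self.n = 0                # training observations of this field
        self.r = 0                # training values that needed a new cluster
        self.last_anomaly = 0.0   # time of the previous anomaly in this field

    def contains(self, a):
        return any(lo <= a <= hi for lo, hi in self.clusters)

    def train(self, a, ts):
        # AddValue from Algorithm 1, plus the n and r bookkeeping.
        self.n += 1
        if self.contains(a):
            return                          # non-anomalous, nothing to learn
        self.r += 1
        self.last_anomaly = ts
        self.clusters.append([a, a])
        self.clusters.sort()
        if len(self.clusters) > self.C:
            # Assumption: "nearest" = adjacent intervals with the smallest gap.
            gaps = [b[0] - a[1] for a, b in zip(self.clusters, self.clusters[1:])]
            i = gaps.index(min(gaps))
            _, hi = self.clusters.pop(i + 1)
            self.clusters[i][1] = hi

    def score(self, a, ts):
        """Return t*n/r for an anomalous value, 0.0 otherwise."""
        if self.r == 0 or self.contains(a):
            return 0.0                      # assumption: untrained field scores 0
        t = ts - self.last_anomaly
        self.last_anomaly = ts              # later packets in a burst get a small t
        return t * self.n / self.r

def score_packet(models, packet, ts):
    # Packet score = sum of the scores of its anomalous fields.
    return sum(models[f].score(v, ts) for f, v in packet.items() if f in models)

def train_field(samples, C=32):
    """samples: iterable of (timestamp, value) pairs from attack-free traffic."""
    model = FieldModel(C)
    for ts, value in samples:
        model.train(value, ts)
    return model

# Constructed TTL observations (timestamp, value); C=2 forces one merge.
TTL_TRAINING = [(0.0, 64), (1.0, 64), (2.0, 128), (3.0, 64), (4.0, 255), (5.0, 128)]
```

With `train_field(TTL_TRAINING, C=2)`, a TTL of 100 is accepted because the merge widened the interval to [64, 128], while an unseen TTL at time 10 scores 12.0 and a second unseen TTL one second later scores only 2.0, which is the burst discount described above.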
2.2.2 Post-processing
In order to correctly identify an attack, the model must indicate the destination IP address, the date, and the time of the attack to within one minute of accuracy. After providing a list of anomalies that the model classifies as attacks, we perform a post-processing step that removes duplicate detections of the same attack. We keep only the highest scoring detection for each attack.
3 EXPERIMENTS
As our core experiment, we evaluate the quality of our NIDS implementation based on two primary metrics: (1) recall, the percentage of attacks which we are able to identify in the dataset and (2) precision, the percentage of packets we classify as attacks that are true instances of attacks. These two metrics provide an assessment of how effective our NIDS is at detecting attacks, and how practical such a NIDS system would be based on its precision. For this application, a very low rate of false positives is essential, due to the limited amount of time that analysts would have to respond to these in a real-world situation. We use F1 score (the harmonic mean of precision and recall) as a unifying metric for the precision and recall of a system.
Table 1: Comparing results of our stationary and non-stationary models with a leading system in the DARPA off-line e