# [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Network Anomaly Detection Using Co-clustering


Network Anomaly Detection using Co-clustering

Evangelos E. Papalexakis, Alex Beutel, Peter Steenkiste

Department of Electrical & Computer Engineering, School of Computer Science

Carnegie Mellon University, Pittsburgh, PA, USA

Email: {epapalex, abeutel, prs}@cs.cmu.edu

Abstract: Early Internet architecture design goals did not put security as a high priority. However, today Internet security is a quickly growing concern. The prevalence of Internet attacks has increased significantly, but still the challenge of detecting such attacks generally falls on the end hosts and service providers, requiring system administrators to detect and block attacks on their own. In particular, as social networks have become central hubs of information and communication, they are increasingly the target of attention and attacks. This creates a challenge of carefully distinguishing malicious connections from normal ones.

Previous work has shown that for a variety of Internet attacks, there is a small subset of connection measurements that are good indicators of whether a connection is part of an attack or not. In this paper we look at the effectiveness of using two different co-clustering algorithms both to cluster connections and to mark which connection measurements are strong indicators of what makes any given cluster anomalous relative to the total data set. We run experiments with these co-clustering algorithms on the KDD 1999 Cup data set. In our experiments we find that soft co-clustering, running on samples of data, finds consistent parameters that are strong indicators of anomalous detections and creates clusters that are highly pure. When running hard co-clustering on the full data set (over 100 runs), we on average have one cluster with 92.44% attack connections and the other with 75.84% normal connections. These results are on par with the KDD 1999 Cup winning entry, showing that co-clustering is a strong, unsupervised method for separating normal connections from anomalous ones.

Finally, we believe that the ideas presented in this work may inspire research for anomaly detection in social networks, such as identifying spammers and fraudsters.

I. INTRODUCTION

Security is an increasingly large problem in today's Internet. However, the initial Internet architecture did not consider security to be a high priority, leaving the problem of managing security concerns to end hosts. This is therefore a growing problem for system administrators, who have to continually fend off a variety of attacks and intrusion attempts by both individuals and large botnets. An early study from 2001 suggested that there are roughly 2000-3000 denial-of-service (DoS) attacks per week, with many attacks having a lower bound of 100,000 packets-per-second [19]. With the rise of botnets, we can only expect the prevalence of distributed denial-of-service (DDoS) attacks to have risen over the last decade.

In particular, as social networks have become central to communication internationally, there has been an increase in scrutiny and attacks on these services. For example, Twitter was taken down for multiple hours in 2009 due to a continued DDoS attack [6], and Google, along with numerous other companies, reported being attacked in 2010 [2].

This security issue creates the challenge for system administrators of how to distinguish normal connections of legitimate users from malicious connections. An administrator, in the face of an attack, wants to block such malicious connections without blocking legitimate users. We can never know for sure if any given connection is legitimate or malicious, but we can attempt to isolate a set of connections that stand out from normal user behavior, so as to help system administrators detect attacks. We call this problem network intrusion detection.

In this paper we propose to apply a set of emerging data mining and machine learning techniques to the problem of network intrusion detection. The abstract task description is as follows: We are given a set of connection measurements, each of which consists of a set of connection parameters; we are seeking to categorize each connection as normal or anomalous based solely on this set of parameters (i.e. without using any prior knowledge about the connection, or training data). An anomalous connection may be loosely defined as any connection or set of connections that in some way deviates from common behavior. Given such a set of connections, we look to find methods to easily categorize and display why a set of connections are anomalous, with the hope that with human review we can identify malicious behavior.

In [16], the authors show that certain types of attacks are strongly correlated with only subsets of a connection's parameters. In other words, it is possible to determine the type of an attack just by looking at a specific subset of parameters that best characterizes this attack. The only problem is that these best subsets of parameters are not known a priori. Previous attempts, such as [16], have employed feature selection techniques in order to single out the most important parameters overall, and then applied traditional clustering algorithms on the reduced set of parameters. This particular work essentially does Principal Component Analysis on the data, and subsequently selects the most important features via thresholding.

We, however, propose another approach where we don't need to single out the most important parameters for the whole set of attacks. Instead, we seek to find directly the (possibly overlapping) subsets of parameters associated with a set of connections that make such a set abnormal in the larger set of normal connections. There is a class of data mining algorithms by the name of Co-clustering [8], [10], [21] that may effectively tackle this problem. In rough terms, Co-clustering may be described as follows: Given a data matrix, we seek to find subsets of rows that are correlated (usually in the Euclidean sense) with only a subset of its columns (as opposed to the traditional clustering approach, where a subset of the rows has to be similar across all columns/dimensions). This description readily applies to the problem at hand, if we model the set of connection measurements as a matrix, where the rows correspond to each distinct connection, and the columns correspond to the connection's parameters.

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE, DOI 10.1109/ASONAM.2012.72

The advantage of the proposed approach over the prior art is the fact that our analysis will not rely on any particular labelling of the dataset that might yield biased results according to the labeller's experience of what is normal or abnormal and what an attack might look like. This way, on the one hand, we avoid fine tuning our anomaly detection mechanism so that it works perfectly for just a specific dataset and recognizes only a specific set of attacks that pertains to a certain labelling of a certain dataset (whereas it might fail on another dataset). On the other hand, this allows us to possibly discover attacks and intrusion attempts that go unnoticed or have not been noticed before. By looking at anomalous connections more generally, with human review of the data, we are not dependent on just catching past attacks, but instead review all anomalies that don't match normal behavior.

We use the KDD 1999 Cup Intrusion Detection Dataset [4]. This dataset is in a connection-by-parameter matrix form, and also contains labels for each connection; these labels will be used at the evaluation step of the algorithm, when we will attempt to evaluate the homogeneity of the resulting co-clusters (e.g. connections of the same type end up in the same co-cluster). It is worth pointing out that the labels will not be involved in the actual co-clustering process, for the reasons we explained earlier on.

The outline of the paper is as follows. In Section II we describe our initial motivation for pursuing co-clustering as our method of data analysis. In Section III we describe the methodology for our data mining, explaining the behavior and our usage of the different co-clustering algorithms. In Section IV we give the results of running numerous experiments with two different co-clustering algorithms. In Section V we provide a brief literature survey on network intrusion detection, with particular focus on applications using clustering. Finally, in Section VI we describe our conclusions about the usefulness of co-clustering and how the two algorithms could be used together to detect and analyze anomalous connections.

NOTATION PRELIMINARIES

A scalar is denoted by a lowercase, italic letter, e.g. $x$. A column vector is denoted by a lowercase, boldface letter, e.g. $\mathbf{x}$. An $I \times J$ matrix is denoted by an uppercase, boldface letter, e.g. $\mathbf{X}$. The Frobenius norm of a matrix $\mathbf{X}$ is denoted by $\|\mathbf{X}\|_F$ and is the square root of the sum of squared elements of the matrix $\mathbf{X}$.

II. MOTIVATION

To justify the use of co-clustering for network anomaly detection, we begin by demonstrating the shortcomings and difficulties of simpler methods, such as Principal Component Analysis, which essentially boils down to regular clustering, considering the entire list of connection parameters.

For this purpose, we use a labelled portion of the dataset, consisting of 494020 connections; 97277 of them are normal and 396743 are attacks. There are 22 different types of attacks in the labelled data set. For each connection there are 37 recorded measurements, listed in Table I. We form a $494020 \times 37$ matrix $\mathbf{X}$ where each row corresponds to a connection and each column corresponds to a parameter. As a preliminary step, we choose to normalize each column of $\mathbf{X}$, dividing by the maximum element of each column; this scaling brings all parameters to the same range and prevents inherently high valued parameters from dominating the matrix. This step is essential because we will use $\ell_2$ norms/Euclidean distances. In one of the two algorithms we use, we also take the natural logarithm of the non-zero values of $\mathbf{X}$.
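The preprocessing described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `normalize_columns` and `log_nonzero` are hypothetical, the data is assumed to be non-negative, the guard for all-zero columns is our own addition, and the text does not say whether the logarithm is applied before or after the scaling.

```python
import numpy as np

def normalize_columns(X):
    """Divide each column by its maximum element (non-negative data assumed)."""
    X = np.asarray(X, dtype=float)
    col_max = X.max(axis=0)
    # Assumption: a column whose maximum is 0 is left unchanged
    # instead of being divided by zero.
    safe_max = np.where(col_max == 0, 1.0, col_max)
    return X / safe_max

def log_nonzero(X):
    """Natural logarithm of the non-zero entries; zero entries stay zero."""
    X = np.asarray(X, dtype=float)
    out = np.zeros_like(X)
    mask = X != 0
    out[mask] = np.log(X[mask])
    return out

# Toy matrix: 2 connections (rows) by 3 parameters (columns).
X_toy = np.array([[2.0, 0.0, 10.0],
                  [4.0, 0.0, 5.0]])
X_scaled = normalize_columns(X_toy)
```

After scaling, every non-zero column has a maximum of 1, so no single high-valued parameter dominates a Euclidean distance between two rows.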

Each connection is now a vector in 37-dimensional space, which makes it hard to view in two or even three dimensions. A standard dimensionality reduction technique, which enables us to visualize high dimensional data, comes by the name of Principal Component Analysis (PCA), which basically boils down to the Singular Value Decomposition of $\mathbf{X}$. Very briefly, $\mathbf{X}$ may be decomposed as

$$\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$$

where $\mathbf{U}, \mathbf{V}$ are the left and right singular matrices (which are orthonormal), and $\boldsymbol{\Sigma}$ is a diagonal matrix that contains the singular values of $\mathbf{X}$. Let $\mathbf{u}_1, \mathbf{u}_2$ be the first two columns of $\mathbf{U}$. Then, if we plot those two vectors against each other, we get an optimal (in the least squares sense) projection of the data in the 2-dimensional space. In Figure 1 we demonstrate such a projection for $\mathbf{X}$. This procedure was done after considering the whole set of 37 parameters. We observe that normal and anomalous connections cannot be clearly differentiated without the labels. To corroborate this observation, we also did K-means clustering on the rows of $\mathbf{X}$ [3] with $K = 2$, and measured the ratio of normal connections to the total number of connections in each cluster. For cluster 1 the ratio was $6 \cdot 10^{-4}$, which indicates very few normal connections, and for cluster 2 we got 0.1963, which is also really low. This indicates that even though most of the normal connections ended up in the second cluster, we were not able to tell them apart from the attacks.
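The baseline above (an SVD projection, then K-means on the rows, then the fraction of normal connections per cluster) might be sketched as follows. The helper names `project_2d`, `kmeans` and `normal_ratio` are hypothetical, the K-means routine is a plain Lloyd's iteration rather than the implementation cited as [3], and the six-row matrix is toy data, not the KDD set. Following the text, the SVD is applied to the matrix directly, without mean-centering.

```python
import numpy as np

def project_2d(X):
    """First two left singular vectors of X: one 2-D point per row."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :2]

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on the rows of X; returns a label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance of every row to each center.
        d = np.stack([((X - c) ** 2).sum(axis=1) for c in centers], axis=1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def normal_ratio(labels, is_normal, k):
    """Fraction of normal connections in each cluster (NaN if empty)."""
    is_normal = np.asarray(is_normal, dtype=bool)
    return [float(is_normal[labels == j].mean()) if np.any(labels == j)
            else float("nan") for j in range(k)]

# Toy data: three "normal" rows near the origin, three "attack" rows far away.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
is_normal = [1, 1, 1, 0, 0, 0]

points = project_2d(X_toy)           # scatter-plot column 0 against column 1
labels = kmeans(X_toy, k=2)
ratios = normal_ratio(labels, is_normal, k=2)
```

On this well-separated toy data the two ratios come out as 0 and 1. The experiment reported in the text is the opposite situation: both ratios are low, so neither cluster isolates the normal connections.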

Fig. 1. PCA on $\mathbf{X}$ with points colored by given labels, where blue points were labelled as normal, and red points were labelled as an attack.

TABLE I. LIST OF CONNECTION MEASUREMENTS IN THE DATA SET.

| Measurement Name | Measurement Type |
| --- | --- |
| duration | continuous |
| protocol_type | symbolic |
| service | symbolic |
| flag | symbolic |
| src_bytes | continuous |
| dst_bytes | continuous |
| land | symbolic |
| wrong_fragment | continuous |
| urgent | continuous |
| hot | continuous |
| num_failed_logins | continuous |
| logged_in | symbolic |
| num_compromised | continuous |
| root_shell | continuous |
| su_attempted | continuous |
| num_root | continuous |
| num_file_creations | continuous |
| num_shells | continuous |
| num_access_files | continuous |
| num_outbound_cmds | continuous |
| is_host_login | symbolic |
| is_guest_login | symbolic |
| count | continuous |
| srv_count | continuous |
| serror_rate | continuous |
| srv_serror_rate | continuous |
| rerror_rate | continuous |
| srv_rerror_rate | continuous |
| same_srv_rate | continuous |
| diff_srv_rate | continuous |
| srv_diff_host_rate | continuous |
| dst_host_count | continuous |
| dst_host_srv_count | continuous |
| dst_host_same_srv_rate | continuous |
| dst_host_diff_srv_rate | continuous |
| dst_host_same_src_port_rate | continuous |
| dst_host_srv_diff_host_rate | continuous |
| dst_host_serror_rate | continuous |
| dst_host_srv_serror_rate | continuous |
| dst_host_rerror_rate | continuous |
| dst_host_srv_rerror_rate | continuous |

The method we propose to use, co-clustering, selects a subset of the rows and a subset of the columns (i.e. parameters) of $\mathbf{X}$. A naive and suboptimal way to emulate that behaviour, without actually using co-clustering, is the following. First, we do K-means clustering on the columns of $\mathbf{X}$. This way, hopefully, parameters that are similar should end up in the same cluster. Consider now $\mathbf{X}_k$ for $k = 1 \ldots K$, being the resulting matrix when keeping only the columns that belong to the $k$-th cluster. In Fig. 2 we demonstrate this for $K = 3$. We can see that, at least for $k = 1$ and $k = 3$, the red points (i.e. anomalous connections) are better separated from blue points, in contrast to Fig. 1. This observation provides intuition and motivation as to why we need to consider only subsets of the parameters in order to achieve a better result. We proceed to show that by using co-clustering, we can yield even better results.
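The naive emulation described above (cluster the columns, then keep one column group at a time) can be sketched as follows. The names `kmeans_labels` and `column_submatrices` are hypothetical, the K-means routine is again a plain Lloyd's iteration chosen for self-containment, and the 4-by-4 matrix is toy data.

```python
import numpy as np

def kmeans_labels(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on the rows of `points`; returns labels."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = np.stack([((points - c) ** 2).sum(axis=1) for c in centers], axis=1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old center
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def column_submatrices(X, k, seed=0):
    """Cluster the columns of X, then return one submatrix X_k per cluster."""
    # Each column becomes one point by clustering the rows of X transposed.
    col_labels = kmeans_labels(X.T, k, seed=seed)
    subs = [X[:, col_labels == j] for j in range(k)]
    return subs, col_labels

# Toy matrix: columns 0-1 behave alike, columns 2-3 behave alike.
X_toy = np.array([[1.0, 1.2, 0.0, 0.0],
                  [1.0, 0.9, 0.0, 0.0],
                  [0.0, 0.0, 5.0, 5.5],
                  [0.0, 0.0, 5.0, 4.8]])
subs, col_labels = column_submatrices(X_toy, k=2)
```

Each submatrix keeps all rows but only one group of columns, so it can be projected and plotted on its own in the same way as the full matrix. Unlike co-clustering, the column groups here are chosen without any regard to which rows they help separate.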

III. METHOD DESCRIPTION

A. Introduction to Co-clustering

In a nutshell, co-clustering seeks to find subsets of rows and columns in the data matrix that are similar, usually in the least squares sense. Hard co-clustering attempts to partition the rows and the columns of the matrix, therefore 1) no overlap is allowed, and 2) all rows and columns have to be assigned to a row/column subset. This problem is NP-hard, and the algorithm we use is an approximation thereof. On the other hand, soft co-clustering relaxes the disjoint subsets constraint, allowing for overlapping between subsets. Because we focus on the application of co-clustering rather than the algorithms themselves, we give only a brief overview of the two co-clustering algorithms below. In the following description, we may use the terms cluster and co-cluster interchangeably.
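To make the least squares view of hard co-clustering concrete, the sketch below scores a given partition of rows and columns by replacing each block with its mean and summing the squared residuals. It is an illustration only: the function names are hypothetical, the block-mean approximation with squared Euclidean error is just one simple choice of loss, and the algorithms used in this paper may optimize a different objective. The sketch evaluates a partition; it does not search for one.

```python
import numpy as np

def block_approximation(X, row_labels, col_labels):
    """Replace each entry by the mean of its (row cluster, column cluster) block."""
    X = np.asarray(X, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    X_hat = np.zeros_like(X)
    for r in np.unique(row_labels):
        for c in np.unique(col_labels):
            block = np.ix_(row_labels == r, col_labels == c)
            X_hat[block] = X[block].mean()
    return X_hat

def squared_error(X, row_labels, col_labels):
    """Squared Frobenius norm of the residual X - X_hat."""
    X = np.asarray(X, dtype=float)
    residual = X - block_approximation(X, row_labels, col_labels)
    return float(np.sum(residual ** 2))

# Toy matrix with an exact 2-by-2 block structure.
X_toy = np.array([[1.0, 1.0, 5.0],
                  [1.0, 1.0, 5.0],
                  [9.0, 9.0, 0.0]])

# A hard co-clustering gives every row and every column exactly one label.
good = squared_error(X_toy, row_labels=[0, 0, 1], col_labels=[0, 0, 1])
bad = squared_error(X_toy, row_labels=[0, 1, 1], col_labels=[0, 0, 1])
```

The partition that matches the block structure has zero error, while the mismatched row partition does not. A soft co-clustering would not fit the single-label representation used here, since a row or column may then belong to several co-clusters or to none.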

1) Hard co-clustering: Information Theoretic co-clustering: In [8], Banerjee et al. introduce a class of co-clustering algorithms that employ Bregman divergences, unified in an abstract framework. This approach seeks to minimize the loss of information incurred by the approximation of the data matrix $\mathbf{X}$, in terms of a predefined Bregman divergence function. Bregman co-clustering is what we called earlier a hard co-clustering technique, in the sense...
