
Finding Common Ground among Experts' Opinions on Data Clustering: with Applications in Malware

Analysis

Guanhua Yan

Information Sciences Group (CCS-3), Los Alamos National Laboratory

Los Alamos, NM 87545, U.S.A. [email protected]

Abstract-Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require all data objects be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects - if they are selected - are grouped together. We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major AV (Anti-Virus) software classify these samples. Our work offers a new direction for consensus clustering by striking a balance between the clustering quality and the number of data objects chosen to be clustered.

I. INTRODUCTION

The abundance of data available these days calls for techniques and tools that are able to process and manage it in an efficient manner. Data clustering, the task of grouping data objects that share similar characteristics, is one of the most basic techniques for knowledge discovery and data mining and has been widely used in a variety of application domains. As the volume of data grows significantly, data clustering can become computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third-party experts who specialize in data clustering.

Experts' opinions, however, may be biased for a number of reasons. First, different experts may rely on different types of features extracted from the original data objects for clustering. In some cases, for proprietary reasons, experts may not want to reveal the type of features they use to cluster data objects. Second, even using the same type of features for clustering, different experts may use different algorithms to group data objects together. There are a large number of data clustering techniques in the literature, which have been summarized

978-1-4799-2555-1/14/$31.00 © 2014 IEEE

in a few survey papers (e.g., [21], [5], [36]). Third, even when the same clustering algorithm is used to group data objects based on the same features, the way the distance metric between two data objects is evaluated also affects how these objects are grouped together. Last but not least, clustering algorithms usually have tunable parameters to control the resolution at which similar objects should be grouped together. Hence, even assuming all the other factors are the same, disagreement on these parameters can still lead to disparity in clustering results.

To avoid biases due to the aforementioned reasons, we can crowdsource the task of data clustering to multiple experts and combine their opinions on how data objects should be grouped. Consensus clustering, which aggregates clustering results from different sources, has been investigated in a number of previous works (e.g., [31], [12], [9], [18], [24]). The premise behind all these efforts on consensus clustering is that all data objects should be grouped. In some applications, however, this is not what we want: when experts' opinions disagree on how multiple objects should be grouped together, these previous methods have to force a conclusion, which inevitably leads to a conflict. Consider the following motivating example. For three data objects A, B, and C, two experts, Alice and Bob, cluster them as {{A, B}, {C}} and {{A}, {B, C}}, respectively. If we have to group all the objects, there is always confusion, irrespective of what clustering algorithm is used: should we cluster A and B together (respecting Alice's opinion), or B and C together (respecting Bob's opinion), or put A, B, and C into either distinct groups or the same group (respecting neither's opinion)?

In some applications, forcing all data objects into groups can be harmful. This is because once data objects are grouped, those belonging to the same group are often further investigated to extract common characteristics from them as signatures to identify new group members later. If data objects are forced into groups to which they should not belong, the existence of bad samples may affect the quality of the signatures generated for later classification purposes. For such circumstances, where the quality of group membership is important, we would rather consider only those samples for which

ICDE Conference 2014


we are confident in their group memberships.

This motivates us to take a different approach to finding common ground among experts' opinions on data clustering. Rather than aggregating clustering results from multiple sources by attempting to group every data object, we only choose and group those data objects for which there is no conflict among experts' opinions about their group memberships. Using the same example as before, our approach does not attempt to group all data objects A, B, and C. Instead, we choose only a subset of them and form groups among them without causing conflicts between Alice's and Bob's opinions. Suppose that we only choose two of them, A and B, and put both into the same group. Without the existence of C, assuming that Alice and Bob may group data objects at different resolution levels, the difference in their actions in grouping A and B together or not is explicable. On the other hand, although our approach relaxes the constraint that we have to group every data object, we still want to group as many data objects as possible. This is because, given a larger number of data objects in the same group, the common characteristics extracted from the samples in it are more generalizable to unseen ones. In the extreme case, we could group no data objects at all, which surely creates no conflicts, but is of no interest for any practical application.

Against this backdrop, we develop techniques to combine multiple experts' opinions on data clustering in a conflict-free manner. While doing so, we aim to find the maximum set of data objects on which the experts' opinions do not conflict with each other. As the problem of interest is to cluster a large number of data objects that exist in a diversity of application domains, we not only theoretically explore the tractability of this problem, but also investigate methods that perform efficiently in practical settings. To justify the relevance of our proposed approach in practice, we find its applications in the field of malware analysis, where the quality of clustering results is crucial to the development of effective signatures for automated malware classification.

In a nutshell, our main contributions are summarized as follows. First, we rigorously formulate the problem of finding common ground among clustering results from multiple experts under the constraint that no conflict should be formed among grouped data objects, and prove that this problem is NP-complete. Second, to find a conflict-free set of data objects, we develop an efficient algorithm to generate a conflict graph, which is a 3-uniform hypergraph that characterizes all potential conflicts in experts' opinions on data clustering. Third, we propose three different schemes, one a greedy solution and the other two randomized algorithms, to find the maximum independent set of the conflict graph, based on which we select a conflict-free subset of data objects. Fourth, we further look for connected components in a k-partite graph to group selected data objects into clusters. Finally, we experimentally evaluate our proposed method on a malware dataset containing more than 500,000 unique malware variants, and show that our method is effective and efficient in producing malware clusters that facilitate further malware analysis.


Our work opens a new direction for data clustering by crowdsourcing the task of data clustering to third-party experts and aggregating their opinions in a conflict-free manner. Our proposed method is practically feasible because it works in a purely distributed fashion: we do not assume that the data clustering algorithm used by an expert, or its parameter settings, is known to any other expert or to the party that aggregates the experts' opinions. Also, the group names or identifiers used by any expert bear only symbolic meaning, and it is unnecessary for the experts to agree on the naming conventions they should follow to name the groups. Hence, our method has a wide spectrum of real-world applications in addition to malware analysis, which is only the example application, albeit a motivating one, used in this study.

The remainder of the paper is organized as follows. Related work is introduced in Section II. We formulate the conflict-free data clustering problem in Section III, and prove that it is NP-complete in Section IV. We propose a single-level method for choosing a set of conflict-free data objects in Section V, and provide a multi-level algorithm that addresses the computational challenge posed when the conflict graph is large in Section VI. In Section VII, we develop a method of finding connected components in a k-partite graph to group selected data objects into clusters. In Section VIII, we evaluate our proposed method on a real-world malware dataset. We draw concluding remarks in Section IX.

II. RELATED WORK

Clustering is one of the most basic techniques for knowledge discovery and data mining. A plethora of data clustering algorithms have been proposed (e.g., see the survey papers [21], [5], [36]). The type of clustering technique that resembles our work the most is consensus clustering, which has been investigated in a number of previous works (e.g., [31], [12], [9], [18], [24]). According to the consensus functions used in consensus clustering, these previous methods fall into the following categories [17]:

• Hypergraph partitioning (e.g., [31], [12]): We can connect the data objects in a cluster grouped by each individual expert with a hyperedge in a hypergraph. Hence, we can find the min-cut of the hypergraph to obtain the consensus among the clustering results by different experts, as the edges on a cut reflect the disagreement among them.

• Voting approaches (e.g., [13], [10]): The voting approaches aim to find the label correspondence among different experts in labeling the clusters. They permute the cluster labels in order to find the best agreement between the labels of two partitions.

• Mutual information (e.g., [32], [25], [2]): These methods formulate the problem of finding a consensus clustering based on the mutual information between the probability distributions of the labels in the consensus clustering and those in the ensemble of clusters by the experts.

• Co-association based functions (e.g., [14], [15], [8], [35]): A co-association matrix can be used to characterize the pair-wise similarity between how two data objects are clustered by different experts. This matrix can be further used to find the final partition with hierarchical agglomerative algorithms or other methods.

• Finite mixture models (e.g., [32], [33], [1]): The techniques based on finite mixture models assume that cluster labels of data objects are drawn from a probabilistic model with a mixture of multinomial component densities. Based on this assumption, the maximum-likelihood method can be used to find the mixture model that best fits the observed data, which are the labels in the ensemble of clusters.

Even with a large body of existing consensus clustering methods, our work in this study cannot be classified into any of these categories. The crucial difference is that our method does not require that all data objects should be selected and clustered. As evidenced by the application of our method in malware analysis, the existing consensus clustering methods are not desirable for applications where later per-group analysis prefers groups of data objects whose group relationships precisely capture the inherent similarity among data objects. For malware analysis, for instance, too many samples wrongly added to a malware group make it difficult to extract effective signatures for detecting instances belonging to this group in a practical malware detection system.

In recent years, crowdsourcing has been a hot field of research, as technology has made it relatively easy for companies and institutions to outsource traditionally local computational tasks to third-party entities. As evidenced by a number of high-profile collaborative projects such as Wikipedia and Dell's IdeaStorm, crowdsourcing can leverage the collective intelligence of the crowd to improve the quality and efficiency of various computational tasks. The comprehensive survey paper by Zhao and Zhu provides the state of the art and also the future directions of research related to crowdsourcing [38]. Our work is another addition to the existing literature on crowdsourcing research, and it focuses on how to leverage crowdsourcing to improve a very basic but widely used computational task: data clustering. The rationale behind this work, which is to find common ground among results from multiple third parties in a conflict-free fashion, can be applied to other types of data processing tasks when they are crowdsourced. When the conflict-free constraint is too stringent for some applications, a further extension of this work is to quantify the conflict among different opinions and find regions where conflicts are bounded within a tolerable margin. We plan to extend our work along this line in the future.

Malware analysis is the example application domain in which our proposed method has been found useful. In the computer security literature, clustering techniques have been applied to group malware instances with similar features [3], [4], [22]. Our work here is orthogonal to these previous works, as clustering results from different types of features can be integrated into a single one with our proposed method for further per-family malware analysis. Relying on a single feature type for malware clustering may not be robust against obfuscation techniques used by many intelligent malware programs. Supervised learning techniques have been adopted to automate the process of classifying malware variants into their corresponding families [27], [37]. One key challenge facing automated malware classification is how to obtain labeled samples to train classifiers for individual malware families. We cannot depend on any single Anti-Virus (AV) software to obtain such labeled training samples, as no AV software can classify all malware instances correctly (otherwise, we would not need to build a new malware classifier). Our proposed method in this work offers a promising approach to integrating opinions from multiple AV software in order to find a set of training malware samples labeled with high accuracy.

III. PROBLEM FORMULATION

In this section, we formulate the problem of finding common ground among experts' opinions in a conflict-free manner. Consider a set of data objects W = {w1, w2, ..., wn}. We also have m experts, X1, X2, ..., Xm. The set of group notations used by expert Xi is denoted by Fi. We let ci,j ∈ Fi ∪ {u}, where 1 ≤ i ≤ m and 1 ≤ j ≤ n, represent the cluster into which Xi groups data object wj if ci,j ∈ Fi, or the generic case u if the expert cannot identify the cluster as any one in Fi. Note that if two data objects are both assigned the generic case u by an expert, it does not necessarily mean that the expert groups them into the same cluster. The generic case u offers great flexibility when an expert cannot decide which group a data object should be put into. This can occur for a variety of reasons. For example, if the features that the expert relies on for clustering cannot be extracted from the data object, the generic cluster can be assigned to the data object.

We aim to choose a subset of data objects S ⊆ W without any conflict in clustering by different experts. To understand the concept of conflict-free data clustering, we first consider any three data objects wi, wj, and wk selected into set S. Given any two experts Xa and Xb, we say that there is a conflict if u ≠ ca,i = ca,j ≠ ca,k and cb,i ≠ cb,j = cb,k ≠ u. The concept of conflict is intuitive here: if the precondition holds true, according to Xa, data object wj should belong to the same cluster as wi but a different one from data object wk; on the other hand, according to Xb, data object wj should belong to the same cluster as data object wk but a different one from data object wi. Hence, there is inconsistency in grouping data object wj by the two experts. Here, we note that having only ca,i = ca,j and cb,i ≠ cb,j does not necessarily mean that there is a conflict, because experts Xa and Xb may classify data objects at different resolutions. Given two chickens, one male and the other female, one expert may group both into the same cluster representing a group of chickens, while another may group them separately into a group of roosters and a group of hens. As we do not assume there is a priori knowledge shared by experts on how to group data objects, we allow experts to choose their own resolutions to group data objects.

It is noted that the definition of a conflict excludes the cases where ca,i = ca,j ≠ ca,k and cb,i ≠ cb,j = cb,k but (1) ca,i = ca,j = u or (2) cb,j = cb,k = u. This is because if only (1) holds, we can group data objects wj and wk into the same cluster, a different one from data object wi, without any confusion, since ca,i = ca,j = u does not mean that Xa thinks data objects wi and wj must belong to the same cluster. The case when only (2) holds is similar. When both (1) and (2) hold, we can simply group the three data objects into different clusters without any confusion.
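For concreteness, the conflict condition can be written as a small predicate. The sketch below is our own illustration (the function name and the labels layout are not from the paper): labels[a][x] holds expert Xa's cluster identifier for object wx, with u the generic cluster, and the two ≠ u guards realize exclusions (1) and (2) above.

```python
def is_conflict(labels, i, j, k, u=0):
    """True iff some ordered expert pair (a, b) satisfies
    u != c_{a,i} = c_{a,j} != c_{a,k} and c_{b,i} != c_{b,j} = c_{b,k} != u,
    i.e., the two experts disagree on the middle object j."""
    m = len(labels)
    for a in range(m):
        for b in range(m):
            ca, cb = labels[a], labels[b]
            if (ca[i] != u and ca[i] == ca[j] and ca[j] != ca[k] and
                    cb[i] != cb[j] and cb[j] == cb[k] and cb[k] != u):
                return True
    return False
```

On the introduction's example (Alice: {{A, B}, {C}}, Bob: {{A}, {B, C}}), encoded as labels = [[1, 1, 2], [1, 2, 2]], the triple (A, B, C) conflicts with B as the middle object. Note that the definition fixes which object plays the middle role, so deciding whether a set of three objects is conflict-free requires trying all orderings of the triple.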

The conflict-free data clustering (CFDC) problem is defined as choosing a subset of conflict-free data objects H ⊆ W such that |H| is maximized. The motivation behind this is that we want to group as many data objects as possible in a conflict-free manner. Once H is decided, we will further cluster the data objects in it into different groups based on the opinions of all the experts. The decision version of the CFDC problem is: given any integer k, does there exist a subset of data objects in W such that there is no conflict among them and its size is no less than k? In the next section, we study the tractability of the CFDC problem.
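Since the decision version only asks whether a conflict-free subset of size at least k exists, tiny instances can be settled by exhaustive search. The sketch below is our own illustration (exponential-time, usable only for very small n, which is consistent with the NP-completeness shown next); it returns one largest conflict-free subset.

```python
from itertools import combinations, permutations

def max_conflict_free(labels, n, u=0):
    """Largest conflict-free subset of objects {0, ..., n-1} by brute force.
    labels[a][x] is expert a's cluster id for object x; u is the generic cluster."""
    def conflicts(i, j, k):
        # a conflict may appear under any ordering of the three objects
        for x, y, z in permutations((i, j, k)):
            for a in range(len(labels)):
                for b in range(len(labels)):
                    ca, cb = labels[a], labels[b]
                    if (ca[x] != u and ca[x] == ca[y] and ca[y] != ca[z] and
                            cb[x] != cb[y] and cb[y] == cb[z] and cb[z] != u):
                        return True
        return False

    # try subset sizes from n down to 0; the first conflict-free subset wins
    for size in range(n, -1, -1):
        for subset in combinations(range(n), size):
            if not any(conflicts(i, j, k) for i, j, k in combinations(subset, 3)):
                return set(subset)
    return set()
```

On the Alice/Bob example, the full set of three objects conflicts, so the search drops one object and keeps a conflict-free pair.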

IV. TRACTABILITY ANALYSIS

In this section, we prove that the decision version of the CFDC problem is NP-complete. We first introduce the Minimum k-Path Vertex Cover Problem (k-PVCP) [6], which is: given a graph G and a positive integer t, decide whether there is a k-path vertex cover S for graph G of size at most t. The k-path vertex cover for graph G(V, E) is defined as a subset of vertices S ⊆ V(G) which contains at least one vertex on every path of order k in G. Note that a path of order k has exactly k - 1 distinct edges on it. Brešar et al. have shown in [6] that for any fixed integer k ≥ 2 the k-PVCP problem is NP-complete. We now establish the NP-completeness of the CFDC problem based on the 3-PVCP problem.

Theorem 1: The CFDC problem is NP-complete.

Proof: First, it is not hard to see that CFDC is in NP. Given any subset H ⊆ W, we need O((|H| choose 3)·(m choose 2)), or O(n³m²), time to verify whether a conflict exists.

Given any 3-PVCP instance, we construct a CFDC instance as follows. For each node vj in the graph G(V, E) of the 3-PVCP instance, where 1 ≤ j ≤ |V|, we have a data object wj. For each edge ei in G, where 1 ≤ i ≤ |E|, we have a corresponding expert Xi. Hence, we have n = |V| and m = |E|. For each edge ei = (vj, vj′) ∈ E, we let ci,j = ci,j′ ≠ u, and ci,k = u for all k ≠ j, j′. We now prove that H, the subset of data objects selected, has at least n - t items if and only if there is a 3-path vertex cover S of size at most t.

'⇐': Suppose that we have a solution S of size at most t to the 3-PVCP problem, which is to say, any path of order 3 must have a vertex belonging to S. For any conflict to occur under a subset of selected data objects, there must exist two experts Xi1 and Xi2 whose cluster assignments to three selected data objects wj1, wj2, wj3 follow the pattern below (where c1, c2 ≠ u):

Xi1:  c1  c1  u
Xi2:  u   c2  c2

and there must exist two edges (vj1, vj2) ∈ E and (vj2, vj3) ∈ E, which correspond to a path of order 3 in G, (vj1, vj2, vj3).


Hence, if we do not select any data object that corresponds to a node in set S, the remaining data objects must not lead to any conflict. As there are at least n - t such data objects, this solves the CFDC problem.

'⇒': Suppose that we have a solution H of size at least n - t to the CFDC problem, which is to say, the data objects in H do not cause any conflict. Given how we construct the CFDC instance, the subgraph of G induced by the nodes corresponding to only those data objects in H must not have a path of order 3. Hence, the set of nodes that do not belong to this subgraph forms a 3-path vertex cover of the original graph G, and has at most t nodes in it. This thus solves the 3-PVCP problem. ∎

Given that the CFDC problem is NP-complete, there does not seem to be a polynomial-time algorithm that solves it optimally unless P = NP. In the following, we propose an efficient approach to the CFDC problem, although the solution it finds may not be optimal. Our overall algorithm comprises two phases, Phase I and Phase II. In Phase I, we aim to choose the maximum subset of data objects in W that does not cause any conflict among the selected data objects. For this phase, we first propose a solution without consideration of constraints on computational resources, particularly memory. We call this solution single-level conflict-free data clustering, whose meaning will become self-evident when we describe it in Section V. To address concerns with computational resources, we propose an alternative solution, multi-level conflict-free data clustering, which will be discussed in Section VI. In Phase II, we group the data objects selected in Phase I into clusters based on finding connected components in k-partite graphs. This will be described in Section VII.
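The concrete Phase I schemes are developed in the following sections; as a rough illustration of what Phase I computes, the sketch below is a minimal greedy heuristic of our own (not one of the paper's three schemes) that approximates an independent set of a 3-uniform conflict graph by repeatedly discarding the vertex incident on the most surviving hyperedges.

```python
def greedy_independent_set(vertices, hyperedges):
    """Greedy heuristic: drop the vertex hitting the most remaining
    3-uniform hyperedges until none is fully contained in the survivors.
    Returns an independent set (not necessarily a maximum one)."""
    alive = set(vertices)
    edges = [set(e) for e in hyperedges]
    while True:
        # hyperedges still fully contained in the surviving vertex set
        live = [e for e in edges if e <= alive]
        if not live:
            return alive
        degree = {}
        for e in live:
            for v in e:
                degree[v] = degree.get(v, 0) + 1
        # drop the highest-degree vertex (ties broken by smallest id)
        alive.remove(max(degree, key=lambda v: (degree[v], -v)))
```

Every surviving vertex set is conflict-free by construction, since no conflict hyperedge remains fully inside it; the heuristic gives no optimality guarantee.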

V. SINGLE-LEVEL CONFLICT-FREE DATA CLUSTERING

The input to our solution is a matrix C ∈ ℕ^(n×m) in which element C(i, j) gives the cluster identifier of data object wi grouped by expert Xj, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. Without loss of generality, we assume that the generic cluster u always corresponds to cluster 0 for any expert. Hence, each row of matrix C records how the corresponding data object is grouped by all the experts.

A. Step 1: Condensation of clustering matrix

We first condense the clustering matrix C into another matrix D, in which each row corresponds to a maximal subset of data objects in W that are identically grouped by every expert, and its jth column corresponds to the jth column in matrix C. Hence, no two rows in matrix D are exactly the same. We use Si to denote the set of data objects corresponding to the ith row of matrix D, and let f(j), where 1 ≤ j ≤ n, denote the index of the set that data object wj belongs to. Thus, Sf(j) contains data object wj.

The condensation step can be implemented with the decision tree data structure shown in Figure 1. More specifically, for each row (d1, d2, ..., dm) in matrix C, we start from the root node of the decision tree and follow the path formed by the cluster identifiers (d1, d2, ..., dm) assigned by experts X1, X2, ..., Xm; at the leaf we find the list to which the data object corresponding to this row is appended. Then for each non-empty list at a leaf node of the decision tree, we add a row to the condensed matrix D. It is easy to see that the data objects in each row of D share the same grouping results by all the experts. The condensation step takes O(mn) time and O(mn) space, where we recall that n and m are the numbers of rows and columns in matrix C, respectively.

Fig. 1. Illustration of matrix condensation (leaf lists hold, e.g., the data objects grouped into cluster 1 by X1, cluster 1 by X2, and cluster 0 by X3, or into cluster 1 by X1, cluster 2 by X2, and cluster 1 by X3)
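A minimal sketch of this step (our own; a Python dict keyed on the full row tuple stands in for the decision tree, since both map each object to its leaf list in O(m) per row):

```python
def condense(C, u=0):
    """Condense clustering matrix C (n rows = objects, m cols = experts)
    into the matrix D of distinct rows, plus the object sets S_i."""
    row_index = {}          # row tuple -> row id in D
    D, S = [], []
    for obj, row in enumerate(C):
        key = tuple(row)
        if key not in row_index:
            row_index[key] = len(D)   # new leaf: new row of D
            D.append(list(key))
            S.append(set())
        S[row_index[key]].add(obj)    # append object to its leaf list
    return D, S
```

For instance, objects 0 and 2 below are grouped identically by all three experts and collapse into a single row of D.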

B. Step 2: Construction of 3-uniform hypergraph

Once we obtain a condensed matrix D from the original clustering matrix C, we next construct a 3-uniform hypergraph which reflects the conflict relationships among sets of data objects. A hypergraph is a generalization of a graph in which each edge can be incident on any number of vertices. A k-uniform hypergraph is a special type of hypergraph in which each edge is incident on exactly k vertices.

Our goal is to build a 3-uniform hypergraph in which each vertex corresponds to a row in matrix D, and there is an edge among vertices vi, vj, and vk if and only if the sets of data objects corresponding to the ith, jth, and kth rows in matrix D lead to a conflict when at least one data object is selected from each of these sets. A naive way of constructing such a 3-uniform hypergraph is to try every combination of three rows in matrix D and check whether there is a conflict among them. Doing this, however, would take O(l³m²) time, where l denotes the number of rows in the condensed matrix D, as there are (l choose 3) combinations of three rows and, for each combination, we need to check the possibility of conflicts between the grouping results of every pair of experts. In cases where there are still a large number of rows in matrix D even after condensation, this simple method poses significant computational overhead.
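For reference, the naive construction can be sketched as follows (our own illustration; rows_conflict tries every ordering of the triple and every ordered pair of experts, matching the conflict definition of Section III):

```python
from itertools import combinations, permutations

def rows_conflict(r1, r2, r3, u=0):
    """True iff the three row vectors of D can yield a conflict."""
    for x, y, z in permutations((r1, r2, r3)):
        for a in range(len(r1)):
            for b in range(len(r1)):
                if (x[a] != u and x[a] == y[a] and y[a] != z[a] and
                        x[b] != y[b] and y[b] == z[b] and z[b] != u):
                    return True
    return False

def naive_conflict_hypergraph(D, u=0):
    """Naive O(l^3 m^2) method: test all (l choose 3) row triples of D."""
    return {frozenset(t) for t in combinations(range(len(D)), 3)
            if rows_conflict(D[t[0]], D[t[1]], D[t[2]], u)}
```

On the two-expert example D = [[1, 1], [1, 2], [2, 2]], the only hyperedge connects all three rows, mirroring the Alice/Bob conflict from the introduction.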

Next we present an algorithm that achieves O(m^2 |E|) time complexity in constructing the 3-uniform hypergraph. We consider every pair of experts Xi and Xj, where 1 ≤ i, j ≤ m. We go through the i-th and j-th columns row by row. Suppose that the k-th row is (di, dj), where di and dj are the cluster identifiers assigned by experts Xi and Xj, respectively. If neither di nor dj is the generic cluster u, we append row id k to a list L in a data structure associated with the pair (di, dj).


Fig. 2. Construction of a 3-uniform hypergraph

We call this data structure a stamp, defined as (ci, cj, lf, lb, rf, rb, L), in which ci and cj are the cluster identifiers assigned by Xi and Xj, respectively. List L contains all the row ids in matrix D that are grouped as ci by Xi and as cj by Xj. Moreover, the two left pointers lf (forward) and lb (backward) form a linked list of stamps whose clustering results by expert Xi are the same ci, and similarly, the two right pointers rf (forward) and rb (backward) form a linked list of stamps whose clustering results by expert Xj are the same cj. An example of the data structure is illustrated in Figure 2.

It takes O(l) time to finish processing all rows for each pair of experts and populating all stamps, where we recall that l is the number of rows in the condensed matrix D. Next, we go through every stamp generated as described. Consider any stamp (ci, cj, lf, lb, rf, rb, L). Traversing its left pointers, we obtain a list of rows that are grouped as ci by Xi and as any cluster but cj by Xj, which we denote by Ql. Similarly, we traverse along the right pointers and obtain a list of rows that are grouped as cj by Xj and as any cluster but ci by Xi, which we denote by Qr. Then, for every triple (al, b, ar) where al ∈ Ql, ar ∈ Qr, and b ∈ L, we add an edge to the hypergraph that connects vertices v_al, v_b, and v_ar.
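The edge-generation step for one pair of experts can be sketched as follows. This is an illustrative reimplementation: hash maps keyed on (ci, cj) stand in for the paper's linked stamp lists, and `conflict_edges` and the toy matrix are names of our own choosing.

```python
from collections import defaultdict

U = None  # stands in for the generic cluster u

def conflict_edges(D, i, j):
    """Hyperedges contributed by the expert pair (i, j): a triple of rows
    (al, b, ar) conflicts when expert i groups al with b but expert j
    splits them, while expert j groups b with ar but expert i splits them."""
    stamps = defaultdict(list)             # (ci, cj) -> list L of row ids
    for k, row in enumerate(D):
        ci, cj = row[i], row[j]
        if ci is not U and cj is not U:
            stamps[(ci, cj)].append(k)
    edges = set()
    for (ci, cj), L in stamps.items():
        # Ql: rows matching ci under expert i but not cj under expert j
        Ql = [k for (a, b), rows in stamps.items()
              if a == ci and b != cj for k in rows]
        # Qr: rows matching cj under expert j but not ci under expert i
        Qr = [k for (a, b), rows in stamps.items()
              if b == cj and a != ci for k in rows]
        edges.update(frozenset((al, b, ar))
                     for al in Ql for b in L for ar in Qr)
    return edges

# Three condensed rows, two experts: rows 0 and 1 agree under expert 0,
# rows 0 and 2 agree under expert 1, so selecting from all three conflicts.
D = [[1, 1], [1, 2], [2, 1]]
assert conflict_edges(D, 0, 1) == {frozenset({0, 1, 2})}
```

Each emitted triple is distinct by construction: al and ar differ from b on one expert each, and al and ar disagree on expert i, so no two coincide.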

Adding all the edges to the hypergraph for each pair of experts takes O(|E|) time, where |E| is the number of edges in the hypergraph. Since we have to do it for every pair of experts, the overall time complexity is O(m^2 |E|), where m is the number of experts. In practice, m is very small and the time complexity is essentially O(|E|), which is obviously a lower bound for constructing the 3-uniform hypergraph.

Recall that Si is used to denote the group of malware instances corresponding to the ith row of matrix D. The accuracy of the algorithm hinges on the following theorem:

Theorem 2: For every edge added to the hypergraph which is incident on vertices vk1, vk2, and vk3, selecting at least one data object from each of Sk1, Sk2, and Sk3 leads to a conflict.

Proof: Suppose the edge incident on vertices vk1, vk2, and vk3 is added when processing the stamp (ci, cj, lf, lb, rf, rb, L). Hence, we have k1 ∈ Ql, k2 ∈ L, and k3 ∈ Qr, where Ql and Qr contain all the rows obtained


from traversing the left and right pointers, respectively. Every data object selected from the k1-th row in matrix D is grouped as ci by Xi and not as cj by Xj. Similarly, every data object selected from the k3-th row in matrix D is classified as cj by Xj and not as ci by Xi. Every data object selected from the k2-th row in matrix D is grouped as ci by Xi and as cj by Xj. As neither ci nor cj is generic (i.e., ci ≠ u, cj ≠ u), selecting all three data objects must lead to a conflict based on our definition given in Section III. This completes the proof. ∎

C. Step 3: Finding maximum independent set

The output from Step 2 is a hypergraph G(V, E) in which vi ∈ V corresponds to the i-th row of the condensed matrix D. We also associate each vertex vi ∈ V with a weight |Si|. That is to say, the weight of each vertex is the number of data objects belonging to the corresponding row in the condensed matrix D. It is noted that an independent set of a hypergraph is a subset of V that does not contain any (hyper)edge in E. We have the following theorem:

Theorem 3: Given any independent set I = {vi1, vi2, ..., vik} of hypergraph G(V, E) where I ⊆ V, selecting all malware instances belonging to the sets S = ∪_{j=i1,...,ik} Sj does not lead to any conflict.

Proof: Suppose that a conflict occurs when selecting data objects wa1, wa2, and wa3 from set S. According to the definition of a conflict, no two data objects among a1, a2, and a3 belong to the same row in matrix D. That is to say, f(a1), f(a2), and f(a3) are pairwise distinct, where we recall that f(a) is the row in matrix D that malware wa belongs to. Hence, the algorithm presented in Step 2 must produce a (hyper)edge in hypergraph G that connects vertices v_f(a1), v_f(a2), and v_f(a3). This contradicts the fact that I is an independent set, as I now covers the edge formed by vertices v_f(a1), v_f(a2), and v_f(a3). This completes the proof. ∎

As our goal is to find a maximum number of data objects among which there is no conflict, the problem is equivalent to finding a maximum weighted independent set of hypergraph G(V, E), i.e.,

argmax_I Σ_{j=i1,...,ik} |Sj|,    (1)

subject to: I = {vi1, ..., vik} is an independent set of G.

It is known that the problem of finding the maximum independent set of a hypergraph itself is NP-hard. We thus consider the following three different approaches to obtaining an independent set of hypergraph G(V, E) :

(1) Uniform sampling. The uniform sampling method adopts the randomized algorithm proposed in [30], which works as follows. For every vertex v ∈ V, we sample a random variable Xv from the uniform distribution on [0, 1). We define a permutation π of all vertices in V in which π(vi) < π(vj) if and only if Xvi < Xvj. We say that vi has a lower order than vj if π(vi) < π(vj). The independent set I is generated as follows. Consider each vertex v in V. If for every edge e that vertex v is incident on, vertex v has a lower order than at least one other vertex that is also incident on edge e, we then add


Fig. 3. Two examples of 3-uniform conflict graphs: (1) Conflict graph 1 and (2) Conflict graph 2. The weight of a node, shown next to the circle, indicates the number of data objects associated with that node.

vertex v to set I. It is obvious that the set I obtained thereby must be an independent set of hypergraph G. Following Theorem 2 in [30], we easily establish the following theorem:
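The random-order selection rule can be sketched in a few lines; `sampled_independent_set` is an illustrative name, and the `values` parameter (not in the paper) is included only so the same rule can be reused with other sampling distributions:

```python
import random

def sampled_independent_set(n, edges, values=None):
    """Random-order independent set in a 3-uniform hypergraph: each vertex
    draws a value; a vertex is kept iff, in every incident edge, some other
    vertex drew a larger value (i.e. the vertex is never an edge maximum)."""
    if values is None:
        values = [random.random() for _ in range(n)]  # uniform on [0, 1)
    return [v for v in range(n)
            if all(any(values[u] > values[v] for u in e if u != v)
                   for e in edges if v in e)]

# One hyperedge over vertices {0, 1, 2}: the edge's maximum is always
# dropped, so exactly two of the three vertices survive.
random.seed(42)
kept = sampled_independent_set(3, [frozenset({0, 1, 2})])
assert len(kept) == 2
```

The result is always independent: if all three vertices of an edge were kept, the edge's maximum would have no larger co-vertex in that edge, contradicting its selection.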

Theorem 4: Let d(v) denote the degree of vertex v in hypergraph G(V, E). The uniform sampling algorithm described above leads to a set of data objects without conflict of expected size φ_us(G(V, E)), where

φ_us(G(V, E)) = Σ_{vi ∈ V} |Si| / C(d(vi) + 1/2, d(vi)).    (2)

Note that for any integer k ≥ 0 and real number r, C(r, k) is defined as r(r − 1)...(r − k + 1)/k!.

A problem with uniform sampling arises when the weights of the nodes in the hypergraph have a highly skewed distribution. Consider the conflict graph shown in Figure 3(1). With probability 1/3, the uniform sampling method leaves out the node with 100 data objects, which is obviously suboptimal.

(2) Greedy method. The greedy method, in contrast to the uniform sampling method, takes the weight of each node into consideration. It ranks all nodes by weight in non-increasing order and considers each node one by one. If choosing the node under consideration leads to a conflict (i.e., an edge covered entirely by chosen nodes), then we do not select any data object associated with this node. Clearly, the greedy method favors nodes with high weights. This method, however, can also lead to a suboptimal solution. Consider the conflict graph shown in Figure 3(2). The greedy method selects both nodes 1 and 2, and as a result, 200 data objects are selected for lineage analysis. The optimal solution, however, contains 340 data objects, obtained by choosing nodes 3, 4, and 5 together with only one of nodes 1 and 2.
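A sketch of the greedy variant, exercised on a Figure 3(2)-style scenario. The exact weights and edge topology of the figure are assumptions made for illustration; `greedy_independent_set` is an illustrative name.

```python
def greedy_independent_set(weights, edges):
    """Weight-greedy variant: visit vertices in non-increasing weight order
    and keep a vertex unless keeping it would complete some hyperedge whose
    other two vertices were already kept."""
    kept = set()
    for v in sorted(range(len(weights)), key=lambda v: -weights[v]):
        if all(not (e - {v} <= kept) for e in edges if v in e):
            kept.add(v)
    return kept

# Assumed Figure 3(2)-style instance: two weight-100 nodes that conflict
# with each of three weight-80 nodes.
weights = [100, 100, 80, 80, 80]
edges = [frozenset({0, 1, 2}), frozenset({0, 1, 3}), frozenset({0, 1, 4})]
kept = greedy_independent_set(weights, edges)
assert sum(weights[v] for v in kept) == 200   # optimal here would be 340
```

Greedy locks in both weight-100 nodes first, after which every weight-80 node would complete an edge, reproducing the 200-versus-340 gap discussed above.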

(3) Beta sampling. The shortcomings of both the uniform sampling and greedy methods suggest that we should strike a balance between these two extremes. We thus consider a third method, which relies on the Beta distribution to skew towards nodes with high weights when sampling nodes in the conflict graph. The probability density function of the Beta distribution is given by:

f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β),    (3)

where B(α, β) is the normalization constant, and both shape parameters α and β are greater than 0. The mean of the


Beta distribution, which is α/(α + β), decreases monotonically with β when α is fixed. Let F(x; α, β) denote the cumulative distribution function, i.e., F(x; α, β) = ∫_0^x f(t; α, β) dt.

The Beta sampling scheme works as follows. For each vertex v in the conflict graph, we define β(v) as the weight of the node (i.e., the number of data objects associated with the node in the condensed clustering matrix). We sample random variable Xv from [0, 1] with the Beta distribution f(x; α, β(v)) for vertex v. After we sample for every vertex, the process of finding the independent set of the conflict graph is the same as in the uniform sampling scheme. For Beta sampling, parameter α is set empirically according to the problem under study. We shall experimentally evaluate its effects in Section VIII.
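The scheme maps directly onto Python's standard library, which provides `random.betavariate(alpha, beta)`; the function name and toy weights below are illustrative:

```python
import random

def beta_sampling_select(weights, edges, alpha=100.0):
    """Beta sampling sketch: vertex v draws Xv ~ Beta(alpha, weight(v)), so
    heavier vertices tend toward smaller values and are less likely to be
    the maximum of an incident hyperedge (the condition for exclusion)."""
    vals = [random.betavariate(alpha, w) for w in weights]
    return [v for v in range(len(weights))
            if all(any(vals[u] > vals[v] for u in e if u != v)
                   for e in edges if v in e)]

# The heavy node (weight 100) draws values near 0.5 while the light nodes
# (weight 1) draw values near 1, so the heavy node survives the conflict.
random.seed(7)
kept = beta_sampling_select([100, 1, 1], [frozenset({0, 1, 2})])
assert 0 in kept and len(kept) == 2
```

Since the mean α/(α + β(v)) decreases as the weight β(v) grows, heavy nodes rarely top an edge and are kept with high probability, which is exactly the bias the scheme is designed to introduce.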

We now analyze the expected number of data objects selected by the Beta sampling scheme. We have the following:

Theorem 5: Let N(v) denote the set of neighbors of vertex v in the conflict graph G(V, E). The Beta sampling scheme described above leads to a set of malware instances without conflict of expected size φ_bs(G(V, E)), where

φ_bs(G(V, E)) = Σ_{vi ∈ V} |Si| ∫_0^1 f(x; α, |Si|) Π_{vj ∈ N(vi)} F(x; α, |Sj|) dx,    (4)

where f(x; α, β) and F(x; α, β) are the probability density function and the cumulative distribution function of the Beta distribution, respectively.

Proof: Vertex vi is not selected when its sampled value is larger than those of all its neighbors. The probability with which this occurs is ∫_0^1 f(x; α, |Si|) Π_{vj ∈ N(vi)} F(x; α, |Sj|) dx. The theorem thus naturally follows. ∎

VI. MULTI-LEVEL CONFLICT-FREE DATA CLUSTERING

In situations where there are a large number of data objects, the conflict graph produced in Section V may be too large to fit in memory, which significantly limits the practical use of the algorithm. To address this concern, we further propose a multi-level conflict-free data clustering algorithm, based on the single-level solution presented in the previous section, that selects data objects that do not lead to conflicts even with limited computational resources.

The crux of the multi-level conflict-free data clustering algorithm is that, when faced with a big condensed clustering matrix, we split it into two halves and, for each half, obtain a set of data objects without conflicts; we then merge the two sets of data objects together, derive the corresponding conflict graph, and further select a conflict-free subset of data objects from the merged set. The details of the algorithm are presented in Algorithm 1 and illustrated in Figure 4.

The choice of parameter h in Algorithm 1 reflects the number of rows in a matrix that the algorithm can deal with directly without splitting. Provided that the original condensed clustering matrix D contains r rows, the number of levels in the tree illustrated in Figure 4, denoted by l, is ⌈log2(r/h)⌉ + 1. As the tree may not be balanced, the numbers of SPLIT and


Algorithm 1 Multi-level conflict-free data clustering
1: procedure SELECT(A)            ▷ Select rows from matrix A without conflicts
2:   if matrix A has more than h rows then
3:     (Atop, Abottom) ← SPLIT(A)
4:     Stop ← SELECT(Atop)
5:     Sbottom ← SELECT(Abottom)
6:     Amerge ← MERGE(A, Stop, Sbottom)
7:     Smerge ← FIND(Amerge)
8:     return Smerge
9:   else
10:    S ← FIND(A)
11:    return S
12:  end if
13: end procedure
14: procedure SPLIT(A)            ▷ Split matrix
15:   Split A into Atop and Abottom of (approximately) equal sizes
16:   return (Atop, Abottom)
17: end procedure
18: procedure MERGE(A, S0, S1)    ▷ Merge rows
19:   S ← S0 ∪ S1
20:   A' ← the matrix formed by rows in S of A
21:   return A'
22: end procedure
23: procedure FIND(A)             ▷ Find rows without conflicts
24:   Construct conflict graph H for A (see Section V-B)
25:   S ← independent set for H (see Section V-C)
26:   return S
27: end procedure
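The recursion of Algorithm 1 can be sketched compactly with FIND abstracted as a callback; `select_rows` and `find` are illustrative names, not the paper's code:

```python
def select_rows(rows, find, h=1000):
    """Multi-level conflict-free selection (a sketch of Algorithm 1):
    `rows` is the condensed matrix, `find` maps a list of rows to its
    conflict-free subset (conflict-graph construction plus independent
    set, abstracted away here), and h bounds the size handled directly."""
    if len(rows) <= h:
        return find(rows)                      # small enough: FIND directly
    mid = len(rows) // 2                       # SPLIT into two halves
    top = select_rows(rows[:mid], find, h)
    bottom = select_rows(rows[mid:], find, h)
    return find(top + bottom)                  # MERGE, then FIND again

# With a FIND that keeps everything, all rows survive the recursion.
rows = list(range(10))
assert select_rows(rows, lambda r: r, h=3) == rows
```

Because FIND is only ever applied to lists of at most roughly 2h rows (two already-filtered halves), the conflict graph built at any one step stays small enough to fit in memory.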

Fig. 4. Illustration of multi-level data clustering (calling SELECT(D)): internal nodes of the tree perform SPLIT, MERGE, and FIND, while leaf nodes perform FIND only.

MERGE operations are at most 2^(l−1) − 1, and the number of FIND operations is at most 2^l − 1.

Discussions: Although Algorithm 1 is implemented by splitting a big matrix into halves, it can be extended by splitting it into d parts, in which case the structure shown in Figure 4 becomes a d-ary tree. Moreover, to further facilitate scalable computation, the algorithm can be implemented in the MapReduce paradigm with a chain of MAP and REDUCE operation pairs. Suppose that the intended number of levels in the d-ary tree is l. Then, initially, we partition the condensed clustering matrix D into d^(l−1) parts, numbered from 0 to d^(l−1) − 1. The j-th mapper and reducer, where 1 ≤ j ≤ l, work as follows:

MAP_j: (K1_j, V1_j) → (K2_j, V2_j), where K1_j is the index of a node at the j-th level (from bottom to top) of the d-ary tree, and V1_j is a list of rows in matrix D. The map function performs the FIND operation in Algorithm 1. In the output, K2_j = K1_j / d (integer division) and V2_j is the list of rows selected by the FIND operation.

REDUCE_j: (K2_j, V2_j) → V3_j. The reduce function merges the d lists sharing the same key K2_j into a single list L, and puts the tuple (K2_j, L) into V3_j.

The pair of MAP_j and REDUCE_j functions are executed iteratively from j = 1 to l. Note that in the final round,


Fig. 5. Illustration of clustering selected data objects: rows of the condensed matrix form partition 0, with edges to the clusters assigned by experts Xa (partition 1) and Xb (partition 2). For example, row 3 holds data objects {6, 7} and row 4 holds data object {8}; the resulting two groups are {1, 2, 3, 4, 5, 8} and {6, 7}.

REDUCE_l does nothing but return the sole merged list as the final result. Chaining multiple MapReduce jobs is supported by Hadoop [19], a popular open-source framework that implements the MapReduce paradigm.
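A minimal local simulation of the chained rounds, assuming identity-style partition keys and a `find` callback for conflict-free selection (`multilevel_mapreduce` is an illustrative name, not a Hadoop API):

```python
def multilevel_mapreduce(parts, find, d=2):
    """Simulate the chained MapReduce rounds: each MAP applies FIND inside
    one partition; each REDUCE concatenates d sibling partitions (keyed by
    integer-dividing the partition index by d) for the next round."""
    while len(parts) > 1:
        mapped = [find(p) for p in parts]              # MAP_j: FIND per node
        parts = [sum(mapped[i:i + d], [])              # REDUCE_j: merge d siblings
                 for i in range(0, len(mapped), d)]
    return find(parts[0])                              # final FIND at the root

# Eight leaf partitions, identity FIND, binary tree: three rounds collapse
# them into a single list.
parts = [[i] for i in range(8)]
assert multilevel_mapreduce(parts, lambda r: r) == list(range(8))
```

With d = 2 and eight partitions the loop runs three rounds, mirroring the l = 4 level tree of Figure 4.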

VII. CLUSTERING BASED ON k-PARTITE GRAPH

The conflict-free data clustering algorithm, regardless of whether it is single-level or multi-level, produces a list of rows, denoted by L, in the original condensed clustering matrix. Choosing all sets of data objects on this list surely does not lead to any conflict. In the final step, we use an (m + 1)-partite graph to group the selected data objects, where we recall that m is the number of experts in our problem formulation.

An (m + 1)-partite graph contains m + 1 partitions, P0, P1, ..., and Pm. Partition P0 has |L| vertices, each of which represents a row on list L. Each partition Pi, where 1 ≤ i ≤ m, has |Fi| vertices, each representing a cluster produced by expert Xi. There are no edges among vertices inside the same partition, and the edges across different partitions are constructed as follows. Consider each row i selected onto list L, whose corresponding vertex in partition P0 is v(0)_i. Suppose that expert Xj classifies the set of data objects corresponding to this row into cluster c_{i,j}. Then, if c_{i,j} ≠ u, or equivalently c_{i,j} ∈ Fj, we add an edge between vertex v(0)_i and the vertex representing family c_{i,j} in partition Pj. Hence, edges only exist between partition P0 and the other partitions {Pi}_{i=1,...,m}.

Once we finish adding edges to the (m + 1)-partite graph for each row on list L, we obtain the set of its connected components, denoted by {Q1, Q2, ..., Qk}, where k is the number of connected components in the (m + 1)-partite graph. For each connected component Qi, where 1 ≤ i ≤ k, we construct a group of data objects as follows: we extract from it all vertices that belong to partition P0, and for each such vertex v, we obtain the set of data objects on the row in matrix D that corresponds to vertex v and add all the data objects in this set to the group.

Figure 5 illustrates how we group data objects together based on the chosen list L = {1, 2, 3, 4}. As there are two connected components in the 3-partite graph, the final two groups generated are {1, 2, 3, 4, 5, 8} and {6, 7}.
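Because edges run only between rows and (expert, cluster) label vertices, the connected components can be computed with union-find instead of an explicit graph. A sketch under assumed names (`cluster_selected_rows`, generic label `U = None`):

```python
def cluster_selected_rows(D, selected, U=None):
    """Group the selected rows of the condensed matrix D by the connected
    components of the (m+1)-partite graph: each row is linked to every
    non-generic (expert, cluster) label it carries; union-find replaces an
    explicit graph traversal."""
    parent = {}

    def find(x):                       # find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for r in selected:
        for expert, label in enumerate(D[r]):
            if label is not U:
                parent[find(('row', r))] = find(('cl', expert, label))
    groups = {}
    for r in selected:
        groups.setdefault(find(('row', r)), []).append(r)
    return list(groups.values())

# Rows 0 and 1 share cluster 1 under expert 0; row 2 stands alone.
D = [[1, 1], [1, 2], [2, 3]]
assert sorted(cluster_selected_rows(D, [0, 1, 2])) == [[0, 1], [2]]
```

Expanding each returned row into its set of data objects then yields the final groups, as in the Figure 5 example.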


VIII. EXPERIMENTS

In this section, we show how our proposed method can be used in a specific application domain, malware analysis. We first provide a brief introduction to the importance of the problem, and then use a real-world malware dataset to demonstrate the performance of our proposed method. As Python is the common language adopted in our prototype system for automated malware classification [37], [23], we use it to implement our algorithms. Python is also widely used by the malware analysis community; for instance, the popular IDA Pro provides a Python interface to its functionalities. All our experiments are performed on a Linux desktop with a 1.6GHz dual-core processor and 6GB of memory.

A. Primer on Malware Analysis

The sheer volume and variety of malware rampant in cyberspace pose severe threats to its security. In 2011 alone, the number of new malware variants detected by Symantec increased by 40% over the previous year and reached as many as 400 million [20]. As the majority of malware variants belong to only a small number of families [26], an important task in malware analysis is to classify malware samples into their corresponding families. For malware samples belonging to the same family, we can study the common characteristics shared among the variants, which can be extracted as signatures for malware detection on end users' computers, and also investigate the evolution pattern and trend of the entire family.

One hurdle facing malware analysis is obtaining labeled malware samples with family information to bootstrap the process of automated malware classification [37]. Manually reverse-engineering each malware variant to identify its family is difficult, if not impossible, as the process is time consuming and demands advanced malware analysis techniques. Hence, a common practice is to rely on the classification results of existing AV software. However, different AV software may classify the same malware instance into different families, as observed by a number of previous efforts (e.g., [3]). Even for the same malware family, different AV software may use different names. For instance, a malware variant classified as a Vundo instance by McAfee may be named Virtumonde by NOD32, or Monderb by Kaspersky. Hence, the family names used by an AV software have only symbolic meaning. What further complicates malware classification is that AV software may classify malware at different resolutions. Using the same Vundo family as an example, we have seen five family names used by Kaspersky: Mondera, Monderb, Monderc, Monderd, and Mondere.

For the purpose of malware analysis, we want to find as many malware instances as possible that we are confident indeed belong to the same family. With a larger number of samples of the same family at hand, the observations we make about this family from these samples are more likely to be shared by other unseen samples that also belong to this family. Hence, signatures extracted from a larger pool of


samples in the same family would be more useful to detect future instances belonging to this family.

B. Malware Dataset

The malware dataset we use in this work was obtained from Offensive Computing [28]. It contains 526,179 unique malware variants collected in the wild. We uploaded the MD5 signatures of all these malware variants to the VirusTotal website [34] and obtained the detection results of 43 AV software on all these malware variants. In this work, we focus on the detection results of five major commercial AV software: McAfee, ESET (NOD32), Kaspersky, Microsoft, and Symantec. For example, the malware variant whose MD5 digest is bd264800202108f870d58b466a1ed315 is detected by these five AV software as follows:

AV Software | Detection result | Family name
McAfee | Vundo.gen.m | Vundo
NOD32 | a variant of Win32/Adware.Virtumonde.NBG | Virtumonde
Kaspersky | Trojan.Win32.Monderb.gen | Monderb
Microsoft | Trojan:Win32/Vundo.BY | Vundo
Symantec | Packed.Generic.180 | GENERIC

The malware naming schemes adopted by these five AV software all follow the CARO standard [7]. We parse each detection result and extract the most specific name as the family name. In some cases (e.g., Packed.Generic.180 detected by Symantec in the above table), no unique name can be extracted, so the malware is assumed to be classified into a generic class, which corresponds to class u in our problem formulation (see Section III). Some malware instances cannot be detected by an AV software and are classified as benign by it. For some other malware instances, the VirusTotal service does not send back any classification result from a particular AV software. In all such cases, we use the generic class u to indicate that the corresponding AV software does not classify the malware instance into any particular family. As another example, the malware variant with MD5 028defcde6c438d0836a47df3d9992e9 is classified into the Vundo family by McAfee but into the Mondera family by Kaspersky, suggesting that AV software may classify malware variants at different resolutions.

Given the malware dataset, we are able to get detection results by the five AV software from VirusTotal for 448,790 malware instances. Hence, the original clustering matrix C contains 448,790 rows. The number of families identified by each AV software is shown in the following table:

AV | NOD32 | Symantec | Microsoft | Kaspersky | McAfee
Families | 12,933 | 10,087 | 20,964 | 17,087 | 16,329

The original clustering matrix C is condensed into 143,101 rows in the condensed clustering matrix D, suggesting that it is common for malware variants to receive identical classification results from the AV software.

C. Evaluation of conflict graph construction

We first evaluate the performance of our method presented in Section V-B, which constructs a 3-uniform conflict graph from a condensed clustering matrix. We randomly choose 1,000 rows from the condensed clustering matrix D as a


Fig. 6. Comparison of different schemes for finding a maximum independent set of the conflict graph, in terms of (1) rows and (2) malware instances selected. The results are shown as mean values with their 95% confidence intervals. Beta-x indicates sampling using the Beta distribution with parameter α = x.

candidate matrix D', and run our algorithm on D'. We repeat this 10 times; the mean execution time is 4.48 seconds with a standard deviation of 0.64 seconds. For comparison, we also test the performance of a naive algorithm that exhaustively checks every three rows in D' to find whether they lead to a conflict (i.e., a new edge). Its mean execution time is 2416.79 seconds with a standard deviation of 28.45 seconds. Clearly, our algorithm significantly reduces the execution time of constructing a conflict graph: on average, the naive algorithm used 539 times as much execution time as our method did.

D. Schemes of finding maximum independent set

We next compare the different schemes for finding the maximum independent set of a conflict graph discussed in Section V-C. Similarly, we randomly choose 1,000 rows from the original condensed clustering matrix, construct a conflict graph from them, and then use the three methods, greedy, uniform sampling, and beta sampling, to find an independent set of the conflict graph. For the beta sampling scheme, we vary parameter α among 1.0, 10.0, 100.0, 1000.0, 10000.0, and 100000.0. We repeat this experiment 20 times, and for each scheme, we obtain the number of chosen rows in the condensed clustering matrix and the number of malware instances eventually selected. The distinction here is due to the fact that each row in the condensed clustering matrix represents a malware group.

The results are depicted in Figure 6. We make the following observations. First, the greedy method tends to choose a much smaller number of rows from the condensed clustering matrix than the other methods. This confirms our intuition behind Figure 3(2): the greedy method prefers nodes with high weights, which prevents the selection of a larger number of rows that form conflicts with those high-weight nodes. By contrast, the random nature of both uniform sampling and beta sampling leads to a similar number of rows chosen (about 300 rows). Second, the greedy method and the uniform sampling scheme both eventually select a much smaller number of malware instances than the beta sampling scheme. This is because the beta sampling scheme overcomes the weaknesses of both the greedy method and the uniform sampling scheme, as demonstrated in Figure 3. We also notice that when we increase parameter α, the number of


Fig. 7. Effects of parameter h under the multi-level conflict-free malware selection scheme, in terms of (1) rows and (2) malware instances selected (x-axis: parameter h, varied among 500, 1000, 2000, 4000, and 8000). The result at each data point is shown as its mean value from 26 sample runs together with its 95% confidence interval. Beta sampling is used with α = 100.0.

malware instances selected also increases; after it reaches its peak, it slightly drops off. In our later experiments, we set α to 100 based on this observation.

E. Effects of parameter h in Algorithm 1

Recall that parameter h in the multi-level conflict-free malware selection scheme controls the resolution at which the FIND procedure is performed directly without further splitting the matrix. Figure 7 shows the effects of parameter h on the number of rows in the condensed clustering matrix and the number of malware instances eventually selected. We observe that when we increase parameter h, the number of rows chosen decreases slightly. Note that the x-axis is shown on a logarithmic scale. On the other hand, the number of malware instances varies little with different settings of parameter h. These results suggest that the performance of the multi-level conflict-free malware selection scheme is not sensitive to the choice of parameter h. In our later experiments, we always set parameter h to 1000.

F. Malware clustering

With the multi-level conflict-free malware selection scheme, we obtain a set of 119,820 conflict-free malware instances, which are further grouped into 11,861 clusters using the malware clustering scheme described in Section VII. Here, we note that there are 17,528 malware instances that are classified as generic by each of the five AV software; each of these malware instances is put into a separate malware cluster. In Figure 8, we depict the distribution of the sizes of the malware clusters. The curve clearly shows the highly skewed distribution of the cluster sizes. Note that both the x- and y-axes are presented on a logarithmic scale. For instance, the top 10 and top 100 largest clusters account for 32.4% and 55.4% of all chosen malware instances, respectively. Such a highly skewed distribution agrees well with the observation that the majority of malware variants belong to only a small number of malware families [26].

In Table I, we further show how the malware instances in the top 10 largest clusters are classified by each AV software. From this table, we can make the following observations. (1) We confirm that different AV software use different malware family names, although these names are shared among them in many cases. For instance, the largest malware cluster is


Fig. 8. Number of malware instances in a cluster against its rank in size (both axes on logarithmic scales).

Fig. 9. Execution time of our algorithm vs. data sizes (x-axis: number of divisions; two curves: non-I/O time and overall time).

identified as Allaple by NOD32, Microsoft, and Kaspersky, but as Rahack or RAHack by Symantec and McAfee. (2) The eighth largest cluster includes malware instances from both the Zlob and Vapsup families. Zlob is a computer trojan that stealthily installs malicious plugins for Microsoft Internet Explorer, such as the Vapsup adware [39]. Due to their close relationship, different AV software may set the boundary between these two families differently. Such inconsistency among different AV software explains why both families are grouped into the same cluster. (3) Different AV software may classify malware instances at different resolutions. In addition to the Vundo family mentioned earlier, the third largest cluster mainly includes instances classified as Agent by NOD32, but this cluster clearly includes two subgroups: a family of Trojan horses that download files and modify the Start Page of the Internet Explorer browser [11], and another that steals online video game credentials [16]. (4) For the same Vundo family, the Kaspersky AV software divides it into a number of families, including Monder, Mondera, Monderb, Monderc, Monderd, and Mondere. But why does the ninth largest cluster include instances classified only as Monder and Monderb by Kaspersky? We notice that some Monder* instances identified by Kaspersky are classified as Conhook variants by the Microsoft AV software or as Kryptik variants by NOD32. To avoid the potential conflicts caused by such confusion among different AV software, the beta sampling scheme adopted by our method chooses to leave out some instances that may indeed belong to the Vundo family.


TABLE I. CLASSIFICATION RESULTS OF MALWARE INSTANCES IN THE TOP 10 LARGEST CLUSTERS (t = 0)

Rank | Size | NOD32 | Symantec | Microsoft | Kaspersky | McAfee
1 | 15,279 | Allaple | Rahack | Allaple | Allaple | RAHack
2 | 4,971 | Hupigon | Graybird | Hupigon | Hupigon | BackDoor-AWQ
  | 1,817 | Hupigon | Graybird | Hupigon | Hupigon | Artemis
3 | 1,520 | Agent | Farfti | Farfti | Hmir | Farfti
  | 1,439 | Agent | Gampass | Mapdimp | OnLineGames | PWS-OnlineGames
  | 2 | generic | generic | generic | generic | PWS-Onlinegames
4 | 1,287 | Swizzor | Lop | C2Lop | Swizzor | Swizzor
  | 896 | Swizzor | Lop | C2Lop | Obfuscated | Swizzor
  | 285 | Swizzor | generic | C2Lop | generic | Swizzor
5 | 1,151 | Pacex | NsAnti | Gamanja | Krap | Generic PWS
  | 1,251 | Pacex | NsAnti | OnLineGames | Krap | PWS-Gamania
6 | 1,499 | Kryptik | generic | | |
  | 664 | Kryptik | generic | | |
7 | 474 | Banker | Bancos | | |
  | 391 | Banker | Bancos | | |
  | 372 | Banload | generic | | |
  | 369 | Banker | Bancos | | |
8 | 854 | Zlob | Zlob | | |
  | 556 | Vapsup | Zlob | | |
  | 294 | generic | generic | | |
9 | 758 | Virtumonde | Vundo | | |
  | 402 | Virtumonde | generic | | |
  | 388 | generic | generic | | |
  | 120 | Virtumonde | generic | | |
10 | 1,144 | Rbot | Spybot | | |
  | 256 | Rbot | IRCBot | | |
  | 6 | generic | IRCbot | | |
  | 1 | generic | IRCbot | | |

G. Execution Time

In another set of experiments, we study how the execution time of our algorithm changes with the number of malware instances in the dataset. For the original clustering matrix C, which contains 448,790 rows (see Section VIII-B), we randomly permute its rows and then divide it into n parts of equal length, where we vary n among 1, 2, 4, 8, 16, 32, 64, and 128. Next, for n = k, we run our algorithm on each part 128/k times. Hence, for each number of divisions n, we run our algorithm 128 times in total, and then obtain the average execution time. The results are shown in Figure 9, including execution times both with and without the I/O operations (e.g., reading the original clustering matrix from disk and printing out the clusters in the final partition). As the standard deviation of the execution time is small in each case, we omit it to avoid overcrowding the plot. Also note that the x-axis in Figure 9 is shown on a logarithmic scale.
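The measurement procedure above can be sketched as follows. The `algorithm` argument stands in for the consensus-clustering routine, which is not shown here; everything else follows the experiment's description (shuffle, split into n equal parts, 128/n runs per part, average over all 128 runs):

```python
import random
import time

def benchmark(rows, divisions, algorithm, total_runs=128):
    """Average per-part execution time of `algorithm` over the rows of a
    clustering matrix: randomly permute the rows, split them into
    `divisions` parts of equal length, and run each part
    total_runs/divisions times, so every setting of `divisions`
    amounts to the same total number of runs."""
    rows = rows[:]                       # leave the caller's matrix intact
    random.shuffle(rows)                 # random row permutation
    part_len = len(rows) // divisions
    parts = [rows[i * part_len:(i + 1) * part_len] for i in range(divisions)]
    runs_per_part = total_runs // divisions
    timings = []
    for part in parts:
        for _ in range(runs_per_part):
            start = time.perf_counter()
            algorithm(part)
            timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

For example, `benchmark(matrix_rows, 4, my_algorithm)` times 32 runs on each of 4 quarter-sized parts and averages over all 128 runs.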

From Figure 9, we observe that in each case our algorithm is able to find the consensus among the clusterings of about 450K malware instances by multiple AV software within 25 minutes on a commodity PC. Also note that our algorithm is implemented in Python, suggesting that we can further accelerate its execution by implementing it in a computationally more efficient language such as C/C++.

Another observation is that the execution time of our algorithm does not scale linearly with the number of malware instances in the dataset; instead, it scales super-linearly. This is because as the number of malware instances increases, the number of conflicting relationships among them grows super-linearly. That is to say, the number of superedges in the conflict graph also grows super-linearly with the number of malware instances.


For instance, consider two sample runs where the number of divisions n is 128 and 64, respectively. When n is 128, the number of superedges in the conflict graph is 24,034,993, and when n is 64, it increases to 166,732,849. Hence, when the number of malware instances doubles, the number of superedges in the conflict graph increases by almost six times. Moreover, when n is 32, this number becomes so large that, even running on a high-end server, our Python-based counting program could not finish within two weeks. Hence, we should not expect the execution time to scale linearly with the number of malware instances in the dataset.

H. Tunable confidence in malware selection

Our proposed method aims to find malware clusters among malware instances selected without leading to conflicts. One may argue that in some scenarios our scheme may be overly optimistic, because any AV software may wrongly classify a variant into a specific family (Type I errors). For instance, for the third row of the eighth largest malware cluster in Table I, only the Microsoft AV software classifies it into the Zlob family; hence, if the Microsoft AV software makes a mistake here, malware variants that do not belong to the family are misclassified. Intuitively, for a malware instance that the majority of AV software agree belongs to a family, we have more confidence in classifying it into that family than for one that only a minority of AV software classify into that family.

The challenge, however, is that we do not assume a priori knowledge about the correspondence among the malware family names used by different AV software. Such an exact mapping actually does not exist, as different AV software classify malware variants at different resolutions.


Fig. 10. Numbers of selected malware instances (1) and malware clusters (2) generated when we vary parameter t

To select malware variants with high confidence, we can apply another heuristic alongside the conflict-free malware selection principle: before we construct the conflict graph from a condensed clustering matrix, we further simplify the matrix by choosing only those rows for which at least t AV software classify the corresponding malware instances into a family other than the generic one. The results presented so far correspond to the case where t = 0.
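This filtering heuristic can be sketched directly; here the condensed clustering matrix is assumed to be represented as (label-tuple, count) pairs like the rows of Table I (an illustrative representation, not the paper's actual data structure):

```python
def filter_condensed(condensed_rows, t, generic="generic"):
    """Keep only condensed-matrix rows whose label tuple has at least t
    entries naming a concrete family, i.e., a family other than the
    generic label. With t = 0, every row is kept."""
    return [(labels, n) for labels, n in condensed_rows
            if sum(1 for f in labels if f != generic) >= t]

# Two toy rows: one with three concrete family labels, one all-generic.
rows = [(("Swizzor", "generic", "C2Lop", "generic", "Swizzor"), 285),
        (("generic", "generic", "generic", "generic", "generic"), 40)]
# t = 0 keeps everything; t = 2 drops the all-generic row.
```
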

In Figure 10(1), we show the number of malware instances selected when we vary parameter t. We observe that parameter t controls the number of malware instances selected by our algorithm in an almost linear fashion. The numbers of malware clusters under different settings of parameter t are illustrated in Figure 10(2). The curve shows a significant drop in the number of malware clusters generated from t = 0 to t = 1. This is because a significant number of malware instances in the original clustering matrix are classified into the generic family by each of the five AV software we consider. Recall that for such malware instances, we put each of them into a separate malware cluster. As we further increase parameter t, the slope of the curve becomes much less steep.

Next, we take a closer look at the top largest clusters when t = 2, which means that we only consider those malware instances that have been classified by at least two AV software into a malware family other than the generic one. Table II presents the classification results by the five AV software (in the order of NOD32, Symantec, Microsoft, Kaspersky, and McAfee) for the top 10 largest malware clusters (rows with a single malware instance are omitted). The results are similar to what we have seen in Table I. One interesting observation is that the Rbot and Sdbot malware families are grouped into the same cluster. It is known that the development of Rbot malware was influenced by Sdbot, whose source code was made public on the Internet [29]. Hence, there are malware instances about which AV software hold different opinions on whether they should be classified into the Sdbot or the Rbot family. This explains why malware instances belonging to both families may be clustered together.

I. Discussions

As the principle behind our work is to find common ground among classification results from multiple AV software, the overall performance of our method hinges on the classification


TABLE II
CLASSIFICATION RESULTS OF MALWARE INSTANCES IN THE TOP 10 LARGEST CLUSTERS (t = 2)
(labels in the order NOD32, Symantec, Microsoft, Kaspersky, McAfee)

Allaple, Rahack, Allaple, Allaple, RAHack

Hupigon, Graybird, Hupigon, Hupigon, BackDoor-AWQ
Hupigon, Graybird, Hupigon, Hupigon, Artemis

Swizzor, generic, C2Lop, generic, Swizzor
Swizzor, generic, C2Lop, Swizzor, Swizzor
Swizzor, Lop, C2Lop, Swizzor, Swizzor

OnLineGames, Gampass, OnLineGames, OnLineGames, PWS-OnlineGames
Pacex, NsAnti, OnLineGames, Krap, PWS-Gamania

Banker, Bancos, Banker, Banker, PWS-Banker
Banker, Bancos, Bancos, Banker, PWS-Banker
Banker, Bancos, Banload, Banker, PWS-Banker
Banload, generic, generic, Banload, PWS-Banker

Rbot, IRCBot, Rbot, Rbot, Sdbot
Rbot, Spybot, Rbot, Rbot, Sdbot
SdBot, Randex, Sdbot, SdBot, Sdbot
SdBot, Sdbot, Sdbot, SdBot, Sdbot

generic, generic, CNNIC, generic, Adware-BDSearch
generic, generic, CNNIC, generic, Adware-CDNHelper

PcClient, generic, PcClient, PcClient, BackDoor-CKB
PcClient, PcClient, PcClient, PcClient, BackDoor-CKB
PcClient, Formador, PcClient, PcClient, BackDoor-CKB

Agent, Farfti, Farfti, Hmir, Farfti

Kryptik, generic, generic, Redirector, generic

accuracy of individual AV software. If the majority of the AV software involved perform poorly, it may be difficult to select a large set of malware instances that they agree belong to the same families. Moreover, although domain knowledge about malware families is not assumed as a priori information for executing our proposed method, it is important for interpreting the clusters identified by our technique. This has been exemplified by our earlier explanations of the relationship between the Zlob and Vapsup families, and that between the Rbot and Sdbot families. Domain knowledge can be further leveraged to improve the accuracy of our method. If, for the same malware family, we know the correspondence among the family names used by different AV software, we can directly select malware instances for this family before executing our method. On the other hand, if we are sure that a correspondence between the family names used by two or more AV software must be wrong (e.g., a Swizzor instance classified by NOD32 cannot be a Vundo variant by McAfee), we can rule out such malware instances from the clustering matrix in advance. Also, other heuristics can be incorporated into our method to improve clustering accuracy. For example, a row with few instances in the condensed clustering matrix can be removed from malware selection if such malware instances are likely misclassified by at least one AV software, or if their value for later per-family analysis is limited.
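The two pruning heuristics just described can be sketched on the same (label-tuple, count) representation of the condensed clustering matrix; the rule format and names below are illustrative assumptions, not the paper's implementation:

```python
def prune(condensed_rows, min_count=1, forbidden=()):
    """Drop condensed-matrix rows backed by fewer than min_count
    instances, and drop rows matching a known-impossible label
    correspondence. `forbidden` is an iterable of rules, each a dict
    mapping an AV-engine index to a family name; a row is removed if
    it matches every (index, family) pair of some rule."""
    kept = []
    for labels, n in condensed_rows:
        if n < min_count:
            continue  # too few instances to trust this row
        if any(all(labels[i] == fam for i, fam in rule.items())
               for rule in forbidden):
            continue  # matches a correspondence known to be wrong
        kept.append((labels, n))
    return kept

# Rule from the example above: a Swizzor instance by NOD32 (index 0)
# cannot be a Vundo variant by McAfee (index 4).
rule = {0: "Swizzor", 4: "Vundo"}
rows = [(("Swizzor", "generic", "C2Lop", "generic", "Vundo"), 50),
        (("Allaple", "Rahack", "Allaple", "Allaple", "RAHack"), 15279),
        (("Hupigon", "Graybird", "Hupigon", "Hupigon", "Artemis"), 1)]
kept = prune(rows, min_count=2, forbidden=[rule])
# Only the Allaple row survives both heuristics.
```
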

IX. CONCLUDING REMARKS

This work tackles a fundamental problem in data mining and engineering: how to find common ground among experts' opinions on data clustering. Driven by real-world applications that demand high confidence in how data objects are grouped together, we formulate the problem rigorously and show that it is NP-complete. We then propose a lightweight technique to select data objects that can be clustered without leading to conflicts among experts' opinions. With a


malware dataset that contains hundreds of thousands of malware instances, we apply our proposed method to find common ground among the clustering results of major AV software. Our work offers a new direction for consensus clustering by striking a balance between the quality of the consensus clustering and the number of data objects chosen to be clustered.

REFERENCES

[1] M. Analoui and N. Sadighian. Solving cluster ensemble problems by correlation's matrix & GA. Intelligent Information Processing III, 2007.
[2] J. Azimi, M. Abdoos, and M. Analoui. A new efficient approach in clustering ensembles. In Intelligent Data Engineering and Automated Learning - IDEAL 2007, pages 395-405. Springer, 2007.
[3] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario. Automated classification and analysis of Internet malware. In Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection, 2007.
[4] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda. Scalable, behavior-based malware clustering. In Proceedings of the 16th Annual Network and Distributed System Security Symposium, 2009.
[5] P. Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data, pages 25-71. Springer, 2006.
[6] B. Bresar, F. Kardos, J. Katrenic, and G. Semanisin. Minimum k-path vertex cover. Discrete Applied Mathematics, 159(12), July 2011.
[7] http://www.caro.org/naming/scheme.html.
[8] Y.-C. Chiou and L. W. Lan. Genetic clustering algorithms. European Journal of Operational Research, 135(2):413-427, 2001.
[9] D. Cristofor and D. A. Simovici. An information-theoretical approach to clustering categorical databases using genetic algorithms. In Proceedings of the 2nd SIAM ICDM Workshop on Clustering High Dimensional Data, pages 37-46, 2002.
[10] S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9):1090-1099, 2003.
[11] http://www.symantec.com/security_response/writeup.jsp?docid=2007-072901-5957-99.
[12] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
[13] B. Fischer and J. M. Buhmann. Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(4):513-518, 2003.
[14] A. Fred. Finding consistent clusters in data partitions. In Multiple Classifier Systems, pages 309-318. Springer, 2001.
[15] A. Fred and A. K. Jain. Data clustering using evidence accumulation. In Proceedings of the 16th International Conference on Pattern Recognition, volume 4, pages 276-280. IEEE, 2002.
[16] http://www.symantec.com/security_response/writeup.jsp?docid=2006-111201-3853-99.
[17] R. Ghaemi, M. N. Sulaiman, H. Ibrahim, and N. Mustapha. A survey: clustering ensembles techniques. World Academy of Science, Engineering and Technology, 50:636-645, 2009.
[18] A. Goder and V. Filkov. Consensus clustering algorithms: comparison and refinement. In Proceedings of the 9th Workshop on Algorithm Engineering and Experiments, 2008.
[19] http://developer.yahoo.com/hadoop/tutorial/.
[20] Symantec Incorporation. Internet security threat report 2011 trends, 2012.
[21] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264-323, 1999.
[22] J. Jang, D. Brumley, and S. Venkataraman. BitShred: feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM Conference on Computer and Communications Security, 2011.
[23] D. Kong and G. Yan. Discriminant malware distance learning on structural information for automated malware classification. In Proceedings of the 19th ACM Conference on Knowledge Discovery and Data Mining (KDD '13), 2013.
[24] T. Li and C. Ding. Weighted consensus clustering. In Proceedings of the 2008 SIAM International Conference on Data Mining (SDM 2008).
[25] H. Luo, F. Jing, and X. Xie. Combining multiple clusterings using information theory based genetic algorithm. In Proceedings of the IEEE International Conference on Computational Intelligence and Security, 2006.
[26] Microsoft security intelligence report, January-June 2006.
[27] L. Nataraj, V. Yegneswaran, P. Porras, and J. Zhang. A comparative assessment of malware classification using binary texture analysis and dynamic analysis. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, 2011.
[28] http://www.offensivecomputing.net/.
[29] http://www.honeynet.org/node/53.
[30] H. Shachnai and A. Srinivasan. Finding large independent sets in graphs and hypergraphs. SIAM Journal on Discrete Mathematics, 18(3), 2004.
[31] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583-617, March 2003.
[32] A. Topchy, A. K. Jain, and W. Punch. Combining multiple weak clusterings. In Proceedings of the Third IEEE International Conference on Data Mining, 2003.
[33] A. Topchy, A. K. Jain, and W. Punch. A mixture model of clustering ensembles. In Proc. SIAM Intl. Conf. on Data Mining, 2004.
[34] https://www.virustotal.com/.
[35] C. Wang, Z. She, and L. Cao. Coupled clustering ensemble: incorporating coupling relationships both between base clusterings and objects. In Proceedings of the 29th IEEE International Conference on Data Engineering, 2013.
[36] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645-678, 2005.
[37] G. Yan, N. Brown, and D. Kong. Exploring discriminatory features for automated malware classification. In Proceedings of the 10th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA '13), 2013.
[38] Y. Zhao and Q. Zhu. Evaluation on crowdsourcing research: current status and future direction. Information Systems Frontiers, 2012.
[39] http://www.sophos.com/en-us/threat-center/threat-analyses/viruses-and-spyware/Troj~Vapsup-AD/detailed-analysis.aspx.