Machine learning methods for data security


  • Machine learning methods for data security

    Author: József Hegedüs

    Supervisor: Prof. Pekka Orponen

    Instructor: Doc. Amaury Lendasse

    2012

  • Intrusion detection systems

    host-based, network-based

    behavior-based

    blacklisting (THIS WORK)

    signature-based

    whitelisting


  • [Figure: number of (unique) hashes which occur in exactly a given number of files (not necessarily the same files), restricted to clean samples only; x-axis: number of files, y-axis: number of hashes (log scale).]


    [Figure: true positive rate vs. false positive rate (ROC) for the monthly sample sets 2009.01-2009.11.]


Methodology for Behavioral-based Malware Analysis and Detection using Random Projections and K-Nearest Neighbors Classifiers

    Jozsef Hegedus, Yoan Miche, Alexander Ilin and Amaury Lendasse
    Department of Information and Computer Science,
    Aalto University School of Science, FI-00076 Aalto, Finland

    Abstract
    In this paper, a two-stage methodology to analyze and detect behavioral-based malware is presented. In the first stage, a random projection reduces the variable dimensionality of the problem and simultaneously cuts the computational time of the classification task by several orders of magnitude. In the second stage, a modified K-Nearest Neighbors classifier is used with VirusTotal labeling of the file samples. This methodology is applied to a large number of file samples provided by F-Secure Corporation, for which a dynamic feature has been extracted during DeepGuard sandbox execution. As a result, the files classified as false negatives are used to detect possible malware that were not detected in the first place by VirusTotal. The reduced number of selected false negatives allows manual inspection by a human expert.

    I. Introduction

    Malware detection has been the subject of a large number of studies (see [1], [2], [3] and [4], [5], [6], [7], [8]); for example, the work of Bailey [9] using a signature-based malware detection approach has shown that recent malware types require additional information in order to obtain good detection.

    In this paper, an approach based on the extraction of dynamic features during sandbox execution is used, as suggested in [7]. In order to measure similarities between executable files, the Jaccard index is used to measure the similarities between hash values (encoding the dynamic feature values obtained from the sandbox). The hash values are transformed into a large number of binary values which can be used to compute the Jaccard index (see [10] for the original work in French or [11] in English). Unfortunately, the dimensionality of such a variable space does not allow the use of traditional classifiers in a reasonable computational time.

    A two-stage methodology is proposed to circumvent this dimensionality problem. In the first stage, a random projection reduces the variable dimensionality of the problem and simultaneously cuts the computational time by several orders of magnitude. In the second stage, a modified K-Nearest Neighbors classifier is used with VirusTotal [12] labeling of the file samples. This two-stage methodology is presented in section III. The practical implementation of the methodology and the results are discussed in section IV. The different parameters (the random projection dimension and the number of nearest neighbors) are also analysed in this section.

    As a global result, the methodology enables the identification of the false negatives from the classification. Such samples can then be used to detect possible malware that were not detected in the first place by the VirusTotal labeling. Thanks to the methodology, the reduced number of identified false negatives allows for a manual inspection by a human expert. Indeed, without this pruning of possibly malicious samples by the presented methodology, a manual inspection would not be possible, since reliable experts are scarce and their availability is highly limited.

    Using the proposed methodology and the know-how of one F-Secure Corporation expert, it has been possible to extract 24 malware candidates out of 2441 original candidates, of which 25% are surely malicious and 50%, which are probably malicious, have to be further investigated in order to obtain a decisive classification.

    In section II, the data gathering and sample labeling are described. Section III presents the two-stage methodology, while section IV shows the practical implementation, the results and the analysis of the results.

    II. Behavioral Data Gathering and Sample Labeling

    The data set used in this paper is focused on behavior-based malware analysis and detection. The former approach of signature-based malware detection cannot be considered sufficient anymore for reliable detection [9], [7]. Be it because of the development of polymorphic and metamorphic malware, or the approach of "flash" worms which only do some reconnaissance on the machines/network they scan for future deployment of targeted attacks, the need for execution-level identification is important.

    A. Sandboxing and Extracting Behavioral Features

    In this spirit, a currently popular approach [7], [6] is to sandbox the execution of the malware and analyze behavioral data extracted during the execution.

    It has recently been demonstrated in [8] that the use of public sandbox submission systems might reveal network information regarding the sandbox machine identity. Through submission of a decoy sample by an attacker, it becomes possible to blacklist the hosts on which the samples are sandboxed, and have the malware circumvent the sandbox execution and hence detection.

    The Norman sandbox development kit [13], released in 2009, enables security companies to gather the behavioral data obtained during sandboxed execution and analyze that data with a custom engine. This avoids the pitfall of a publicly available sandbox machine mentioned above.

    The results in this paper were obtained on the data of 32683 samples collected by F-Secure Corporation. The sample data were produced by F-Secure by running the samples through their sandbox engine [14], [15], [16], which resulted in large numbers of feature-value pairs extracted for each sample. Individual features may have a significant number of distinct values, and the values come in the form of hashes. The data cannot be considered complete, as the sandbox, for instance, may not be able to run some of the samples correctly or may miss relevant execution paths. The samples were labeled using an online sample analysis tool explained in the next section.

    B. Obtaining the Sample Labeling

    The VirusTotal [12] online analysis tool provides a simple interface for sample submission, returning a list of up to 43 (depending on the sample nature: executable, archive...) mainstream anti-virus software detection results. Among the most widely used and known are F-Prot, F-Secure, ClamAV, Antivir, AVG, BitDefender, eSafe, Avast, McAfee, NOD32, Norman, Panda, Symantec, TrendMicro, VirusBuster... See the VirusTotal web site for the full list of used engines [12].

    The result of the submission of a sample file is the number of engines which detected the sample as malware. Figure 1 is a histogram of the detection levels for the set of 32683 samples used in this paper. As can be seen, a large proportion of the set is detected by at least one engine as malware. Less than 2500 samples are actually not detected by any engine.

    In order to make the problem a binary classification one (i.e. identifying whether a sample should be considered malware or clean), an a priori and arbitrary threshold has been set on the number of engines detecting a sample as malware. It is considered that for a sample i, if the number m_i of engines identifying the sample as malware is such that 0 < m_i < 11, then the sample is discarded. The disadvantage is that these samples are not considered in the whole methodology and therefore not classified. Nevertheless, they also have no influence on the rest of the data set and the final classification results.

    Figure 1. Histogram made using 32683 executable samples and querying from www.virustotal.com how many anti-virus engines raise a flag for each sample. Thus for each sample k a number m_k is obtained. For a given value x on the x-axis, the y-axis shows for how many samples k it is true that m_k = x. (x-axis: number of engines detecting malware; y-axis: number of samples.)

    This is equivalent to setting a certainty threshold on the sample analysis, above which it can be considered as indeed malware (and no longer a set of false positives from m_i different engines). Therefore, samples with a number m_i of detecting engines strictly above 10 are kept and considered as malware (with a relatively high probability), and samples with 0 detecting engines are kept and considered as "unpredictable" (and possibly clean).

    Figure 2 illustrates the pruned set of samples, where only samples for which m_i = 0 or m_i > 10 are kept, which amounts to 21053 (out of the original set of 32683): 18612 considered as malware, and 2441 as possibly clean.

    It is clear that flagging the 2441 samples for which m_i = 0 as possibly clean is likely to hide a certain amount of false negatives (VirusTotal clearly states that m_i = 0 should in no way be considered as meaning clean). The meta-goal of this paper is to actually identify such samples which are potential false negatives, using a methodology based on the Jaccard similarity [11], [10] measure and K-Nearest Neighbors classifiers.
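    As a minimal illustration of this labeling rule, the sketch below (an assumed helper, not code from the paper) maps a VirusTotal detection count m_i to the three categories used here; the thresholds 0 and 10 are the ones chosen above.

```python
def label_from_detections(m_i: int) -> str:
    """Return 'malware', 'possibly_clean' or 'discarded' for a detection count m_i."""
    if m_i > 10:
        return "malware"          # kept, considered malicious with high probability
    if m_i == 0:
        return "possibly_clean"   # kept, but may still hide false negatives
    return "discarded"            # 0 < m_i < 11: ambiguous, removed from the data set

# Example: [label_from_detections(m) for m in (0, 3, 25)]
#          -> ['possibly_clean', 'discarded', 'malware']
```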

    III. Methodology

    The overall process can be summarized by Figure 3, with the dynamic feature extraction described in the previous section, followed by the actual methodology to identify potential false negatives, using a Random Projection approach and K-Nearest Neighbors classifiers (described in detail in sections III-C and III-B).

    Figure 2. The m_k distribution for the samples used for this histogram is identical to Figure 1, with the important difference that samples such that 0 < m_i < 11 are discarded. Here 2441 samples are depicted that can be considered as clean (m_i = 0), and 18612 samples that can be considered as malicious (m_i > 10). (x-axis: number of engines detecting malware; y-axis: number of samples.)

    Figure 3. Global schematic of the methodology (Sample -> Sandbox -> Dynamic Behavioral Feature -> Random Projection -> KNN -> Malware / Clean): a sample is run through the sandbox to obtain a set of dynamic features; the random projection approach then reduces the dimensionality of the problem while retaining most of the information conveyed by the original feature; finally, a K-Nearest Neighbors classifier in the random projection space gives a prediction on whether the studied sample is malware or not.

    A. Measuring Similarity between Executables

    In this section, an approach for measuring similarities between executables is detailed. Let A_i denote the set of hash values (produced by the sandbox) for file i. Then, the Jaccard similarity between two executables i, i' is calculated as

    J_Jaccard^{i,i'} = |A_i ∩ A_{i'}| / |A_i ∪ A_{i'}|.   (1)

    Similarly, the cosine similarity is given by

    J_cosine^{i,i'} = |A_i ∩ A_{i'}| / ( √|A_i| · √|A_{i'}| ).   (2)

    Note that the cosine similarity can be expressed as a scalar product. Denote by

    A = ∪_{i=1}^{N} A_i = {a_1, a_2, ..., a_D},   (3)

    where N is the total number of samples and D is the total number of unique hashes seen in all samples. Then, from an ordering of the set A, N binary (0/1-valued) vectors B_i can be constructed, each of D dimensions, such that

    |A_i ∩ A_{i'}| = <B_i, B_{i'}>,   (4)

    and |A_i| = ||B_i||². Here ||·|| denotes the vector norm and <·,·> denotes the scalar product. Since B_i is a binary vector (with coordinates 0 and 1 only), ||B_i||² is the number of coordinates of B_i that are equal to 1. So, the normalized scalar product of B_i and B_{i'} gives the cosine similarity:

    J_cosine^{i,i'} = <B_i, B_{i'}> / ( ||B_i|| · ||B_{i'}|| ).   (5)

    Using the relationship between the Euclidean distance

    D_euclidean = ||B_i - B_{i'}||   (6)

    and the cosine similarity in the case ||B_i|| = 1 and ||B_{i'}|| = 1, it appears that

    J_cosine = (2 - D_euclidean²) / 2.   (7)

    From Equation 7 it appears that a classification or clustering based either on the cosine similarity or on the Euclidean distance will yield the same result if the norm of the feature vectors is unity.
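    The following sketch (illustrative only, with toy hash sets rather than real CRC64 values) computes the Jaccard and cosine similarities of Equations 1 and 2 directly from two hash sets, and numerically checks the relation of Equation 7 for unit-norm binary vectors.

```python
import math
import numpy as np

def jaccard(A_i, A_j):
    """Eq. (1): |A_i ∩ A_j| / |A_i ∪ A_j| for two sets of hash values."""
    return len(A_i & A_j) / len(A_i | A_j)

def cosine(A_i, A_j):
    """Eq. (2): |A_i ∩ A_j| / sqrt(|A_i| * |A_j|), the cosine similarity of the
    corresponding binary membership vectors B_i, B_j."""
    return len(A_i & A_j) / math.sqrt(len(A_i) * len(A_j))

# Toy hash sets (hypothetical values, not real CRC64 hashes):
A, B = {1, 2, 3, 4}, {3, 4, 5}

# Check of Eq. (7): with unit-norm binary vectors, J_cosine = (2 - D_euclidean^2) / 2.
universe = sorted(A | B)
b_A = np.array([1.0 if h in A else 0.0 for h in universe])
b_B = np.array([1.0 if h in B else 0.0 for h in universe])
b_A, b_B = b_A / np.linalg.norm(b_A), b_B / np.linalg.norm(b_B)
d_euc = np.linalg.norm(b_A - b_B)
assert abs(cosine(A, B) - (2.0 - d_euc ** 2) / 2.0) < 1e-12
```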

    B. K-Nearest Neighbor Classification

    In this section, a standard method (K-NN, see for example [17], [18], [19], [20]) is described; it can be used to predict whether an unknown executable is malicious or benign. The essential assumption of the method is that malicious (resp. clean) executables are surrounded by malicious (resp. clean) executables in the D-dimensional Euclidean space spanned by the normalized vectors

    B_i / ||B_i||,   (8)

    with B_i the binary vectors defined in the previous section. This means that the more hashes two samples have in common, the closer they are in this space (assuming that the number of hashes in the two samples does not change). Let us denote the set of k nearest neighbors of sample i by N_k^i. The classification is based on the data provided by VirusTotal, that is, how many anti-virus engines have considered a given executable as malicious. Let us denote this number by m_i for sample i. In the Results section it is examined how well the m_{i'} of the neighboring samples i' ∈ N_k^i can actually predict whether the sample i in question is malicious or clean.

    It is important to mention that to predict whether a sample i is malicious or not, only neighboring samples are used and not the sample itself. This corresponds to a Leave-One-Out [21], [22], [23], [24] (LOO) classification rate when it comes to assessing the accuracy of the K-NN classifier in the Results section. In [21], [22], it is shown that the Leave-One-Out estimate approximates well the generalization performance of a classifier if the number of samples is large enough, which is the case in these experiments. As the dimensionality of B_i is too large, random projections are used in order to reduce this dimensionality and therefore reduce the needed computational time and memory by several orders of magnitude. Random projections are explained in the following section.
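    A minimal leave-one-out K-NN predictor along these lines might look as follows; the brute-force distance computation and the array layout are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def knn_mean_detection(Y: np.ndarray, m: np.ndarray, k: int) -> np.ndarray:
    """Leave-one-out k-NN: for each sample i, average the VirusTotal detection
    counts m of its k nearest neighbours, excluding the sample itself.
    Y is an (N, d) array of unit-norm feature vectors, m an (N,) array of counts.
    Brute-force O(N^2) distances; a tree-based index would be used at scale."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                          # exclude the sample itself (LOO)
    nbrs = np.argsort(d2, axis=1)[:, :k]                  # indices of the k nearest neighbours
    return m[nbrs].mean(axis=1)                           # predicted mean detection count
```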

    C. Random Projections

    As mentioned earlier, the cosine similarity is calculated as

    J_cosine^{i,i'} = <B_i, B_{i'}> / ( ||B_i|| · ||B_{i'}|| ).   (9)

    However, for practical purposes storing the vector B_i is inconvenient, as it requires too much memory (even if stored as a sparse vector). The reason for this is that D, the dimensionality of B_i, is in the range of a few millions. In order to alleviate this memory (and the related time) complexity, random projections are used. For the matter of projecting to a lower-dimensional space, Johnson and Lindenstrauss [25] have shown that for a set of N points in d-dimensional space (using a Euclidean norm), there exists a linear transformation of the data toward a d_f-dimensional space, with d_f = O(ε^{-2} log(N)), which preserves the distances (and hopefully the topology of the data) to a 1 ± ε factor. Achlioptas [26] has recently extended this result and proposed a simpler projection matrix that preserves the distances to the same factor as the Johnson-Lindenstrauss theorem mentions, at the expense of a probability on the distance conservation. For theory and other applications of random projections in machine learning and classification, see for example [27], [28], [29], [30], [31].

    To describe the random projection approach, let m ∈ A_i, and

    X_m^i = [X_{m,1}^i, X_{m,2}^i, ..., X_{m,d}^i],   X_m^i ~ N(0, I),   (10)

    such that X_m^i is independent of X_{m'}^{i'} if m ≠ m'; however, if m = m', then X_m^i = X_{m'}^{i'}. N(0, I) represents a d-dimensional standard normal distribution for which the covariance matrix is the identity matrix I.

    Then, for each file i the corresponding random projection is the d-dimensional random vector Y_i defined as

    Y_i = (1/√d) (1/√|A_i|) Σ_{m ∈ A_i} X_m^i.   (11)

    The scalar product of the random vectors gives the similarity J, which is a scalar-valued random variable. Using

    J^{i,i'} = <Y_i, Y_{i'}>   (12)

    and the definition of Y_i, one can see that Pr(J^{i,i} = 1) = 1. Also, if files i and i' do not have any hashes in common, i.e. A_i ∩ A_{i'} = ∅, then E(J^{i,i'}) = 0.

    As an illustrative example, let us calculate the expected similarity E(J^{i,i'}) by assuming that |A_i| = |A_{i'}| = l and |A_i ∩ A_{i'}| = k. Note that the Jaccard distance between i and i' in this case is k/l. Also, due to independence,

    E(<X_m^i, X_{m'}^{i'}>) = 0,   m ≠ m'.   (13)

    On the other hand, the following scalar product (in the case of matching hashes, m = m') has the chi-square distribution:

    <X_m^i, X_m^{i'}> ~ χ²(d),   (14)

    where χ²(d) denotes the chi-square distribution with d degrees of freedom, whose expectation value is d. Since only the <X_m^i, X_m^{i'}> terms contribute to E(J^{i,i'}), it can be deduced that

    E(J^{i,i'}) = k/l,   (15)

    which agrees with the Jaccard and cosine similarity in this case. Note that in general, if |A_i| ≠ |A_{i'}|, then E(J^{i,i'}) ≠ J_Jaccard, but still E(J^{i,i'}) = J_cosine. Therefore, the Jaccard index is approximated using the cosine similarity approach defined previously.
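    A possible implementation of the projection of Equation 11 is sketched below; deriving each X_m from a seed tied to the hash value is an implementation choice assumed here (it guarantees that the same hash always maps to the same Gaussian vector), not a detail stated in the paper.

```python
import numpy as np

def project(hashes: set, d: int = 300) -> np.ndarray:
    """Random projection Y_i of Eq. (11): sum a fixed standard-normal vector X_m
    for every hash m in the (non-empty) hash set, scaled by 1/sqrt(d) and
    1/sqrt(|A_i|)."""
    y = np.zeros(d)
    for m in hashes:
        rng = np.random.default_rng(m % (2 ** 63))   # seed tied to the hash value (assumption)
        y += rng.standard_normal(d)                  # X_m ~ N(0, I_d)
    return y / (np.sqrt(d) * np.sqrt(len(hashes)))

# The scalar product <project(A_i), project(A_j)> then approximates, in expectation,
# the cosine similarity of the underlying hash sets A_i and A_j.
```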

    IV. Results

    In this section, the Euclidean distance is used in the d-dimensional space spanned by the random projected representations Y_i of the samples. As noted earlier, the use of the Euclidean distance instead of the cosine similarity does not change the results presented in this section, as Pr(J^{i,i} = 1) = 1: the Y_i are normalized to unity.

    Figure 4. Illustration of the prediction accuracy of the K-NN method: histogram of the number of detecting engines for the k = 10 nearest neighbors. (x-axis: mean of the number of detecting engines of the 10 nearest neighbors; y-axis: number of samples; separate bars for samples with number of detecting engines = 0 and > 10.)

    A. Accuracy of K-NN Classifier

    An illustration of the prediction accuracy of the K-NN method (see section III-B) is shown in Figure 4, and described in detail in the following. Let N_10^i be the set of the 10 nearest samples to sample i; then the prediction of the K-NN method for m_i is the mean m̄_i of the values {m_{i'} : i' ∈ N_10^i}, expressed as

    m̄_i = |N_10^i|^{-1} Σ_{i' ∈ N_10^i} m_{i'}.   (16)

    For a given value x on the x-axis, the height of the bar on the y-axis shows for how many samples m̄_i = x, i.e. y(x) = |{i : m̄_i = x}|.

    The question is how well the numbers of detecting engines m_i given by VirusTotal compare with their predicted values m̄_i. In order to answer that question, the samples are divided into two categories: category 1 as "supposedly clean" (i.e. m_i = 0) and category 2 as "supposedly malicious" (i.e. m_i > 10). They are shown in Figures 4 and 5. Assuming that m̄_i = 0 means that sample i is predicted to be clean and that m̄_i > 10 means that sample i is predicted to be malicious, there would be a considerable amount of false positives. The number of false positives can be reduced by introducing a third class into the K-NN classifier: "unpredictable". The next section details the results obtained using this additional third class and a modified K-NN.

    B. Accuracy of Modified K-NN Classifier

    Figure 5 shows the prediction accuracy of the modified K-NN classifier. Now, the K-NN classifier has 3 classes: predicted to be clean, predicted to be malicious, and unpredictable. A sample i is classified as clean if m̄_i = 0. It is classified as malicious if m_{i'} > 10 for all i' ∈ N_10^i, i.e. if all the 10 nearest neighbors N_10^i of i are supposedly malicious. A neighboring sample i' is considered supposedly malicious if m_{i'} > 10, i.e. if it has been flagged as malicious by more than 10 AV engines.

    classified as malicious if mi > 10 : i N i10, i.e. ifall the 10 nearest neighbors N i10 of i are supposedlymalicious. A neighboring sample is considered suppos-edly malicious if mi > 10, i.e. if it has been flaggedas malicious by more than 10 AV-engines. Furthermore,

    -5 0 5 10 15 20 25 30 35 40 450

    200

    400

    600

    800

    1000

    1200

    1400

    Mean of the number of detecting engines of the 10 nearest neighbors

    Num

    ber o

    f sam

    ples

    number of detecting engines = 0number of detecting engines >10

    true negatives

    false positives

    true positives

    false negatives

    Figure 5. Prediction accuracy of the modified K-NN classifier.

    Furthermore, a sample i is considered to be unpredictable if it does not fulfill the requirements to be classified as clean or malicious. In the production of the histogram depicted in Figure 5, samples that are unpredictable are omitted. In Figure 5, the concepts of false negative, false positive, true positive and true negative are illustrated.

    Introducing the unpredictable class considerably improves the prediction accuracy for the two other classes. This improvement is due to the fact that the uncertainty on the neighbors is used to separate the predictable and unpredictable samples. An unpredictable sample is a sample i such that not all of its neighbors are either supposedly malicious (i.e. m_{i'} > 10) or supposedly clean (i.e. m_{i'} = 0).
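    The decision rule of the modified classifier can be written compactly as below; the function name and the NumPy representation of the neighbors' detection counts are assumptions of this sketch.

```python
import numpy as np

def modified_knn_label(neighbour_counts: np.ndarray) -> str:
    """Three-class rule of the modified K-NN: 'clean' only if all k nearest
    neighbours have m = 0, 'malicious' only if all have m > 10, and
    'unpredictable' otherwise. neighbour_counts holds the m values of N_k^i."""
    if np.all(neighbour_counts == 0):
        return "clean"
    if np.all(neighbour_counts > 10):
        return "malicious"
    return "unpredictable"
```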

    Figure 6. The entries of the confusion matrix (false positives, false negatives, true positives and true negatives) are plotted in this figure as a function of the parameter k, the number of nearest neighbors. In addition, the number of unpredictable samples is represented. (x-axis: number k of nearest neighbors; y-axis: amount, with one panel per entry.)

    C. Influence of the Number of Nearest Neighbors in the Modified K-NN Classifier on the Confusion Matrix

    In Figure 5 the notions of false positive, false negative, true positive and true negative are illustrated. A prediction for a sample i is considered to be a false positive if m_{i'} > 10 for all i' ∈ N_k^i and m_i = 0 are true at the same time. This means that all the k nearest neighbors N_k^i of sample i are supposedly malicious (m_{i'} > 10 for all i' ∈ N_k^i), while sample i itself is considered to be supposedly clean (m_i = 0). Similarly, a true positive means that m_{i'} > 10 for all i' ∈ N_k^i and m_i > 10 are true for sample i. Furthermore, false negatives are characterized by m_{i'} = 0 for all i' ∈ N_k^i and m_i > 10, while a true negative is a sample i for which m_{i'} = 0 for all i' ∈ N_k^i and m_i = 0 holds.

    The entries of the confusion matrix (false positives, false negatives, true positives and true negatives) are plotted in Figure 6 as a function of the parameter k, the number of nearest neighbors. Sample i is unpredictable if neither m_{i'} = 0 for all i' ∈ N_k^i nor m_{i'} > 10 for all i' ∈ N_k^i is true. The number of unpredictable samples increases monotonically with increasing k; this must be so, as increasing k by one introduces an additional condition that has to be fulfilled in order for a sample to be classified as predictable. In fact, if a sample is labeled as unpredictable for k, it cannot become predictable for k + l, l > 0.

    In Figure 6, one can note that the number of false and true negatives stops decaying at k = 40. However, at k = 40 the number of true and false positives is still decaying at a rapid rate. The reason for this difference might be that there are much fewer supposedly clean samples than supposedly malicious ones. Also, the cluster size distribution might be different for these two categories, which could manifest itself in these different decay behaviors in Figure 6.

    Figure 6 can be used to choose the parameter k that fits the needs of the user of the modified K-NN method. Furthermore, note the difference in the decay exponents for the true and false positive rates. If k is increased from 2 to 100, the number of true positives decreases from 17150 to 6204, while the number of false positives decreases from 531 to 17. The decrease in the true positives is 64% while the decrease in false positives is 97%. So, if one wants to increase the true positive/false positive ratio, it is advisable to increase the number of neighbors k. On the other hand, one should not forget that by increasing k one also increases the number of unpredictable samples. In order to limit this amount of unpredictable samples, the number of nearest neighbors k to use has been chosen as 11 for the final detection of the false negatives.
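    A sketch of how the entries of Figure 6 could be tallied for several values of k is given below; the data layout (per-sample lists of neighbor detection counts, sorted by distance and excluding the sample itself) is an assumption of this illustration.

```python
def confusion_over_k(neigh_m, true_m, ks):
    """Tally confusion-matrix entries for several values of k.
    neigh_m[i] lists the detection counts m of sample i's neighbours sorted by
    distance (sample i itself excluded); true_m[i] is sample i's own count,
    which in the pruned data set is either 0 or greater than 10."""
    rows = {}
    for k in ks:
        tp = fp = tn = fn = unpred = 0
        for counts, m in zip(neigh_m, true_m):
            top = counts[:k]
            if all(c > 10 for c in top):      # all neighbours supposedly malicious
                if m > 10:
                    tp += 1
                else:
                    fp += 1
            elif all(c == 0 for c in top):    # all neighbours supposedly clean
                if m == 0:
                    tn += 1
                else:
                    fn += 1
            else:                             # mixed neighbourhood: unpredictable
                unpred += 1
        rows[k] = dict(TP=tp, FP=fp, TN=tn, FN=fn, unpredictable=unpred)
    return rows
```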

    D. Influence of the Random Projection Dimension on the Confusion Matrix

    In the previous section, the dependency of the confusion matrix on the number of neighbors was discussed, and the dimension of the random projected vectors was fixed to d = 300. In this section, the effects of varying d on the confusion matrix are investigated. In order to have a very small number of false negatives and to demonstrate the influence of d, the number of neighbors k is chosen to be 30 in this section. Figure 7 shows the dependency of the confusion matrix on the number of dimensions d of the projected vectors. Clearly, increasing d improves the results: the number of unpredictable samples decreases while the true positives increase and the false positives decrease.

    The true and false negatives do not change much with increasing d. This might be related to the fact that at k = 30 the decay of true and false negatives in Figure 6 has almost completely stopped. So, even though the low value of d = 300 might mean that the distances in the d = 300-dimensional Euclidean random projected space are noisy compared to the D > 10^6-dimensional original space, the samples that are true and false negatives are insensitive to this noise.

    Figure 7 indicates that convergence in all confusion matrix elements can be reached by using d = 700. By increasing d even more, no significant improvement is observed.

    The necessity to use the random projection method is almost unavoidable: if one would like to use the original space (with dimensionality D > 10^6), the complexity of the problem (in terms of memory and computational time) can become an issue, as D has been as high as 5·10^8 in other related experiments. In this situation, if one wishes to calculate distances between vectors in the original space, then all the data needs to be located in memory (since the original space is spanned by all the hashes produced by the sandbox). Furthermore, here a set of samples of cardinality of the order of 10^4 has been considered. However, future experiments will be on the scale of 10^6 samples, where using the original space might become prohibitive.

    The total computational time needed to run the methodology on the 21053 samples is a few hours, using a Python implementation of the random projections and K-NN. In comparison, without the random projection approach, the computational time is estimated at a few weeks, due to the dimensionality of B_i.

    Finally, based on these results, one might improve the previously presented random projection method by using a different number of dimensions d for each pair of distances calculated. One could treat larger distances with less accuracy (lower d) while treating smaller distances with better accuracy (higher d). This is a possible direction for future research.

    E. Manual Analysis by a Human Expert and Further Work

    Using a d = 100 projection dimension and a modified K-NN with k = 11, 24 false negatives have been extracted out of the 2441 "possibly clean" files.

  • 0.9

    1.1

    1.3 x 104

    315

    325335

    true negatives

    2

    2.53

    false negatives

    80001000012000

    true positives

    0 100 200 300 400 500 600 700406080

    100

    d, dimension of the random projected vectors

    false positives

    unpredictable

    Amou

    nt

    Figure 7. Dependency of the elements of the confusion matrix withrespect to the number of dimensions d of the projected vectors.

    This reduced number allows the manual analysis by a human expert. According to an F-Secure Corporation expert, 25% of these 24 files are surely malicious, and 50% have a relatively high probability of also being malicious. The remaining 25% are considered as clean by the expert.

    Even with such a reduced number of candidates, a human analysis takes time and has high costs (especially if the 50% of unsure samples have to be further investigated). This shows the usefulness of the presented methodology, since it would be impossible to find enough highly qualified experts to analyze the initial 2441 possibly clean files.

    The same methodology will be applied in the future using different labeling than the one provided by VirusTotal. Also, different dynamic features will be investigated and eventually combined with some static features (code signatures, packer information...), and possibly other types of malware in the sample set.

    V. Conclusion

    In this paper, a robust two-stage methodology has been introduced in order to both perform classification of executable files and detect the files with the highest probability of being false negatives (malware that is labeled as "possibly clean"). It has been shown that the methodology is not only accurate but also reduces the computational time by several orders of magnitude. This makes the proposed methodology a valid candidate as a pre-processing tool to provide inputs to forensic experts in order to detect malware that has not yet been detected by the AV engines used in VirusTotal. Furthermore, this methodology can also be applied to other labelings. Also, new and different dynamic features will be investigated and combined with static features (code signatures, packer information...) extracted from the samples before sandbox execution. This will be the natural continuation of the presented work.

    Acknowledgments

    The authors of this paper would like to acknowledge F-Secure Corporation for providing the data and software required to perform this research. Special thanks go to Pekka Orponen (Head of the ICS Department, Aalto University), Alexey Kirichenko (Research Collaboration Manager, F-Secure) and Daavid Hentunen (Researcher, F-Secure) for their valuable support and many useful comments. This work was supported by TEKES as part of the Future Internet Programme of TIVIT. Part of the work of Amaury Lendasse and Alexander Ilin is funded by the Adaptive Informatics Research Centre, Centre of Excellence of the Finnish Academy.

    References

    [1] Y. Liu, L. Zhang, J. Liang, S. Qu, and Z. Ni, "Detecting trojan horses based on system behavior using machine learning method," in Machine Learning and Cybernetics (ICMLC), 2010 International Conference on, vol. 2, July 2010, pp. 855-860.

    [2] I. Firdausi, C. Lim, A. Erwin, and A. Nugroho, "Analysis of machine learning techniques used in behavior-based malware detection," in Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on, December 2010, pp. 201-203.

    [3] E. Menahem, A. Shabtai, L. Rokach, and Y. Elovici, "Improving malware detection by applying multi-inducer ensemble," Computational Statistics & Data Analysis, vol. 53, no. 4, pp. 1483-1494, 2009.

    [4] L. Sun, S. Versteeg, S. Boztas, and T. Yann, "Pattern recognition techniques for the classification of malware packers," in Information Security and Privacy, ser. Lecture Notes in Computer Science, R. Steinfeld and P. Hawkes, Eds. Springer Berlin / Heidelberg, 2010, vol. 6168, pp. 370-390.

    [5] J. Kinable and O. Kostakis, "Malware classification based on call graph clustering," Journal in Computer Virology, pp. 1-13, 2011.

    [6] A. Srivastava and J. Giffin, "Automatic discovery of parasitic malware," in Recent Advances in Intrusion Detection (RAID'10), ser. Lecture Notes in Computer Science, S. Jha, R. Sommer, and C. Kreibich, Eds. Springer Berlin / Heidelberg, 2010, vol. 6307, pp. 97-117.

    [7] C. Willems, T. Holz, and F. Freiling, "Toward automated dynamic malware analysis using CWSandbox," IEEE Security and Privacy, vol. 5, pp. 32-39, March 2007.

    [8] K. Yoshioka, Y. Hosobuchi, T. Orii, and T. Matsumoto, "Vulnerability in public malware sandbox analysis systems," in Proceedings of the 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet, ser. SAINT '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 265-268.

    [9] M. Bailey, J. Andersen, Z. Morley Mao, and F. Jahanian, "Automated classification and analysis of internet malware," in Recent Advances in Intrusion Detection (RAID'07), 2007.

    [10] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547-579, 1901.

    [11] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2005.

    [12] Hispasec Systemas, "VirusTotal analysis tool," 2011, http://www.virustotal.com.

    [13] Norman ASA, "Norman launches sandbox SDK," April 2009, http://www.norman.com/about_norman/press_center/news_archive/2009/67431/en.

    [14] F-Secure Corporation, "F-Secure DeepGuard: a proactive response to the evolving threat scenario," November 2006, http://www.rp-net.com/online/filelink/340/20061106%20F-secure_deepguard_whitepaper.pdf.

    [15] F-Secure Corporation, "F-Secure DeepGuard 2.0 - white paper," September 2008, http://www.f-secure.com/system/fsgalleries/white-papers/f-secure_deepguard_2.0_whitepaper.pdf.

    [16] F-Secure Corporation, "Information about System Control and DeepGuard," January 2011, http://www.f-secure.com/kb/2034.

    [17] D. Aha and D. Kibler, "Instance-based learning algorithms," in Machine Learning, 1991, pp. 37-66.

    [18] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?" in Int. Conf. on Database Theory, 1999, pp. 217-235.

    [19] C. Bishop, Neural Networks for Pattern Recognition, 1st ed. Oxford University Press, USA, Jan. 1996.

    [20] P. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice Hall, 1982.

    [21] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. New York: Chapman & Hall, 1993.

    [22] B. Efron and R. Tibshirani, "Improvements on cross-validation: The .632+ bootstrap method," Journal of the American Statistical Association, vol. 92, no. 438, pp. 548-560, 1997.

    [23] A. Lendasse, V. Wertz, and M. Verleysen, "Model selection with cross-validations and bootstraps - application to time series prediction with RBFN models," Lecture Notes in Computer Science, vol. 2714, pp. 573-580, 2003.

    [24] Q. Yu, Y. Miche, A. Sorjamaa, A. Guillén, A. Lendasse, and E. Séverin, "OP-KNN: Method and applications," Advances in Artificial Neural Systems, vol. 2010, no. 597373, February 2010, 6 pages.

    [25] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," in Conference in Modern Analysis and Probability, New Haven, USA, 1982, pp. 189-206.

    [26] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," J. Comput. Syst. Sci., vol. 66, no. 4, pp. 671-687, 2003.

    [27] S. Dasgupta, "Experiments with random projection," in Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, ser. UAI '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 143-151.

    [28] X. Fern and C. Brodley, "Random projection for high dimensional data clustering: A cluster ensemble approach," in International Conference on Machine Learning (ICML'03), 2003, pp. 186-193.

    [29] D. Fradkin and D. Madigan, "Experiments with random projections for machine learning," in KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2003, pp. 517-522.

    [30] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, "OP-ELM: Optimally-pruned extreme learning machine," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 158-162, January 2010.

    [31] S. Vempala, The Random Projection Method, ser. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, 2005, vol. 65.

A Two-Stage Methodology using K-NN and False Positive Minimizing ELM for Nominal Data Classification

    Yoan Miche(1), Anton Akusok(1), Jozsef Hegedus(1), Rui Nian(4) and Amaury Lendasse(1,2,3)

    (1) Department of Information and Computer Science, Aalto University, FI-00076 Aalto, Finland
    (2) IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
    (3) Computational Intelligence Group, Computer Science Faculty, University of the Basque Country, Paseo Manuel Lardizabal 1, Donostia/San Sebastián, Spain
    (4) College of Information and Engineering, Ocean University of China, Qingdao, 266003 China

    Abstract

    In this paper, a methodology for performing binary classification on nominal data under specific constraints is proposed. The goal is to classify as many samples as possible while avoiding False Positives at all costs, all within the smallest possible computational time. Under such constraints, a fast way of calculating pairwise distances between the nominal data available for all samples is proposed. A two-stage decision methodology using two types of classifiers then provides a fast means of obtaining a classification decision on a sample, keeping False Positives as low as possible while classifying as many samples as possible (high coverage). The methodology has only two parameters, which respectively set the precision of the distance approximation and the final tradeoff between False Positive rate and coverage. Experimental results using a specific data set provided by F-Secure Corporation show that this methodology provides a rapid decision on new samples, with a direct control over the False Positives.

    Email address: {yoan.miche,anton.akusok,jozsef.hegedus,amaury.lendasse}@aalto.fi, [email protected] (Yoan Miche, Anton Akusok, Jozsef Hegedus, Rui Nian and Amaury Lendasse)

    Preprint submitted to Elsevier July 4, 2012

    1. Introduction

    Classification problems relying solely on the distances between the different samples are common in genetics [1], or in syntactic and document resemblance problems [2, 3]. The reason for the direct use of the distance matrix in these setups is that the original data does not lie in a Euclidean space, but is usually nominal data, i.e. without any sense of ordering between two different values. As such, distance matrices usually need to be calculated using non-Euclidean metrics.

    The interest of this paper is the problem of binary classification for such nominal data problems, under certain specific constraints: zero False Positives, high coverage and small computational time.

    While the high coverage constraint is rather typical (achieving the highest True Positive and True Negative rates possible), the zero False Positive constraint is not. In addition, the False Negatives are not regarded as very important in this problem setup: even if lowering False Negatives means increasing the coverage, the most highly regarded requirement is on the False Positives.

    As mentioned, the fact that the data is nominal makes it mandatory to use methods which directly deal with the distance matrix. A means of computing this distance matrix is first described, using an approximation technique based on Min-Wise independent hash function families.

    The following Section 2 describes a very specific application of this proposed methodology to malware detection for computer security. This application is exactly framed by the previously mentioned constraints. In addition, this application provides experimental data on which the proposed methodology is tested in Section 4.

    Section 3 first describes the matter of calculating distances between samples and then how the use of the Jaccard distance remains possible within the low-computational-time imperative, by estimating it using Locality Sensitive Hashing. A 1-Nearest Neighbor classifier is then proposed as a first step and its shortcomings listed, while Section 4 details the complete two-step methodology which addresses these issues, along with the experimental results.

    2. A Specific Application

    The goal of Anomaly Detection in the context of computer intrusion detection [4] is to identify abnormal behavior, defined as deviating from what is considered normal behavior, and to signal the anomaly in order to take appropriate measures: identification of the anomaly source, shutdown/closing of sensitive information or software...

    Most current anomaly detection systems rely on sets of heuristics or rules to identify this abnormality. Such rules and heuristics enable some flexibility in the detection of new anomalies, but still require action from the expert to tune the rules according to the new situation and the potential new anomalies identified. One ideal goal is then to have a global system capable of learning what constitutes normal and abnormal behavior and therefore able to reliably identify new anomalies [5, 6]. In such a context, the only human interaction required is the monitoring of the system, to ensure that the learning phase happened properly.

    A small part of the whole anomaly detection problem is studied in this paper, in the form of a binary classification problem for malware and clean samples. While the output of this problem is quite typical, the input is not. In order to compare files together and compute a similarity between them, a set of features is needed. F-Secure Corporation devised such a set of features [7], based partly on sandbox execution (a virtual environment for sample execution [8, 9]). This sandbox is capable of providing a wide variety of behavioral information (events), which as a whole can be divided into two main categories: hardware-specific or OS-specific. The hardware-specific information is related to the low-level, mostly CPU-specific, events occurring during execution of the application being analyzed in the virtual environment (up to the CPU instruction flow tracing). The other category mostly relates to the events caused by interaction of the application with the virtual OS (the sandbox). This category includes information such as general Thread/Process events (e.g. Start/Stop/Switching), API call events, specific events like Structured Exception Handling, system module loading, etc. Besides, the sandbox can provide (upon user request) some other information about application execution, like reaching pre-set breakpoints or detecting behavioral patterns which are not typical for traditional well-written benign applications (e.g., so-called anti-emulation and anti-debugging tricks), etc.

    The sandbox features used in the following research are thus the dynamic component of the collected features.

    Figure 1: Feature extraction from a file (sample): the sandbox runs the sample in a virtual environment and extracts dynamic (run-time specific) information; meanwhile a set of static features is extracted, and both sets are combined into the whole feature set.

    Dynamic features in this context refer to those gathered from the sandbox while an inspected application was executed in it. Some examples of those are which API calls were made and with what parameters, and various types of memory and code fingerprints. Static features refer to some of the features gathered from the executable binary itself without actually executing it. Some examples of those are which packer it was compressed with and various code and data fingerprints. There are 15 features from the static domain and as many from the dynamic domain, containing up to tens of thousands of values each. Each of these features can be present or absent for one sample (e.g. if the sample studied does not perform some classical operations in the sandbox, some features do not get activated). As such, the input data obtained per sample usually consists of tens of thousands of values for each feature number. The feature values are represented by CRC64 hashes.

    One of the major challenges is related to this data size: each sample having some tens of thousands (on average) of feature-value pairs (at most 30 features per sample, with thousands of values per feature for one sample), sample-to-sample comparisons are non-trivial computationally speaking. Also, due to the nature of the data, measuring similarities between files requires specific metrics that can be applied to nominal data (i.e. with no sense of order between values, as opposed to ordinal data). Indeed, since the actual feature values are encoded as hashes (and represent function strings and series of arguments, parameters...), classical measures used in Euclidean spaces do not apply. The Jaccard similarity enables such comparisons and is detailed in Section 3, with the computational challenges it poses.

                                 Actual
                        Malware                Clean
    Prediction  Malware  True Positive (TP)    False Positive (FP)
                Clean    False Negative (FN)   True Negative (TN)

    Table 1: Confusion Matrix for this binary classification problem.

    In addition to this specificity of the data, the requirements on the performance of the classifier are particular as well. As a security company, F-Secure Corporation needs to have very low false positives on any deployed anomaly detection system: if a clean file is labeled as malware (i.e. is a false positive), it is likely that several clients will see this same error deployed on their machines as well. This single mistake will potentially seriously hinder the work on all the affected machines, making the clients unhappy about the product and thus deactivating it or switching to a competing one. Therefore, while typical binary classification problems addressed by machine learning focus on optimizing the accuracy, one of the goals of the methodology presented in this paper is to lower the false positives to zero. To clarify notations, Table 1 summarizes the confusion matrix used in this paper.

    An additional practical constraint also makes this problem particular. Since the goal is the identification and classification of new malware samples, there is an imperative on the time it takes to reach a decision per sample: the faster an answer is provided, the quicker the information concerning a new sample can be deployed, possibly preventing infection at many other sites. As such, computational times need to be reduced as much as possible.

    3. Problem Description

    This section first describes the problem in terms of the nature of the data at hand, and a way to calculate distances between files using this very data. The matter of the computational requirements for such calculations is addressed by an approximation based on Min-Wise independent families of hash functions. The parameters of this approximation are then determined and its effects investigated.

    3.1. Data Specifics and Distance Calculation

    3.1.1. Data Specifics

    Distances in the traditional Euclidean sense are usually calculated for points whose coordinates locate them in the space. Having a data set consisting of multiple hashes, with different hashes representing incomparable properties or attributes, makes that data effectively categorical and does not allow distances to be calculated in a classical manner. The specifics and origin of the data set used in this paper are confidential, as the data is provided by F-Secure Corporation. Original values present in the data have been hashed using the CRC64 hash function, so as to obfuscate the original details.

    The data set is composed of a large amount of files (samples), each having the following structure:

    - 30 possible feature numbers (each representing a different class of information recorded about the sample);

    - for each of these feature numbers, a variable amount of hashes (from 0 to tens of thousands).

    The reason for this structure is that some feature numbers stand for a wide range of possible information: if one such feature number stands for the names of all the functions called in this sample, e.g., the number of values associated with it is bound to be large for some samples. It is important to note that the number of feature values per feature number can be very different from file to file.

    With this data structure, it is impossible to use traditional Machine Learning techniques, as most of them rely on the data points' position in the sample space (usually expected to be Euclidean). In this paper, distances between samples are calculated by using the Jaccard index [10, 11], as presented in the next subsection.

    3.1.2. Distance Calculation for Nominal Data

    One of the most classical similarity statistics for nominal data is the Jaccard index [10]. It enables the computation of the similarity between two sets of nominal attributes as the ratio between the cardinalities of their intersection and of their union. Denoting by A and B two sets of nominal attributes, the Jaccard index is defined as

    J(A, B) = |A ∩ B| / |A ∪ B|.   (1)

    This index intuitively gives a good sense of overlap (similarity) between the two sets: the more common attributes (hashes in this case) they have, the more statical and dynamical properties the corresponding files (each associated with one set) share, and thus the higher the chance that they are of the same class. In addition, considering the Jaccard distance J̄(A, B) = 1 - J(A, B) yields an actual metric, which enables the use of Machine Learning techniques directly.

    In the case of this paper, the files do not only have one set of attributes, but multiple, identified by their feature number. As such, let us redefine A = {A_i}_i, where A_i is the set of hashes associated with feature number i, with i ranging over the set of all feature numbers available for file A. Therefore, the Jaccard index needs to take into account all such feature numbers. A straightforward modification of the Jaccard index for this case is to define it as

    J(A, B) = (1/|C|) Σ_{i ∈ C} |A_i ∩ B_i| / ( |A_i| + |B_i| - |A_i ∩ B_i| ),   (2)

    where A_i and B_i are the sets of feature values for feature number i for files A and B respectively, and C is the intersection of the sets of feature numbers of files A and B.

    This way, only feature numbers present in both files are accounted for. In addition, expressing the index like this avoids computing the cardinality of the union, which saves some computational time, as the cardinalities of the sets A_i and B_i are known.
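    A direct transcription of Equation 2 might look as follows; the dictionary representation of a file (feature number mapped to a non-empty set of hashes) and the convention of returning 0 when no feature number is shared are assumptions of this sketch.

```python
def multi_feature_jaccard(A, B):
    """Eq. (2): average, over the feature numbers present in both files, of the
    per-feature Jaccard index |A_i ∩ B_i| / (|A_i| + |B_i| - |A_i ∩ B_i|).
    A and B map feature numbers to non-empty sets of CRC64 hash values."""
    common = set(A) & set(B)              # C: feature numbers present in both files
    if not common:
        return 0.0                        # no shared feature numbers (assumed convention)
    total = 0.0
    for i in common:
        inter = len(A[i] & B[i])
        total += inter / (len(A[i]) + len(B[i]) - inter)
    return total / len(common)
```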

    The computational time required for the multiple calculations of the Jaccard distance remains a problem, due to the intersection cardinality calculation. This problem is addressed in the following subsection by approximating the Jaccard distance.

    3.2. Speeding up the distance calculations

    The main drawback of the original Jaccard distance lies in the computational time required for its calculation. While the intersection of two sets (the upper part of the fraction in Eq. 2) is relatively fast (for example, the Python language implementation of it has an average complexity of O(min{|A_i|, |B_i|}) and a worst case of O(|A_i| · |B_i|) [12]), the intersection of such large sets repeated multiple times makes the total computational time intractable. As mentioned before, the sets A_i for one single feature number i can total some tens of thousands of elements.

  • As such, the direct Jaccard distance calculations using Eq. 2 cannot beused. The specific requirement for this problem of near real-time compu-tations raises the need for an fast approximation of the Jaccard distance.

3.2.1. Resemblance as an alternative to the Jaccard index

Consider a file named A, and denote by |A| the number of hashes in this file (to avoid heavy notation, it is assumed that only one feature number is present in the files; the following extends directly to the practical case of multiple feature numbers per file). Let $S(A, l)$ denote the set of all contiguous subsequences of length l of the hashes of A. Using these notations, one can define [3] the resemblance $r_l(A, B)$ of two files A and B based on their hashes as

r_l(A, B) = \frac{|S(A, l) \cap S(B, l)|}{|S(A, l) \cup S(B, l)|},   (3)

which is similar to the original definition of the Jaccard index. Defining the resemblance distance as

d_l(A, B) = 1 - r_l(A, B)   (4)

yields an actual metric [3, 2].
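As an illustration of Eq. 3, the following sketch builds $S(A, l)$ as the set of length-l shingles of a file's hash sequence and computes the exact resemblance; representing a file as a plain Python list of hashes is an assumption made only for this example.

    def shingles(hashes, l):
        """S(A, l): the set of all contiguous length-l subsequences (shingles)
        of a file's hash sequence, stored as tuples so that they are hashable."""
        return {tuple(hashes[i:i + l]) for i in range(len(hashes) - l + 1)}

    def resemblance(hashes_a, hashes_b, l):
        """Exact resemblance r_l(A, B) of Eq. 3; this is the quantity that the
        min-hash machinery below only approximates."""
        s_a, s_b = shingles(hashes_a, l), shingles(hashes_b, l)
        if not s_a and not s_b:
            return 1.0
        return len(s_a & s_b) / len(s_a | s_b)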

Let us fix the size l of the contiguous subsequences of hashes and denote by $\Omega_l$ the set of all such subsequences of length l. Let us assume that $\Omega_l$ is totally ordered and fix a number of elements n. For any subset $\omega_l \subseteq \Omega_l$, denote by $\mathrm{MIN}_n(\omega_l)$ the set of the smallest n elements (using the order on $\Omega_l$) of $\omega_l$, defined as

\mathrm{MIN}_n(\omega_l) = \begin{cases} \text{the set of the smallest } n \text{ elements of } \omega_l, & \text{if } |\omega_l| \geq n \\ \omega_l, & \text{otherwise.} \end{cases}   (5)

From [3], the following theorem gives an unbiased estimate of the resemblance $r_l(A, B)$.

Theorem 1. Let $\pi : \Omega_l \to \Omega_l$ be a permutation of $\Omega_l$ chosen uniformly at random and let $M(A) = \mathrm{MIN}_n(\pi(S(A, l)))$. Defining $M(B)$ similarly, the following is an unbiased estimate of $r_l(A, B)$:

r_l(A, B) = \frac{|\mathrm{MIN}_n(M(A) \cup M(B)) \cap M(A) \cap M(B)|}{|\mathrm{MIN}_n(M(A) \cup M(B))|}.

The proof can be found in [3]. As such, once a random permutation is chosen, it is possible to use only the set $M(A)$ (instead of the whole of A) for resemblance-based calculations.


3.2.2. Weak Universal Hashing and Min-Wise Independent Families

Note that while CRC64 cannot be considered a random hash function, the notion of weak universality for a family of hash functions, proposed in [13], makes it possible to further extend the former approximation to families of hash functions satisfying

\Pr(h(s_1) = h(s_2)) \leq \frac{1}{M},   (6)

with h a hash function chosen uniformly at random from the family $\mathcal{H}$ of functions $U \to \mathcal{M}$, $s_1$ and $s_2$ distinct elements of the origin space U of the hash functions in $\mathcal{H}$, and $M = |\mathcal{M}|$. More precisely, in [14], the definition of a min-wise independent family of functions is proposed in the spirit of the weak universality concept, and the authors show that for such families of functions the resemblance can be computed directly.

Define as min-wise independent a family $\mathcal{H}$ of functions such that for any set $X \subseteq \llbracket 1, N \rrbracket$ and any $x \in X$, when the function h is chosen at random in $\mathcal{H}$, we have

\Pr(\min\{h(X)\} = h(x)) = \frac{1}{|X|}.   (7)

That is, all elements of the set X must have the same probability of becoming the minimum element of the image of X under the function h. Assuming such a min-wise independent family $\mathcal{H}$, then

\Pr(\min\{h(S(A, l))\} = \min\{h(S(B, l))\}) = r_l(A, B),   (8)

for files A and B and a function h chosen uniformly at random from $\mathcal{H}$; it is therefore possible to compute the resemblance $r_l(A, B)$ of files A and B by computing the cardinality of the intersection

\{\min(h_1(S(A, l))), \ldots, \min(h_k(S(A, l)))\} \cap \{\min(h_1(S(B, l))), \ldots, \min(h_k(S(B, l)))\},   (9)

where $h_1, \ldots, h_k$ are a set of k independent random functions from $\mathcal{H}$. This way of calculating the resemblance of two files is sometimes called min-hash, and this name is used in the rest of this paper to denote this approach.

For computational and practical reasons, only one hash function (CRC64) is used in this paper, and the cardinality of the intersection in Equation 9 is approximated by the cardinality of

\min_k(h(S(A, l))) \cap \min_k(h(S(B, l))),   (10)

where the notation $\min_k(X)$ denotes the set of the k smallest elements of X (assuming X is totally ordered). While this is a crude approximation, experiments show that convergence with respect to k towards the true value of the resemblance is assured, as shown in the following subsection.
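A rough sketch of the single-hash approximation of Eq. 10: each file is reduced to the k smallest hash values of its shingles, and two files are compared through the intersection of these signatures. CRC64 is not available in the Python standard library, so zlib.crc32 is used here purely as a stand-in, and normalising the intersection cardinality by k is one plausible way of turning it into a resemblance estimate; neither choice is taken from the original implementation.

    import zlib

    def min_k_signature(hashes, l, k):
        """Hash every length-l shingle of a file with a single hash function and
        keep the k smallest values (zlib.crc32 stands in for CRC64 here)."""
        shingle_iter = (tuple(hashes[i:i + l]) for i in range(len(hashes) - l + 1))
        hashed = {zlib.crc32(repr(s).encode()) for s in shingle_iter}
        return set(sorted(hashed)[:k])

    def minhash_estimate(sig_a, sig_b, k):
        """Crude resemblance estimate based on Eq. 10: cardinality of the
        intersection of the two min-k signatures, normalised by k."""
        return len(sig_a & sig_b) / k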

3.2.3. Influence of the number of hashes on the proposed min-hash approximation

Figure 2 illustrates experimentally the validity of the proposed approximation of the Jaccard distance by the min-hash based resemblance. These plots use a small subset of 3000 samples from the whole dataset, used only for the purpose of validating the number of hashes k required for a proper approximation.

As can be seen, with low numbers of hashes, such as k = 10 or k = 100 (subfigures (a) and (b)), quantization effects appear in the estimation of the resemblance, and the estimation errors are large. These quantization problems are especially important with regard to the method using these distances, K-Nearest Neighbors, as presented in the next section: since distances are so strongly quantized, samples at different distances appear to be at the same distance and can thus wrongly be taken as nearest neighbors.

The quantization effects are lessened when k reaches the hundreds of hashes, as in subfigure (c), although the estimation errors remain large. With k = 2000 hashes, such errors are confined to the largest distances, which matter less for the following methodology. While k = 10000 hashes reduces these errors further (and even more so for larger values of k), the main reason for using the described min-hash approximation is to reduce the computational time drastically.

Figure 3 plots the average time required per sample for the determination of the distances to the whole reference set, as a function of the number of hashes k used for the min-hash. Thanks to the use of the Apache Cassandra backend (with three nodes) for these calculations¹, the computational time grows only linearly with the number of hashes (and also linearly with the number of samples in the reference set, although this is not depicted here). Unfortunately, for large values of k the computational time remains too high for the practical application of this methodology. Therefore,

¹ Details of the implementation are not given in this paper, but can be found in the publications and deliverables of the Finnish ICT SHOK Programme Future Internet: http://www.futureinternet.fi


Figure 2: Influence of the number of hashes k on the min-hash approximation of the resemblance r. Panels (a) to (f) use k = 10, 100, 500, 1000, 2000 and 10000 hashes respectively. The exact Jaccard distance is calculated using all of the available hashes for each sample.


Figure 3: Average time per sample (over 3000 samples) versus the number k of hashes used for the min-hash approximation. The horizontal axis is the number of hashes used (k), from 0 to 10000, and the vertical axis the average time per sample in seconds.

in the following, k = 2000 hashes is used for the min-hash approximation of the Jaccard distance, as a good compromise between computational time and approximation error.

    4. Methodology using two-stage classifiers

This section details the use of a two-stage decision strategy to avoid False Positives while retaining high coverage. The first-stage decision uses a 1-NN, which still yields too high a False Positive rate; this rate is lowered by using an optimized Extreme Learning Machine model, specialized either for False Positive or for False Negative minimization.

4.1. First Stage Decision using 1-NN

4.1.1. Using K-NN with min-hash Distances

The K-Nearest Neighbor [15] method for classification is one of the most natural to use in this setup, since it relies directly and only on distances. As mentioned in the previous subsection, for this classifier to perform well it requires the proper identification of the real nearest neighbors: the approximation made using the min-hash cannot be too crude.

Using k = 2000 hashes, a reference set is devised by F-Secure Corporation which contains samples considered representative of most current malware and clean samples. This set contains about 10000 samples (for each of which the k = 2000 minimum hashes have been extracted per feature number), balanced equally between clean and malware samples. The determination of this reference set is especially important, as it should not contain samples whose class is uncertain: only samples with the highest probability of being either malware or clean are present in the reference set.

Figure 4: 1-NN-ELM: two-stage methodology using first a 1-NN and then specialized ELM models to lower False Positives and False Negatives. Sandbox data (malware, clean and unknown samples) is fed to the nearest-neighbor search with the Jaccard distance (first stage decision) and then to the specialized ELM FP and ELM FN models (second stage decision), which output Malware, Clean or Unknown. The first stage uses only the class information $C_{1NN}$ of the nearest neighbor, while the second stage uses additional neighbor information: the distance $d_{1NN}$ to the nearest neighbor, the distance $d_{\neq NN}$ to the nearest neighbor of the opposite class, and the rank $R_{\neq NN}$ (i.e., which neighbor it is) of this opposite-class neighbor.

Once this reference set is fixed, samples can be compared against it using the min-hash based distances and a K-NN classifier.

Determining K for this problem is done using a validation set for which the certainty of the class of each sample is also very high. The validation set contains 3000 samples, checked against the reference set of 10000 samples. Figure 5 depicts the classification accuracy (average of True Positive and True Negative rates) versus the value of K used for the K-NN. Surprisingly, the decision based on the very first nearest neighbor is always the best in terms of classification accuracy. Therefore, in the following methodology, a 1-NN is used as the first-stage classifier.
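The kind of K selection experiment described above can be sketched as follows: each validation sample is assigned the majority class among its K nearest reference samples, and the average of the True Positive and True Negative rates is reported per K. The precomputed distance matrix and the +1 (malware) / -1 (clean) label convention are assumptions for the example.

    import numpy as np

    def knn_accuracy_per_k(D, ref_labels, val_labels, ks):
        """D[i, j] is the (approximate Jaccard) distance from validation sample i
        to reference sample j. Returns {K: (TPR + TNR) / 2} on the validation set."""
        order = np.argsort(D, axis=1)              # reference samples sorted by distance
        results = {}
        for k in ks:
            votes = ref_labels[order[:, :k]].sum(axis=1)
            pred = np.where(votes >= 0, 1, -1)     # majority vote, ties counted as malware
            tpr = np.mean(pred[val_labels == 1] == 1)
            tnr = np.mean(pred[val_labels == -1] == -1)
            results[k] = (tpr + tnr) / 2
        return results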

4.1.2. 1-NN is not sufficient

As mentioned earlier, one of the main imperatives in this paper is to achieve 0 False Positives (in absolute numbers). As Table 2 depicts, on a test set (totally separate from the validation sets used above) composed of 28510 samples for which the class is known with the highest confidence, the 1-NN approach still yields a large number of False Positives.


Figure 5: Classification accuracy versus the number K of nearest neighbors used. K = 1 is the best for this specific data regarding classification accuracy.

                         Actual
                    Malware    Clean
Prediction Malware    18160      183
           Clean        277     9890

Table 2: Confusion matrix for the sole 1-NN on the test set. If only the first stage of the methodology is used, the results are unacceptable in terms of False Positive rate.

Note that this test set is unbalanced, although not significantly.

The results of the 1-NN are not satisfactory with regard to the constraint on False Positives. An obvious way of addressing the number of False Positives directly is to set a maximum threshold on the distance to the first nearest neighbor: above this threshold, the sample is deemed too far from its nearest neighbor and no decision is taken.
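Before turning to its drawback, here is a minimal sketch of this rejection rule; the threshold value and the use of None for "no decision" are placeholders, not choices made in the paper.

    def thresholded_1nn(distances, ref_labels, max_distance):
        """Return the nearest neighbor's class (+1 malware, -1 clean) if that
        neighbor lies within max_distance, and None ('no decision') otherwise."""
        best = min(range(len(distances)), key=distances.__getitem__)
        if distances[best] > max_distance:
            return None
        return ref_labels[best]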

While this strategy would effectively reduce the number of False Positives, it would also significantly lower the number of True Positives, i.e. the coverage. For this reason, and to keep a high coverage, the following methodology using a second-stage classifier, the ELM, is proposed.

As can be seen from Figure 3, the computational time required to calculate the distances from a test sample to all 10000 reference set samples is about 35 seconds on average, using k = 2000 hashes. This is still acceptable from the practical point of view, but adding a second-stage classifier has the obvious drawback of increasing this time. In order to make this increase as small as possible, an Extreme Learning Machine model specialized for False Positives (and another for False Negatives) is used. Figure 4 illustrates the global idea of this two-stage methodology.

Figure 6: Illustration of different situations with identical 1-NN. In case (a), the density of reference samples of the same class around the test sample gives the decision high confidence; in case (b), while the 1-NN is of the same class as in (a), the confidence in the decision should be very different.

The motivation for an additional classifier comes from the fact that the class information from the 1-NN alone is not sufficient: the distance to that first neighbor is important as well, and so are the distance and the rank of the nearest neighbor of the opposite class. Figure 6 illustrates two different situations in which a test sample has its first nearest neighbor in the same class (note that the position of the samples has no meaning here, due to the nominal nature of the data; the distances are what matters). In the first case (a), the confidence in the decision must be high, as many of the neighbors of the test sample are near and of the same class. Case (b) is very different and calls for a much lower confidence in the decision taken, if any.

    A means of describing such situations is to account for:

1. The distance to the nearest neighbor, $d_{1NN}$: if the nearest neighbor is far, it is likely that the test sample lies in a part of the original space where the density of reference samples is insufficient;

2. The distance to the nearest neighbor of the opposite class, $d_{\neq NN}$: if $d_{1NN}$ is very similar to $d_{\neq NN}$, the test sample lies in a part of the space where reference samples of both classes are present at similar distances;

3. The rank of this neighbor of the opposite class, $R_{\neq NN}$ (is it the 3rd or the 100th neighbor?): this information gives a rough sense of the density of reference samples of the same class as the nearest neighbor around the test sample.

The combination of these three additional pieces of information roughly describes the situation in which the test sample lies. This is the information fed to the second-stage classifier for the final decision.
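A small sketch of how these quantities can be read off the distances already computed for the first stage; the function and variable names are ours, not the paper's.

    def second_stage_features(distances, labels):
        """Given the distances from one test sample to every reference sample and
        the corresponding labels (+1 malware, -1 clean), return the 1-NN class
        together with the three second-stage inputs: d_1NN, the distance to the
        nearest neighbor of the opposite class, and the rank of that neighbor."""
        order = sorted(range(len(distances)), key=distances.__getitem__)
        c_1nn = labels[order[0]]
        d_1nn = distances[order[0]]
        for rank, idx in enumerate(order, start=1):
            if labels[idx] != c_1nn:        # first neighbor of the opposite class
                return c_1nn, d_1nn, distances[idx], rank
        return c_1nn, d_1nn, float("inf"), len(order) + 1   # no opposite-class sample found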

4.2. Second Stage Decision using modified ELM

4.2.1. Original ELM

The Extreme Learning Machine (ELM) algorithm was originally proposed by Guang-Bin Huang et al. in [16, 17, 18, 19] and uses the Single-Layer Feedforward Neural Network (SLFN) structure. The main concept behind the ELM lies in the random initialization of the SLFN weights and biases. Then, under certain conditions, the synaptic input weights and biases do not need to be adjusted (classically done through iterative updates such as back-propagation), and it is possible to compute the hidden layer output matrix and hence the output weights directly. The complete network structure (weights and biases) is thus obtained in very few steps and at very low computational cost (compared to iterative methods for determining the weights).

Consider a set of M distinct samples $(\mathbf{x}_i, \mathbf{y}_i)$ with $\mathbf{x}_i \in \mathbb{R}^{d_1}$ and $\mathbf{y}_i \in \mathbb{R}^{d_2}$; then, an SLFN with N hidden neurons is modeled as the following sum

\sum_{i=1}^{N} \boldsymbol{\beta}_i \, \phi(\mathbf{w}_i \mathbf{x}_j + b_i), \quad j \in \llbracket 1, M \rrbracket,   (11)

with $\phi$ the activation function, $\mathbf{w}_i$ the input weights, $b_i$ the biases and $\boldsymbol{\beta}_i$ the output weights.

In the case where the SLFN perfectly approximates the data, the errors between the estimated outputs $\hat{\mathbf{y}}_i$ and the actual outputs $\mathbf{y}_i$ are zero, and the relation between inputs, weights and outputs is then

\sum_{i=1}^{N} \boldsymbol{\beta}_i \, \phi(\mathbf{w}_i \mathbf{x}_j + b_i) = \mathbf{y}_j, \quad j \in \llbracket 1, M \rrbracket,   (12)

which can be written compactly as $\mathbf{H}\boldsymbol{\beta} = \mathbf{Y}$, with


\mathbf{H} = \begin{pmatrix} \phi(\mathbf{w}_1\mathbf{x}_1 + b_1) & \cdots & \phi(\mathbf{w}_N\mathbf{x}_1 + b_N) \\ \vdots & \ddots & \vdots \\ \phi(\mathbf{w}_1\mathbf{x}_M + b_1) & \cdots & \phi(\mathbf{w}_N\mathbf{x}_M + b_N) \end{pmatrix},   (13)

with $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^T \ldots \boldsymbol{\beta}_N^T)^T$ and $\mathbf{Y} = (\mathbf{y}_1^T \ldots \mathbf{y}_M^T)^T$. Solving for the output weights $\boldsymbol{\beta}$ from the hidden layer output matrix $\mathbf{H}$ and the target values is achieved through the use of a Moore-Penrose generalized inverse of the matrix $\mathbf{H}$, denoted $\mathbf{H}^\dagger$ [20].

Theoretical proofs and a more thorough presentation of the ELM algorithm are detailed in the original paper [16]. In Huang et al.'s later work it has been proved that the ELM is able to perform universal function approximation [19].
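A minimal NumPy sketch of the original ELM training just described; the tanh activation, the hidden layer size and all names are illustrative choices and not taken from the paper.

    import numpy as np

    def train_elm(X, Y, n_hidden, seed=0):
        """Basic ELM: draw random input weights and biases, build the hidden layer
        output matrix H, and solve H beta = Y with the Moore-Penrose pseudo-inverse."""
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((n_hidden, X.shape[1]))   # random input weights w_i
        b = rng.standard_normal(n_hidden)                 # random biases b_i
        H = np.tanh(X @ W.T + b)                          # hidden layer output matrix (Eq. 13)
        beta = np.linalg.pinv(H) @ Y                      # output weights beta = H^+ Y
        return W, b, beta

    def predict_elm(X, W, b, beta):
        """Evaluate the trained single-layer feedforward network on new samples."""
        return np.tanh(X @ W.T + b) @ beta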

4.2.2. False Positive/Negative Optimized ELM

As depicted in Figure 6 and mentioned above, the class of the nearest neighbor alone is not sufficient to obtain 0 False Positives. The proposed second-stage classifier uses modified ELM models to lower the number of False Positives (one of the two modified ELM models reduces False Negatives as well; only the False Positive minimizing one is described in the following).

The modified ELM model used in the second stage of the methodology is specially optimized to minimize the False Positives (a similar model to minimize the False Negatives is used as well, in the same fashion). It uses additional information gathered while searching for the nearest neighbor (so no additional computational time is required to obtain the training data): the distance to the nearest neighbor $d_{1NN}$, the distance to the nearest neighbor of the opposite class $d_{\neq NN}$, and the rank $R_{\neq NN}$ of this opposite-class neighbor. With this input data, the False Positive Optimized ELM is trained using a weighted classification accuracy criterion.

While for binary classification problems the classification rate Acc, defined as the average of the True Positive Rate TPR and the True Negative Rate TNR,

\mathrm{Acc} = \frac{\mathrm{TNR} + \mathrm{TPR}}{2},   (14)

is typically used as a performance measure, the proposed modified ELM uses the following weighted accuracy Acc(α)


\mathrm{Acc}(\alpha) = \frac{\alpha\,\mathrm{TNR} + \mathrm{TPR}}{1 + \alpha}.   (15)

By changing the weight α, it becomes possible to give precedence to the True Negative Rate and thus to avoid False Positives. The output of the proposed False Positive Optimized ELM is calculated using Leave-One-Out (LOO) PRESS (PREdiction Sum of Squares) statistics, which provide a direct and exact formula for the calculation of the LOO error $\epsilon_{\mathrm{PRESS}}$ for linear models. See [21] and [22] for details of this formula and its implementations:

\epsilon_{\mathrm{PRESS}} = \frac{y_i - \mathbf{h}_i \boldsymbol{\beta}}{1 - \mathbf{h}_i \mathbf{P} \mathbf{h}_i^T},   (16)

where $\mathbf{P}$ is defined as $\mathbf{P} = (\mathbf{H}^T\mathbf{H})^{-1}$, $\mathbf{H}$ is the hidden layer output matrix of the ELM, $\mathbf{h}_i$ its i-th row, and $\boldsymbol{\beta}$ are the output weights of the ELM.
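A small NumPy sketch of the LOO PRESS residuals of Eq. 16, under the assumption that $\mathbf{H}^T\mathbf{H}$ is invertible; names are illustrative.

    import numpy as np

    def press_residuals(H, y):
        """Leave-one-out PRESS residuals of the linear model H @ beta ~ y (Eq. 16):
        eps_i = (y_i - h_i beta) / (1 - h_i P h_i^T), with P = (H^T H)^-1."""
        P = np.linalg.inv(H.T @ H)                       # assumes H^T H is well conditioned
        beta = P @ H.T @ y                               # least-squares output weights
        leverage = np.einsum("ij,jk,ik->i", H, P, H)     # diagonal of H P H^T
        return (y - H @ beta) / (1.0 - leverage)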

In order to obtain a parsimonious model in the shortest possible time, the proposed modified ELM uses the idea of the TROP-ELM [23] and OP-ELM [24, 25, 26, 27, 28] to prune neurons from an initially large ELM model [29]. In addition, for computational time considerations, the maximum number M of selected neurons desired for the final model is taken as a parameter. Overall, the False Positive Optimized ELM used in this paper follows the steps of Algorithm 1.

Algorithm 1: False Positive Optimized ELM.
Given a training set $(\mathbf{x}_i, y_i)$, $\mathbf{x}_i \in \mathbb{R}^3$, $y_i \in \{-1, 1\}$, an activation function $\phi : \mathbb{R} \to \mathbb{R}$, a large number of hidden nodes N and the maximum number $M \leq N$ of neurons to retain for the final model:
- Randomly assign input weights $\mathbf{w}_i$ and biases $b_i$, $i \in \llbracket 1, N \rrbracket$;
- Calculate the hidden layer output matrix $\mathbf{H}$ as in Equation 13;
- for i = 1 to M do
  - Perform forward selection of the i best neurons (among N) using the PRESS LOO output with the Acc(α) criterion, and ELM determination of the corresponding output weights $\boldsymbol{\beta}$;
- end for
- Retain the best combination out of the M different selections as the final model structure.
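The sketch below gives a simplified reading of Algorithm 1: neurons (columns of the hidden layer output matrix) are added greedily, and each candidate subset is scored by the Acc(α) criterion of Eq. 15 computed on leave-one-out PRESS outputs. It is not the exact implementation; in particular, the acc_alpha helper and the greedy loop are our own simplifications of the forward selection step.

    import numpy as np

    def acc_alpha(y_true, y_pred, alpha):
        """Weighted accuracy of Eq. 15 from hard +1/-1 predictions."""
        pos, neg = y_true == 1, y_true == -1
        tpr = np.mean(y_pred[pos] == 1) if pos.any() else 0.0
        tnr = np.mean(y_pred[neg] == -1) if neg.any() else 0.0
        return (alpha * tnr + tpr) / (1.0 + alpha)

    def greedy_neuron_selection(H, y, alpha, M):
        """Greedily add the neuron (column of H) that maximizes Acc(alpha) computed
        on leave-one-out PRESS outputs, up to M neurons; return the best subset."""
        selected, best = [], (-1.0, [])
        remaining = list(range(H.shape[1]))
        for _ in range(M):
            scored = []
            for j in remaining:
                Hs = H[:, selected + [j]]
                P = np.linalg.pinv(Hs.T @ Hs)
                beta = P @ Hs.T @ y
                leverage = np.einsum("ij,jk,ik->i", Hs, P, Hs)
                y_loo = y - (y - Hs @ beta) / (1.0 - leverage)    # LOO predictions
                scored.append((acc_alpha(y, np.sign(y_loo), alpha), j))
            score, j = max(scored)
            selected.append(j)
            remaining.remove(j)
            if score > best[0]:
                best = (score, list(selected))
        return best[1]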

The selection of the optimal α is done experimentally, following the two constraints of 0 False Positives and the highest possible coverage (i.e. as many True Positives as possible). Figure 7 shows the Receiver Operating Characteristic curve for various values of α, plotted for a balanced validation set of 3000 samples. As can be seen, the requirement of absolutely 0 False Positives has a strong influence on the coverage (represented here by the True Positive rate). If one allows as little as 0.06% False Positives, the coverage already reaches 92%.

Figure 7: ROC curve (True Positive Rate versus False Positive Rate) for varying values of α.

Figure 8 plots the False Positive rate against the value of α, using the same validation data as Figure 7. The value of α for which the 0 False Positives requirement is met while keeping the highest possible coverage is α = 30, from Figure 8.

Figure 8: Evolution of the False Positive Rate as a function of the weight α. The first attained 0 False Positive Rate is reached for α = 30.

4.3. Final Results on Test Data

With the parameters of the two-stage methodology determined as above, i.e.:

- k = 2000 hashes used for the min-hash approximation of the Jaccard distance;
- K = 1 for the K-NN first-stage classifier;
- α = 30 for the False Positive Optimized ELM second-stage classifier,

the presented methodology is applied to a test set of 28510 samples spanning from early 2008 until late 2011. The reference set of 10000 samples mentioned before is within the same time frame and balanced between malware and clean so as to reflect the real proportions, i.e. those of the samples received by F-Secure Corporation: roughly 2/3 malware and 1/3 clean.

Table 3 gives the previous results of the sole 1-NN, to be compared against those of the 1-NN and False Positive Optimized ELM methodology.

It can be seen that the False Positive rate achieved on the test set is in line with the Leave-One-Out results in (a).

The results depicted in Table 3 (c) use not only a False Positive Optimized ELM but also a False Negative Optimized ELM to reduce the False Negatives, as shown in Figure 4. The reduction of the False Positives and the coverage achieved are satisfactory for this test set.

A value of 2 False Positives on this test set is probably acceptable in practice. If the strict goal of 0 False Positives in test is to be enforced, then one possibility is to increase the parameter α to a higher, more conservative value. This has the effect of further lowering the coverage, though.

                          Actual
                     Malware    Clean
Prediction  Malware     1930        1
            Clean          1      908
            Unknown     2473     1623

(a) Confusion matrix for the two-stage classifier methodology on the training data (Leave-One-Out results).

                          Actual
                     Malware    Clean
Prediction  Malware    18160      183
            Clean        277     9890

(b) Confusion matrix for the sole 1-NN on the test set.

                          Actual
                     Malware    Clean
Prediction  Malware     8393        2
            Clean          7     4115
            Unknown    10037     5956

(c) Confusion matrix for the two-stage classifier methodology on the test set.

Table 3: Confusion matrices for (a) the training data (Leave-One-Out results) when training the False Positive/Negative Optimized ELMs, and for the whole test set (b) using only the 1-NN approach and (c) using the proposed 1-NN and ELM two-stage methodology. The reduction in coverage from the second-stage ELM is noticeable, as False Positives and False Negatives are decreased significantly.

Note on hardware and computational time considerations. While the details of the implementation are not given in this paper, the proposed methodology uses a set of three computers, each equipped with 8 GB of RAM and an Intel Core2 Quad CPU. Apache Cassandra is the distributed database framework used for performing efficient min-hash computations in batches, and a memory-held queueing system (based on memcached) holds jobs for execution against the Cassandra database. All additional computations are performed using Python code on one of the three computers mentioned.

With this setup, as seen in Figure 3, the average per-sample evaluation time (i.e. calculating the pairwise distances to the 10000 reference samples and finding the closest elements) is about 35 seconds. The choice of Cassandra as a database backend ensures that the computational time grows only linearly if the precision of the min-hash or the number of reference samples is increased: growing the number of reference samples or the number k of hashes used for the min-hash approximation linearly only requires a linear growth in the number of Cassandra nodes for the computational time to remain identical.

    5. Conclusions

This paper proposes a practical, case-oriented methodology for a binary classification problem in the domain of Anomaly Detection. The problem at hand lies in the classification of files (samples) as either malware or clean, based on specific sets of nominal attributes, thus requiring purely distance-based Machine Learning techniques. The practical requirements for this binary classification problem are somewhat unusual, as no False Positives can be tolerated, while as many files as possible should be classified in the minimum computational time. False Negatives are not as important in this context.

In order to perform file-to-file comparisons, a distance measure known as the Jaccard distance is adapted to this problem setup, and a fast approximation of it, the Min-Hash approximation, is proposed. The Min-Hash approach makes it possible to obtain an estimate of the Jaccard distance using only a restricted part of each file's sets of attributes, thus lowering the computational time significantly. This approximation is shown experimentally to converge to the true Jaccard distance, given enough hashes.

A two-stage decision process using two different types of classifiers provides a fast decision while keeping the False Positive rate low: a 1-NN model using the estimated Jaccard distance provides an initial decision on the test sample at hand. Following in the second stage is a False Positive


Optimized ELM (a False Negative Optimized ELM is used as well, to reduce False Negatives), which makes it possible to reduce the False Positives drastically, from 183 to 2 in test, at the cost of a lower coverage. Another advantage of the ELM-based second classifier is its very low computational time, which allows this second-stage decision to be made at almost no additional cost.

Overall, the methodology proves to be efficient for this specific problem and has the advantage of having only two parameters that require tuning: the number of hashes used for the Min-Hash approximation (the more hashes used, the closer the approximation is to the real Jaccard distance), and the coefficient α weighting the False Positives in the modified ELM criterion (the value of this coefficient directly controls the tradeoff between False Positive rate and coverage).

The parameters devised experimentally for the specific reference set make it possible to reach only 2 False Positives in test, with a coverage of 44% on the malware files. This methodology is currently being tested at F-Secure Corporation on different data sets (reference and test) for further validation.

    References

[1] S. Lele, J. T. Richtsmeier, Euclidean distance matrix analysis: a coordinate-free approach for comparing biological shapes using landmark data, American Journal of Physical Anthropology 86 (3) (1991) 415-427.

[2] A. Z. Broder, S. C. Glassman, M. S. Manasse, G. Zweig, Syntactic clustering of the Web, Computer Networks and ISDN Systems 29 (8-13) (1997) 1157-1166.

[3] A. Z. Broder, On the resemblance and containment of documents, in: Compression and Complexity of SEQUENCES 1997, IEEE Computer Society, 1997, pp. 21-29.

[4] Y. Robiah, S. S. Rahayu, M. M. Zaki, S. Shahrin, M. A. Faizal, R. Marliza, A new generic taxonomy on hybrid malware detection technique, arXiv.org cs.CR.

[5] A. Srivastava, J. Giffin, Automatic discovery of parasitic malware, in: S. Jha, R. Sommer, C. Kreibich (Eds.), Recent Advances in Intrusion Detection (RAID'10), Springer Berlin / Heidelberg, 2010, pp. 97-117.

[6] M. Bailey, J. Andersen, Z. Morley Mao, F. Jahanian, Automated classification and analysis of internet malware, in: Recent Advances in Intrusion Detection (RAID'07), 2007.

[7] F-Secure Corporation, F-Secure DeepGuard: a proactive response to the evolving threat scenario (Nov. 2006).

[8] C. Willems, T. Holz, F. Freiling, Toward automated dynamic malware analysis using CWSandbox, IEEE Security and Privacy 5 (2007) 32-39.

[9] K. Yoshioka, Y. Hosobuchi, T. Orii, T. Matsumoto, Vulnerability in public malware sandbox analysis systems, in: Proceedings of the 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet, IEEE Computer Society, Washington, DC, USA, 2010, pp. 265-268.

[10] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et du Jura, Bulletin de la Société Vaudoise des Sciences Naturelles 37 (1901) 547-579.

[11] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, 1st Edition, Addison Wesley, 2005.

[12] Python, Python algorithms complexity, http://wiki.python.org/moin/TimeComplexity#set (December 2010).

[13] J. L. Carter, M. N. Wegman, Universal classes of hash functions, Journal of Computer and System Sciences 18 (2) (1979) 143-154.

[14] A. Z. Broder, M. Charikar, A. M. Frieze, M. Mitzenmacher, Min-wise independent permutations, Journal of Computer and System Sciences 60 (1998) 327-336.

[15] T. M. Cover, P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1) (1967) 21-27.

[16] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489-501.

[17] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (2) (2012) 513-529.

[18] G.-B. Huang, Q.-Y. Zhu, K. Z. Mao, C.-K. Siew, P. Saratchandran, N. Sundararajan, Can threshold networks be trained directly?, IEEE Transactions on Circuits and Systems II: Express Briefs 53 (3) (2006) 187-191.

[19] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Transactions on Neural Networks 17 (4) (2006) 879-892.

[20] C. R. Rao, S. K. Mitra, Generalized Inverse of Matrices and Its Applications, John Wiley & Sons Inc, 1971.

[21] R. Myers, Classical and Modern Regression with Applications, 2nd Edition, Duxbury, Pacific Grove, CA, USA, 1990.

[22] G. Bontempi, M. Birattari, H. Bersini, Recursive lazy learning for modeling and control, in: European Conference on Machine Learning, 1998, pp. 292-303.

[23] Y. Miche, M. van Heeswijk, P. Bas, O. Simula, A. Lendasse, TROP-ELM: a double-regularized ELM using LARS and Tikhonov regularization, Neurocomputing 74 (16) (2011) 2413-2421. doi:10.1016/j.neucom.2010.12.042.

[24] E. Group, The OP-ELM toolbox, available online at http://www.cis.hut.fi/projects/eiml/research/downloads/op-elm-toolbox (2009).

[25] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Transactions on Neural Networks 21 (1) (2010) 158-162. doi:10.1109/TNN.2009.2036259.

[26] Y. Miche, P. Bas, C. Jutten, O. Simula, A. Lendasse, A methodology for building regression models using extreme learning machine: OP-ELM, in: M. Verleysen (Ed.), ESANN 2008, European Symposium on Artificia