ieee transactions on industrial informatics,...

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2822680, IEEETransactions on Industrial Informatics

IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, MANUSCRIPT 1

Detection of Malicious Code Variants Based onDeep Learning

Zhihua Cui, Fei Xue, Xingjuan Cai, Yang Cao, Gai-ge Wang, Member, IEEE,and Jinjun Chen, Senior Member, IEEE

Abstract—With the development of the Internet, maliciouscode attacks have increased exponentially, with malicious codevariants ranking as a key threat to Internet security. The ability todetect variants of malicious code is critical for protection againstsecurity breaches, data theft, and other dangers. Current methodsfor recognizing malicious code have demonstrated poor detectionaccuracy and low detection speeds. This paper proposed a novelmethod that used deep learning to improve the detection ofmalware variants. In prior research, deep learning demonstratedexcellent performance in image recognition. To implement ourproposed detection method, we converted the malicious code intograyscale images. Then the images were identified and classifiedusing a convolutional neural network (CNN) that could extractthe features of the malware images automatically. In addition,we utilized a bat algorithm to address the data imbalance amongdifferent malware families. To test our approach, we conducteda series of experiments on malware image data from VisionResearch Lab. The experimental results demonstrated that ourmodel achieved good accuracy and speed as compared with othermalware detection models.

Index Terms—Malware variants, Grayscale image, Deep learn-ing, Convolution neural network, Bat algorithm

I. INTRODUCTION

W ith the rapid development of information technology,the exponential growth of malicious code has become

one of the main threats to Internet security. A recent reportfrom Symantec showed that 401 million malicious codes werefound in 2016, including 357 million new malicious codevariants [22]. The appearance of malware in mobile devices

Manuscript received September 29, 2017; revised January 9, 2017; acceptedMarch 28, 2018. Date of publication xxx, 2018; date of current versionxxx, 2018. This work is supported in part by the National Natural ScienceFoundation of China under Nos. 61003053 and 61403271, and is alsosupported by Natural Science Foundation of Shanxi Province under GrantNo. 201601D011045. Paper no. TII-18-xxx. (Corresponding Author: XingjuanCai.)

Z. Cui, X. Cai are with the Complex System and Computational IntelligenceLaboratory, TaiYuan University of Science and Technology, Taiyuan, Shanxi,China, 030024. (E-mail:[email protected]; [email protected])

F. Xue is with the School of Information, Beijing Wuzi University, Beijing,China, 101149. (E-mail: [email protected])

Y. Cao is with the Faculty of Information Technology, Beijing Universityof Technology, Beijing, China, 100124. (E-mail: [email protected])

G.-G. Wang is with the Department of Computer Science and Technol-ogy, Ocean University of China, Qingdao, China, 266100. (e-mail: [email protected])

J. Chen is with the Complex System and Computational Intelligence Labo-ratory, Taiyuan University of Science and Technology, Taiyuan, Shanxi, China,030024; and also with Swinburne Data Science Research Institute, SwinburneUniversity of Technology, Australia. (e-mail: [email protected])

Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.

Digital Object Identifier 10.xxx

File Sample Set

Windows API CallsOpcode N-grams

StringsControl Flow Graphs

Feature Selection

Classification/Clustering Model

Feature Extraction

Information GainTF-IDF

Document frequencyFP-growth

Decision TreeNative BayesSVM、KNN

File Set

M B

Fig. 1. Malicious code detection based on data mining

and the internet of things (IoT) also grew rapidly. To date,68 new malicious code families and more than 10 thousandmalicious codes have been reported. This growth has poseda challenge for malicious code detection in cloud computing[14], [15].

As a key part of security protection, uncovering maliciouscode variants is particularly challenging. Malware detectionmethods consist primarily of two types of approaches: staticdetection and dynamic detection. Static detection works bydisassembling the malware code and analyzing its executionlogic. Dynamic detection analyzes the behavior of maliciouscode by executing the code in a safe virtual environment orsandbox.

Both static and dynamic detection are feature-based detec-tion methods. First, the textual or behavioral features of themalicious code are extracted, and then the malicious code isdetected or classified by analyzing these extracted features. Inrecent years, several scholars have used data mining methodsto analyze the features of malicious code [28]. This approachhas become the mainstay of malware detection because itis highly efficient and has a low rate of false positivescompared with traditional heuristic-based detection methods.Fig. 1 illustrates the process of detecting malicious code usingdata mining.

Unfortunately, methods based on feature analysis are oftendisrupted. The effectiveness of static feature analysis can behampered by the obfuscation techniques that transform themalware binary into a self-compressed or uniquely structuredbinary. Dynamic feature analysis is often challenged by manykinds of countermeasures developed to produce unreliable




results. Furthermore, some types of malicious code may beignored by dynamic analysis because the execution environ-ment does not comply with the rules.

Recently, rather than focusing on non-visible features formalware classification, Nataraj et al. [19] proposed malwarevisualization, a new approach based on image processingtechniques. This work transformed the structure of packedbinary samples into two-dimensional grayscale images. Thenthe image features were used for classification. Moreover, twodifferent feature design strategies are studied comparativelyfor malware analysis [20]. The results illustrated that thebinary texture-based strategy provided performance roughlyequivalent to the dynamic analysis techniques, but in less time.

Challenges. Malware detection methods rely mainly onanalysis of the features of malicious codes (e.g., static features,dynamic features). More powerful detection methods based onvarious machine learning techniques also use these features touncover malicious codes or their variations. However, theseapproaches become less effective when detecting maliciouscode variants or unknown malware. The malware visualizationmethod can handle code obfuscation problems, but it suffersfrom the high time cost needed for complex image texturefeature extraction (e.g., GIST, GLCM). Moreover, these fea-ture extraction methods also demonstrate low-efficiency whenexposed to large datasets. The challenge for building malwaredetection models is to find a means for extracting featureseffectively and automatically.

Moreover, the data imbalance problem imposes anotherchallenge. Of the large quantity of malware generated eachyear, a substantial portion includes variants that belong toexisting malicious code families or groups. Usually, the num-ber of malicious code variations differs greatly among variouscode families. The challenge is to build a universal detectionmodel that can deal with the huge volume of variations, sothat it can work well across malware families.

Contributions. To address the above challenges, this paperoffered the following contributions:

• We introduced a technique for converting a malware bi-nary to an image, thereby transforming malware detectioninto an image classification problem.

• We proposed a novel method for detecting malwarevariants based on a convolutional neural network (CNN).

• To resolve the data imbalance problem among differentmalware families, we designed an effective data equilib-rium approach based on the bat algorithm.

• Extensive experimental results demonstrated that our pro-posed method was an effective and efficient approach formalware detection.

The remainder of this paper is structured as follows. SectionII reviews related work, then Section III details our approachto malicious code detection based on a convolutional neuralnetwork. Section IV presents our data equalization methodbased on a bat algorithm and data augmentation. Experimentalevaluations of the proposed methods are given in Section V.Finally, Section VI concludes this paper.

II. RELATED WORK

In this section, we present the related research regardingmalware, including malware detection based on feature analy-sis, malicious code visualization, image processing techniquesfor malware detection, and malware detection based on deeplearning.

A. Malware Detection based on Feature Analysis

As described previously in this paper, there are two maincategories of feature analysis techniques for malware detec-tion: static analysis and dynamic analysis. With respect tostatic analysis, several methods have been put forward byanalyzing the codes. For example, Isohara et al. [12] devel-oped a detection system using kernel behavior analysis thatperformed well detecting the malicious behaviors of unknownapplications. However, the static method is easily deceived byobfuscation techniques [18].

To solve this problem, Christodorescu et al. [4] proposed anovel malware-detection algorithm that used trace semanticsto characterize the behavior of malware. This method provedeffective in combating the obfuscation of instructions (e.g., re-arrangement of instructions, insertion of garbage code, registerredistribution). However, the approach was limited to featureextraction and analysis at the instruction level. Moreover, thepattern-matching was complex.

Dynamic analysis monitors and analyzes the runtime char-acteristics of applications (apps) based on assessment of be-haviors, such as accessing private data and using restricted APIcalls. Given this information, a behavior model is establishedto detect malicious code. Such techniques achieved improveddetection performance, but they were still challenged by thevariety of countermeasures developed to generate unreliableresults [20]. In addition, dynamic analysis is time-consumingbecause of the large amount of computational overhead, lead-ing to low efficiency when exposed to a large dataset.

B. Malicious Code Visualization

Currently, many tools can visualize and manipulate binarydata, e.g., common text editors and binary editors. Several s-tudies have proposed utilization of malware visualization [24].Yoo et al. [29] employed self-organizing maps to visualizecomputer viruses. A similar but richer approach was proposedby Trinius et al. [23] , who used two visualization techniques—tree maps and thread graphs—to detect and classify malware.Rather than a single detection result, Goodall et al. [10]aggregated the results of different malware analysis tools into avisual environment, providing an increase in the vulnerabilityof detection coverage of single malware tools.

The above studies focused mainly on the visualization ofmalware behavior, but the software source code may implymore meaningful patterns. As previously mentioned, Natarajet al. [19] presented a new visualization approach for malwaredetection based on binary texture analysis. First, they convert-ed the malware executable file into grayscale images. Thenthey identified malware according to the texture features ofthese images. Compared with the dynamic analysis method,




their approach produced equivalent results [20]. In similarwork, Han et al. [11] transformed malware binary informationinto color image matrices, and classified malware families byusing an image processing method.

C. Image Processing Techniques for Malware Detection

Once the malware has been visualized as grayscale images,malware detection can be converted into an image recognitionproblem. In [19], Nataraj et al. used a GIST algorithm toextract the features of malware images. However, the GISTalgorithm was time-consuming. Recently, more powerful im-age processing techniques have been proposed.

Daniel et al. [8] developed a bio-inspired parallel implemen-tation for finding the representative geometrical objects of thehomology groups in a binary 2D image. For image fusion,Miao et al. [17] proposed an image fusion algorithm basedon shearlet and genetic algorithm. In their approach, a geneticalgorithm is employed to optimize the weighted factors in thefusion rule. Experimental results demonstrated their methodcould acquire better fusion quality than other methods.

These traditional approaches, however, were challengedby the high time cost required for complex image texturefeature extraction. To address this challenge, we employeddeep learning to identify and classify images efficiently. In thenext subsection, we present our research on malware detectionbased on deep learning.

D. Malware Detection based on Deep Learning

Deep learning [21] is an area of machine learning researchthat has emerged in recent years from the work on artificialneural networks. Neural networks can approximate complexfunctions by learning the deep nonlinear network structure tosolve complex problems. Deep learning, which is more power-ful than back propagation (BP), uses a deep neural network tosimulate the human brain’s learning processes. Deep learninghas the ability to learn the essential characteristics of data setsfrom a sample set. As a powerful tool of artificial intelligence,deep learning has been applied widely in many fields, such asrecognition of handwritten numerals, speech recognition, andimage recognition.

Because of it powerful ability to learn features, manyscholars have applied deep learning to malware detection.Using deep learning techniques, Yuan et al. [30] designed andimplemented an online malware detection prototype system,named Droid-Sec. Their model achieved high accuracy bylearning the features extracted from both static analysis anddynamic analysis of Android apps. David et al. [7] presenteda similar but more compelling method that did not need thetype of malware behavior. Their work was based on a deepbelief network (DBN) for automatic malware signature gener-ation and classification. Compared with conventional signaturemethods for malware detection, their approach demonstratedincreased accuracy for detecting new malware variants.

Unfortunately, these methods remained based on the anal-ysis of features extracted by static analysis and dynamicanalysis. Therefore, to a greater or lesser extent, they continuedto be subject to the limitations of feature extraction. To address

011010011101101...Malware

Binary bit string

Binary to 8 bit vector,vector to gray scale image

CNN

Malware image

Agent.FYI family

Swizzor.gen!I family

Dontovo.A family

Fig. 2. Overview of the proposed method

this problem, we employed a CNN network to learn themalware image features and classify them automatically.

III. MALWARE DETECTION BASED ON A CONVOLUTIONALNEURAL NETWORK

This section presents our improved malicious code variantdetection method based on a CNN, which included: (1) themalicious code mapping as the grayscale image; and (2) theCNN design for grayscale image detection. Fig. 2 presentsan overview of these two processes. First, the binary filesof malicious code are transformed into the grayscale images.Next, the convolution neural network is employed to identifyand classify the images. According to the results of imageclassification, we realized the automatic recognition and clas-sification of malicious software.

A. Binary Malware to Gray Image

In general, there are several ways to transform binarycode into images. In this paper, we used the visualiza-tion of executable malware binary files [19]. A malwarebinary bit string can be split into a number of substringsthat are 8 bits in length. Each of these substrings canbe seen as a pixel, because the 8 bits can be interpret-ed as an unsigned integer in the range 0 to 255. Forexample, if a bit string is 0110000010101100, the pro-cess is 0110000010101100→01100000,10101100→96,172.An eight-bit binary number B=(b7, b6, b5, b4, b3, b2, b1, b0) canbe converted into a decimal number I as follows:

I =b0 ∗ 20 + b1 ∗ 21 + b2 ∗ 22 + b3 ∗ 23

+ b4 ∗ 24 + b5 ∗ 25 + b6 ∗ 26 + b7 ∗ 27(1)

After binary conversion, the binary malware bit stringhas been converted into a 1-D vector of decimal numbers.According to a specified width, this one-dimensional arraycan be treated as a 2-D matrix of a certain width. Finally,the malicious code matrix is interpreted as a grayscale image.For simplicity, the width of the image is fixed, and the heightof the image varies depending on the size of the file. TableI, taken from [19], presents some recommended image widthsfor various file sizes, based on empirical observations.




TABLE IIMAGE WIDTH FOR VARIOUS FILE SIZES

File size Image width File size Image width

<10 kB 32 100 kB∼200 kB 38410 kB∼30 kB 64 200 kB∼500 kB 51230 kB∼60 kB 128 500 kB∼1000 kB 768

60 kB∼100 kB 256 >1000 kB 1024

(a) Agent.FYI family (b) Dontovo.A family

(c) Swizzor.gen!E family (d) Swizzor.gen!I family

Fig. 3. Illustration of malware images

Fig. 3 shows examples of malware images from differentfamilies [19]. As we can see, the images of the same malwarefamily are visually similar, and they differ noticeably from im-ages belonging to another family. For example, Fig. 3(a) showsfour variants of the malware Agent, called the Agent.FYIfamily. The size of each member is different, but they stillhave similarities because new malware is often created fromold software. In addition, when families are similar, imagescan display their differences clearly. For example, comparedwith the Swizzor.gen!I family in Fig. 3(c), the Swizzor.gen!Efamily in Fig. 3(d) has black stripes. Inspired by the visualsimilarity of malware images, we can classify and detectmalicious software by using image recognition methods. In ourcurrent work, we used a convolutional neural network (CNN)to recognize malware images.

B. Malware Image Classification based on ConvolutionalNeural Network

As a hot topic in speech analysis and image recognition,convolutional neural networks have developed quickly. Theirlocal perception and weight sharing network structure reducethe complexity of network models and the number of weights.A CNN has even further advantages when the input is multidi-mensional. For our work, the image can be considered as theinput of the network. Compared with traditional recognition al-gorithms, this method is not hampered by complicated feature

extraction and data reconstruction. Convolutional networksare a multi-layer perceptron designed for identifying two-dimensional shapes, and they are robust with respect to imagedeformation (e.g., translation, scale, rotation).

In this paper, we developed a CNN to classify malware.The structure of the CNN for grayscale image recognitionconsists of several components, as shown in Fig. 4. First is theinput layer, which brings the training images into the neuralnetwork. Next are the convolution and sub-sampling layers.The former layer can enhance signal characteristics and reducenoise. The latter can reduce the amount of data processingwhile retaining useful information. Then there are several fullyconnected layers that convert a two-dimensional feature into aone-dimensional feature that conforms to the classifier criteria.Finally, the classifier identifies and sorts the malware imagesinto different families according to their characteristics. Thedetailed iterative formula of each layer can be given as follows:

(1) Convolution layersThe convolution layer can reduce the number of image

parameters while preserving the main features, termed invari-ance, including translation invariance, rotation invariance, andscale invariance. This process can avoid overfitting effectively,and improve the generalization ability of the model. The inputis several maps, and the output is the maps after dimensionreduction. Each map is a combination of convolution valuesof input maps that belong to the upper layer [1], and can begiven by the following equation:

xlj = f(∑

i∈Mj

xl−1j ∗ klij + blj) (2)

where Mj is the collection of input maps, klij is the convo-lution kernel used for the connection between the ith inputfeature map and the jth output feature map, blj is the biascorresponding to jth feature map, and f is the activationfunction. The sensitivity is:

δlj = δl+1j W l+1

j ◦ f′(ul) = βl+1

j up(δl+1j ) ◦ f

′(ul) (3)

where l+1 layer is the sampling layer, the weight W representsa convolution kernel, and its value βl+1

j ◦ up(.) is an up-sampling operation.

The partial derivative of the error cost function with respectto bias b and convolution kernel k can be given as follows:

∂E

∂bj=

∑u,v

(δlj)u,v (4)

∂E

∂klij=

∑u,v

(δlj)u,v(pl−1i )u,v (5)

where (pl−1i )u,v is the patch for each convolution of xl−1

i andklij , (u, v) is the center of the patch, and its position value canbe obtained by convolution of the patch of the (u, v) positionin the input feature map and the convolution kernel klij .

(2) Sub-sampling layersThe sub-sampling layer is also called the pooling layer.

Generally, its convolution kernels are the maximum or meanof patch (maximum pooling, average pooling), and are notmodified by backward propagation. This layer can weakenthe influence of image deformation (e.g., translation, scale,




Input layer

convolution layer sub-sampling layer convolution layer sub-sampling layer fully connected layer…

Output layer

…

Fig. 4. Illustration of malware classification based on a CNN

rotation). It reduces the dimension of the feature map, im-proves the model’s accuracy, and avoids overfitting. In CNN,for each output of the sampling layer, the feature map is givenas follows:

xlj = f(down(xl−1j ) + blj) (6)

where down(.) represents a sub-sampling function, and b isbias. The sensitivity is calculated as shown:

δlj = δl+1j W l+1

j ◦ f′(ul) (7)

The partial derivative of the error cost function with respectto bias b can be given as follows:

∂E

∂bj=

∑u,v

(δlj)u,v (8)

So far, we can get the weight updating formula of the CNN,and use it to classify malware images. However, in the CNN,a fixed size image is needed for a structured network becausethat the connection matrix W of the full connection layeris a fixed size after training. For example, if the input andoutput sizes are 30 and 20 neurons, respectively, from theconvolution layer to the full connection layer, then the size ofthe weight matrix must be (30,20). Unfortunately, the imagesize of malware is not fixed, but varies with the size of thesoftware. Therefore, we cannot input these malware imagesdirectly into a CNN.

To solve this problem, in this paper, we reshaped themalware image to a fixed size square image (e.g., 48*48,96*96). In this way, the malware images were normalizedand could be entered directly into the CNN network forclassification. The advantage of normalization is that it reducesthe dimensionality of the image effectively, and is moreconducive to the training of the model. In the dimensionalityreduction process, inevitably some characteristic informationis lost.

For most images in our malware dataset, texture featurescan be preserved effectively whether there is an enlargementor reduction. An example can be found in the variants ofDontovo.A family shown in Fig. 5(a). The original imagesize was 64*257. After compression (24*24 and 48*48) orexpansion (96*96 and 192*192), the texture feature remainedsharp (the upper part is black, and the bottom is gray).However, for those big images with small textures feature, thereshape operation may cause the loss of critical information.For example, the textures feature (black dots at the bottom)

(a) Dontovo.A family

(b) Swizzor.gen!E family

24*24

48*48

96*96

192*192

24*24

48*48

96*96

192*192

Fig. 5. Reshape the malware image to a fixed size square image

of a variant from the Swizzor.gen!E family in Fig. 5(b) wasdamaged when it was compressed too much (from 512*718to 24*24 or 48*48).

IV. MALWARE IMAGE DATA EQUILIBRIUM

High-quality data is the key to machine learning and deeplearning. Rough sample data will affect the quality of themodel. In contrast, a model trained by high-quality data ismore robust (avoids overfitting), and can speed up modeltraining.

Malicious software usually consists of several families, andthe number of samples of each family varies widely. Thisvariation in data will lead to low accuracy and poor robustnessof the training model. To address this problem, we employeda data augmentation technique to improve data quality. Weproposed a data equalization method based on an intelligentoptimization algorithm (i.e., dynamic resampling based ona bat algorithm) to resolve the problem of imbalanced databetween different families.

A. Image Data Augmentation Technology

In deep learning, to avoid overfitting, we usually need toenter sufficient data to train the model. If the data sampleis relatively small, we can use data augmentation to increase




Fig. 6. Images generated by data augmentation for Swizzor.gen!E family

the sample, thereby restraining the influence of imbalanceddata. The appropriate data augmentation method can avoidoverfitting problems effectively, and improve the robustnessof the model.

Generally, transformation of the original image data (chang-ing the location of image pixels and ensuring that the featuresremain unchanged) is used to generate new data. There aremany kinds of data augmentation techniques for images, forexample, rotation/reflection, flip, zoom, shift, scale,contrast, noise, and color transformation.

Let A = [a1, a2, ..., a8] be a collection of these techniques.Mi=am, ..., an is the sequence of operations, and i is thelength of M . M2 = a1, a2 represents data augmentation byusing rotation transformation and flip transformation.

Typically, each augmentation technique has a weight λi,so the sequence of operations of weighted data augmentationtechniques can be given by

Mi = λmam, ..., λnan (9)

where λm is the weight of the augmentation technology, andam ⊆ A. Usually it is necessary to perform multiple dataaugmentations. For a grayscale image matrix m, Mk(m) rep-resents k augmentations. Fig. 6 shows some images generatedby data augmentation for the malware Swizzor.gen!E family.Clearly, most of the newly generated images retained theoriginal texture features. However, a few images lost some in-formation through inappropriate transition (i.e., overrotation).The images viewed in the lower-left corner lack a texturefeature (black dots at the bottom) of the Swizzor.gen!E family.

B. Data Equalization based on a Bat Algorithm

Imbalanced data is a common problem in machine learning,and it can affect the accuracy of the model in classificationproblems. In this paper, the malware image data was very un-even, as shown in Fig. 7. The maximum ratio was about 36:1.For example, the Allaple.A family had 2,949 images, whilethe Skintrim.N family had 80. Such differences have a verybig impact on a CNN’s performance in image classification.The classifying model trained according to the data of the twofamilies may achieve 97% accuracy when all the samples inthe Skintrim.N family are not identified. However, it is not areasonable classifier because it can only identify some data.

Algorithm 1: DRBA(dataset) operationInput: dataset: training set for CNN; bats: bat

population;Output: trainset: training set for each epoch;

1 (p, v)← getState(bats);2 while t < maxIterations do3 for bati ⊂ bats do4 (pt+1

i , vt+1i )← updateState(pti, vti);

5 templeset← reSample(dataset, pt+1i );

6 accuracyt+1i ← trainModel(templeset);

7 end8 bats← localSearch(bats);9 p∗ ← getBest(bats);

10 t = t+ 1;11 end12 trainset← reSample(dataset);13 Return(trainset);

In classification, resampling is a simple method used tohandle an unbalanced training set. The resampling methodconverts the imbalanced data set into a balanced dataset byprocessing the training set. It consists of two implementations:oversampling and undersampling. Oversampling is used tomake multiple copies of subsets. Undersampling is used toremove some samples from the sample set (i.e., to select onlysome samples).

For instance, take the Allaple.A family (2,949 images) andthe Skintrim.N family (80 images). If the training set for theformer consists of 1,000 samples, we can choose 500 of them.For the latter, because of the insufficient number of data, wemust repeat sampling until the training sample reaches 500.Because the training sample is less than the total sample, thetrained classification model for the Skintrim.N family willsuffer from overfitting. Therefore, the proper proportion ofsamples is important for the training set. Unfortunately, itis difficult to find an optimal proportion due to the complexweight combinations with dozens of families.

A swarm intelligent algorithm is an effective solution forcomplex optimization problems. Many researchers have madeseveral in-depth studies about swarm intelligence and its appli-cations [9], [16], [26]. Wang et al. [25] used the evolutionarymulti-objective optimization (EMO) algorithms to deal withthe floorplan problem in cyber-physical social system. Cui etal. [6] applied a modified cuckoo search algorithm to improvethe performance of DV-Hop.

To address the problem of weight combination, we proposeda dynamic resampling method based on the bat algorithm(DRBA). The bat algorithm is a novel swarm intelligence algo-rithm [27], inspired by the echolocation behavior of microbats.Because of its potential parallelism and robustness, severalresearchers applied the bat algorithm to the optimization prob-lems [2], [3], [5]. We employed it to optimize the samplingweights of multiple malware families. In the model trainingprocess, to balance samples properly, each class is resampledaccording to the weight for each epoch. Suppose the numberof malicious code families is N , the resampling weights are




0

500

1000

1500

2000

2500

3000N

umbe

r of

mal

war

e

Adialer

.C

Agent

.FYI

Allaple

.A

Allaple

.L

Aluero

n.ge

n!J

Autor

un.K

C2LOP.g

en!g

C2LOP.P

Dialpla

tform

.B

Donto

vo.A

Faker

ean

Insta

ntac

cess

Lolyd

a.AA1

Lolyd

a.AA2

Lolyd

a.AA3

Lolyd

a.AT

Male

x.gen

!J

Obfus

cato

r.AD

Rbot!g

en

Skintri

m.N

Swizzor

.gen

!E

Swizzor

.gen

!I

VB.AT

Wint

rim.B

X

Yuner

.A

Fig. 7. Number of malware images in different families

denoted by wi, i ⊂ N . For this optimization, the position of abat individual is a combination of sampling weights that canbe given by the following equation:

position = w1, ..., wn (10)

The accuracy of the training model is an objective function.The aim of the optimization is to find the optimal positionof the bat when the accuracy reaches the set threshold. Thedetailed resampling process is shown in Algorithm 1. Thefunction updateState (line 4) is used to update the velocityand position of bats according to the rules in [2]. The functionlocalSearch is the local searching of the bat population. Theaccuracy is the fitness, which can be obtained by the functionsreSample and trainModel (lines 5,6). As the names suggest,the former is a resampling function, and the latter is a functionthat trains the model according to resampling data.

V. EXPERIMENTAL EVALUATION

In this section, we present our evaluation of the effectivenessof the proposed method. Our experiments were based ona malware dataset from the Vision Research Lab [19]. Weemployed the neural network framework Caffe [13] to createand train a CNN. We ran our experiments on a PC with an IntelCore i5-4590 CPU (3.3GHz), Nvidia GeForce GTX 750TiGPU (2GB), and 8GB RAM.

A. Dataset and Experimental Setting

The dataset consisted of 9,342 grayscale images of 25malware families [19]. The number of samples in each familywas different. As shown in Fig. 7, there was a great deal ofimbalance in the number of raw data samples. Therefore, itwas necessary to use a data balancing method to balance thenumber of training samples.

First, we used data augmentation techniques to improve thequality of small data samples. The expansion of small datasamples can solve problems (e.g., overfitting) caused by theunbalanced training data, effectively. The specific settings forimage augmentation are shown in Table II.

In Table II, rotation range represents the rotationrange, width shift is the range of horizontal translation,

TABLE IISETTINGS OF GRAYSCALE MALWARE IMAGE AUGMENTATION

Methods Settings Methods Settings

rotation range 0.1 shear range 0.1width shift 0.1 zoom range 0.1height shift 0.1 horizontal flip truerescale 1/255 fill mode nearest

height shift is the range of vertical translation, rescale isthe ratio of image magnification or reduction, shear rangeis the range of projection transformation in the horizontal orvertical direction, and zoom range is the ratio of randomlyzooming image. In addition, we used the nearest method tofill the image when flipping or expansion.

For our model, we designed different architectures of theCNN for malware images with different sizes. For the 24*24inputs, our model had seven layers, including five hiddenlayers. The detailed structure was as follows: C1:8*20*20,S2:8*10*10, C3:16*8*8, S4:16*4*4, and C5:80*1*1. Eachtype of convolution kernel corresponded to a type of featuremap. The maps per layer referred to the number of featuremaps per layer. For the input of other sizes, as the sizeincreased, the number of layers increased (e.g., double sizemeant adding two layers). Furthermore, we used the imresizefunction with interpolation method bicubic (default) in MAT-LAB R2017a to resize/normalize the images.

The initial learning rate was set to 0.01, and the decaypolicy was inv. The maximum number of iterations was10000. Finally, the training was carried out using the GPU.Moreover, in the image data balancing method DRBA, forBA, the population size was 20, other parameters were thesame with [2].

For evaluation metrics, we used the accuracy, precisionand recall. These evaluation metrics are adopted frequently inthe research community to provide comprehensive assessmentsof imbalanced learning problems. These metrics are defined asfollows:

Precision =TP

TP + FP(11)

Recall =TP

TP + FN(12)

Accuracy =TP + TN

TP + TN + FP + FN(13)

where True Positive (TP) and False Positive (FP) are thenumber of file samples correctly and wrongly classified asmalicious, respectively. Similarly, True Negative (TN) andFalse Negative (FN) are the number of file samples correctlyand wrongly classified as benign, respectively.

B. Experimental Results

To validate the effectiveness and efficiency of the proposedapproach, we designed experiments to (1) verify the efficiencyof different data equalization methods, (2) ascertain the effectsof different malware image sizes, and (3) compare the results




1 2 3 4 5 6 7 8 9 10Epoch

0.5

0.6

0.7

0.8

0.9

1A

ccur

acy(

%)

NothingIDADRBAIDA+DRBA

(a) Accuracy

1 2 3 4 5 6 7 8 9 10Epoch

00.10.20.30.40.50.60.70.80.9

1

Loss

(%)


(b) Loss

Fig. 8. Performance of different classification models on the training set

of our proposed approach with the outcomes of other methodsfor detecting malicious code.

1) Efficiency of malware image data equalization: To ver-ify the validity of our proposed data equalization method,including the image data augmentation technique (IDA) andDRBA, we compared the results when no equalization methodwas used (Nothing) with models using IDA, DRBA, andIDA+DRBA. The performance of the different classificationmodels on the training set is shown in Fig. 8. As seen fromFig. 8(a), all methods achieved a good result with respect toaccuracy, even if the data set was not processed. The Nothingmethod may have demonstrated overfitting. Fig. 8(b) showsthe Loss curve, for which the results of the four methodswere similar.

Fig. 9 shows the performance of the different classificationmodels on the validation set. As we can see from Fig. 9(a), theNothing method reflected the classification of all the data withthe basic CNN model. Because of the imbalance of the sam-ples, it was easy to cause over-fitting, so the Nothing methodappeared to show a high level of accuracy. The accuracy ofthe DRBA method was the best. The accuracy rate increasedcontinuously with the increase of the epoch. The accuracyof the IDA method was not as good as its performance onthe training set, perhaps because some features of the imagewere lost in the image transformation process. There weresome differences between training data and test data. As seenfrom Fig. 9(b), the loss curve of DRBA and IDA+DRBA wasdecreasing, while the curve of Nothing was oscillating withina certain range, which meant it was overfitting.

1 2 3 4 5 6 7 8 9 10Epoch

0.5

0.6

0.7

0.8

0.9

1

Acc

urac

y(%

)


(a) Accuracy

1 2 3 4 5 6 7 8 9 10Epoch

00.10.20.30.40.50.60.70.80.9

1

Loss

(%)


(b) Loss

Fig. 9. Performance of different classification models on the validation set

2) Model performance with different malware image size:The CNN framework required all input images to have a fixedsize. Therefore, we reshaped the malware image to a fixedsize square image. The smaller the transformed image, themore obvious the texture features were destroyed. In contrast,the larger images had good fidelity, and the texture featureswere clear. We conducted experiments setting the malwareimage at 24*24, 48*48, 96*96 and 192*192, respectively. Theconfusion matrix is shown in Fig. 10.

From the results we can see that our model performed betteras the image became larger. However, a bigger image requiresmore training time. We noted that the performances at 96*96and 192*192 were similar, as shown in Fig. 10(c). Since theclassification model would take more time to train using thelarger image, the 96*96 image was a good choice for trainingon this malware dataset.

3) Comparison of our method with other malicious codedetection methods: To validate the effectiveness and efficiencyof our proposed approach, we compared our model with fourother malware detection models that used common imagefeature extraction methods. Each of these methods extractedfeatures of the malware images, and then applied a machinelearning algorithm (e.g., KNN, SVM) to identify and classifythe malware. For this comparison, we used a model based onDRBA with a 96*96 input image.

Table III shows the results of our approach compared withthe outcomes of the four combination models: GIST+KNN,GIST+SVM, GLCM+KNN, and GLCM+SVM. We can seethat our method had higher accuracy and faster detection




1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.93

0.02

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.03

0.96

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.86

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.07

0.00

0.00

0.02

0.00

0.00

0.00

0.51

0.27

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.11

0.06

0.00

0.00

0.00

0.00

0.00

0.01

0.01

0.00

0.02

0.24

0.56

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.08

0.09

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.96

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.96

0.00

0.00

0.02

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.88

0.09

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.04

0.00

0.00

0.12

0.89

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.01

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.01

0.02

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.12

0.07

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.56

0.43

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.11

0.10

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.24

0.40

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.12

0.00

0.00

0.00

0.00

0.02

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.93

Adialer.C

Agent.FYI

Allaple.A

Allaple.L

Alueron.gen!J

Autorun.K

C2LOP.gen!g

C2LOP.P

Dialplatform.B

Dontovo.A

Fakerean

Instantaccess

Lolyda.AA1

Lolyda.AA2

Lolyda.AA3

Lolyda.AT

Malex.gen!J

Obfuscator.AD

Rbot!gen

Skintrim.N

Swizzor.gen!E

Swizzor.gen!I

VB.AT

Wintrim.BX

Yuner.A

Adialer

.C

Agent

.FYI

Allaple

.A

Allaple

.L

Aluero

n.ge

n!J

Autor

un.K

C2LOP.g

en!g

C2LOP.P

Dialpla

tform

.B

Donto

vo.A

Faker

ean

Insta

ntac

cess

Lolyd

a.AA1

Lolyd

a.AA2

Lolyd

a.AA3

Lolyd

a.AT

Male

x.gen

!J

Obfus

cato

r.AD

Rbot!g

en

Skintri

m.N

Swizzor

.gen

!E

Swizzor

.gen

!I

VB.AT

Wint

rim.B

X

Yuner

.A

(a) 24*24

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.96

0.01

0.00

0.00

0.01

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.98

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.93

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.03

0.00

0.00

0.02

0.00

0.00

0.01

0.61

0.20

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.08

0.06

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.18

0.63

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.07

0.05

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.98

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.98

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.96

0.04

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.02

0.00

0.00

0.04

0.95

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.01

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.09

0.07

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.64

0.34

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.10

0.09

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.21

0.54

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.05

0.00

0.00

0.00

0.00

0.02

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.97

Adialer.C

Agent.FYI

Allaple.A

Allaple.L

Alueron.gen!J

Autorun.K

C2LOP.gen!g

C2LOP.P

Dialplatform.B

Dontovo.A

Fakerean

Instantaccess

Lolyda.AA1

Lolyda.AA2

Lolyda.AA3

Lolyda.AT

Malex.gen!J

Obfuscator.AD

Rbot!gen

Skintrim.N

Swizzor.gen!E

Swizzor.gen!I

VB.AT

Wintrim.BX

Yuner.A

Adialer

.C

Agent

.FYI

Allaple

.A

Allaple

.L

Aluero

n.ge

n!J

Autor

un.K

C2LOP.g

en!g

C2LOP.P

Dialpla

tform

.B

Donto

vo.A

Faker

ean

Insta

ntac

cess

Lolyd

a.AA1

Lolyd

a.AA2

Lolyd

a.AA3

Lolyd

a.AT

Male

x.gen

!J

Obfus

cato

r.AD

Rbot!g

en

Skintri

m.N

Swizzor

.gen

!E

Swizzor

.gen

!I

VB.AT

Wint

rim.B

X

Yuner

.A

(b) 48*481.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.01

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.98

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.69

0.17

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.03

0.04

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.16

0.68

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.03

0.03

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.98

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.02

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.06

0.07

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.74

0.31

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.07

0.08

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.20

0.62

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.02

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

Adialer.C

Agent.FYI

Allaple.A

Allaple.L

Alueron.gen!J

Autorun.K

C2LOP.gen!g

C2LOP.P

Dialplatform.B

Dontovo.A

Fakerean

Instantaccess

Lolyda.AA1

Lolyda.AA2

Lolyda.AA3

Lolyda.AT

Malex.gen!J

Obfuscator.AD

Rbot!gen

Skintrim.N

Swizzor.gen!E

Swizzor.gen!I

VB.AT

Wintrim.BX

Yuner.A

Adialer

.C

Agent

.FYI

Allaple

.A

Allaple

.L

Aluero

n.ge

n!J

Autor

un.K

C2LOP.g

en!g

C2LOP.P

Dialpla

tform

.B

Donto

vo.A

Faker

ean

Insta

ntac

cess

Lolyd

a.AA1

Lolyd

a.AA2

Lolyd

a.AA3

Lolyd

a.AT

Male

x.gen

!J

Obfus

cato

r.AD

Rbot!g

en

Skintri

m.N

Swizzor

.gen

!E

Swizzor

.gen

!I

VB.AT

Wint

rim.B

X

Yuner

.A

(c) 96*96

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.71

0.16

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.02

0.02

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.16

0.70

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.03

0.03

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.01

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.99

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.05

0.07

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.75

0.29

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.07

0.07

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.20

0.66

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

0.00

0.00

0.00

0.00

0.00

0.00

0.01

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

1.00

Adialer.C

Agent.FYI

Allaple.A

Allaple.L

Alueron.gen!J

Autorun.K

C2LOP.gen!g

C2LOP.P

Dialplatform.B

Dontovo.A

Fakerean

Instantaccess

Lolyda.AA1

Lolyda.AA2

Lolyda.AA3

Lolyda.AT

Malex.gen!J

Obfuscator.AD

Rbot!gen

Skintrim.N

Swizzor.gen!E

Swizzor.gen!I

VB.AT

Wintrim.BX

Yuner.A

Adialer

.C

Agent

.FYI

Allaple

.A

Allaple

.L

Aluero

n.ge

n!J

Autor

un.K

C2LOP.g

en!g

C2LOP.P

Dialpla

tform

.B

Donto

vo.A

Faker

ean

Insta

ntac

cess

Lolyd

a.AA1

Lolyd

a.AA2

Lolyd

a.AA3

Lolyd

a.AT

Male

x.gen

!J

Obfus

cato

r.AD

Rbot!g

en

Skintri

m.N

Swizzor

.gen

!E

Swizzor

.gen

!I

VB.AT

Wint

rim.B

X

Yuner

.A

(d) 192*192

Fig. 10. Confusion matrix of models with different malware image size

TABLE IIIPROPOSED MODEL COMPARED WITH OTHER MALICIOUS CODE DETECTION

METHODS

Methods Accuracy Precision Recall Running time

GIST+KNN 91.9 92.1 91.7 60msGIST+SVM 92.2 92.5 91.4 64ms

GLCM+KNN 92.5 92.7 92.3 45msGLCM+SVM 93.2 93.4 93.0 48msOur approach 94.5 94.6 94.5 20ms

speed. The performance of the other methods was somewhatpoorer because their use of GIST or GLCM was complex andtime-consuming.

VI. CONCLUSIONS AND FUTURE WORK

This paper proposed a novel method to improve the de-tection of malware variants through the application of deeplearning. First, this method transformed the malicious codeinto grayscale images. Next, the images were identified andclassified by a convolutional neural network that could extractthe features of the malware images automatically. Because ofthe effectiveness and efficiency of the CNN for identifyingmalware images, the detection speed of our model was sig-nificantly faster than speeds achieved by other approaches. Inaddition, our model provided an effective solution for the dataimbalance problem among different malware families. Ourexperimental results on 9,342 grayscale images of 25 malware

families showed that the proposed approach achieved 94.5%accuracy with good detection speed.

In this study, the CNN framework required all input imagesto have a fixed size, which limited our model. In future work,we would like to use the SPP-net model to allow images ofany size to be used as input. The SPP-net can extract featuresat variable scales thanks to a spatial pyramid pooling layer. Wecan introduce that layer into our model between the last sub-sampling layer and the fully connected layer to improve ourmodels flexibility. In addition, the transformation of maliciouscode into color images would be a good topic for futureresearch.

REFERENCES

[1] J. Bouvrie. Notes on convolutional neural networks. 2006.[2] X. Cai, X.-z. Gao, and Y. Xue. Improved bat algorithm with optimal

forage strategy and random disturbance strategy. International Journalof Bio-Inspired Computation, 8(4):205–214, 2016.

[3] X. Cai, H. Wang, Z. Cui, J. Cai, Y. Xue, and L. Wang. Bat algorithmwith triangle-flipping strategy for numerical optimization. InternationalJournal of Machine Learning and Cybernetics, 9(2):199–215, 2018.

[4] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant.Semantics-aware malware detection. In 2005 IEEE Symposium onSecurity and Privacy, pages 32–46. IEEE, 2005.

[5] Z. Cui, Y. Cao, X. Cai, J. Cai, and J. Chen. Optimal leach protocolwith modified bat algorithm for big data sensing systems in internetof things. Journal of Parallel and Distributed Computing, 2018.(doi:10.1016/j.jpdc.2017.12.014).

[6] Z. Cui, B. Sun, G. Wang, Y. Xue, and J. Chen. A novel oriented cuckoosearch algorithm to improve dv-hop performance for cyber–physicalsystems. Journal of Parallel and Distributed Computing, 103:42–52,2017.




[7] O. E. David and N. S. Netanyahu. Deepsign: Deep learning for automaticmalware signature generation and classification. In Neural Networks(IJCNN), 2015 International Joint Conference on, pages 1–8. IEEE,2015.

[8] D. Dıaz-Pernil, A. Berciano, F. Pena-Cantillana, and M. A. Gutierrez-Naranjo. Bio-inspired parallel computing of representative geometricalobjects of holes of binary 2d-images. International Journal of Bio-Inspired Computation, 9(2):77–92, 2017.

[9] H. Gao, Y. Du, and M. Diao. Quantum-inspired glowworm swarmoptimisation and its application. International Journal of ComputingScience and Mathematics, 8(1):91–100, 2017.

[10] J. R. Goodall, H. Radwan, and L. Halseth. Visual analysis of codesecurity. In Proceedings of the seventh international symposium onvisualization for cyber security, pages 46–51. ACM, 2010.

[11] K. Han, J. H. Lim, and E. G. Im. Malware analysis method usingvisualization of binary files. In Proceedings of the 2013 Research inAdaptive and Convergent Systems, pages 317–321. ACM, 2013.

[12] T. Isohara, K. Takemori, and A. Kubota. Kernel-based behavioranalysis for android malware detection. In 2011 Seventh InternationalConference on Computational Intelligence and Security (CIS), pages1011–1015. IEEE, 2011.

[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture forfast feature embedding. In Proceedings of the 22nd ACM internationalconference on Multimedia, pages 675–678. ACM, 2014.

[14] A. Khoshkbarforoushha, A. Khosravian, and R. Ranjan. Elasticitymanagement of streaming data analytics flows on clouds. Journal ofComputer and System Sciences, 89:24–40, 2017.

[15] A. Khoshkbarforoushha, R. Ranjan, R. Gaire, E. Abbasnejad, L. Wang,and A. Y. Zomaya. Distribution based workload modelling of continuousqueries in clouds. IEEE Transactions on Emerging Topics in Computing,5(1):120–133, 2017.

[16] P. Li, J. Zhao, Z. Xie, W. Li, and L. Lv. General central firefly algorithmbased on different learning time. International Journal of ComputingScience and Mathematics, 8(5):447–456, 2017.

[17] Q. Miao, R. Liu, Y. Quan, and J. Song. Remote sensing image fusionbased on shearlet and genetic algorithm. International Journal of Bio-Inspired Computation, 9(4):240–250, 2017.

[18] A. Moser, C. Kruegel, and E. Kirda. Limits of static analysis for malwaredetection. In Computer security applications conference, 2007. ACSAC2007. Twenty-third annual, pages 421–430. IEEE, 2007.

[19] L. Nataraj, S. Karthikeyan, G. Jacob, and B. Manjunath. Malwareimages: visualization and automatic classification. In Proceedings of the8th international symposium on visualization for cyber security, page 4.ACM, 2011.

[20] L. Nataraj, V. Yegneswaran, P. Porras, and J. Zhang. A comparativeassessment of malware classification using binary texture analysis anddynamic analysis. In Proceedings of the 4th ACM Workshop on Securityand Artificial Intelligence, pages 21–30. ACM, 2011.

[21] J. Schmidhuber. Deep learning in neural networks: An overview. Neuralnetworks, 61:85–117, 2015.

[22] Symantec. Internet security threat report, 2017.[23] P. Trinius, T. Holz, J. Gobel, and F. C. Freiling. Visual analysis of

malware behavior using treemaps and thread graphs. In Visualizationfor Cyber Security, 2009. VizSec 2009. 6th International Workshop on,pages 33–38. IEEE, 2009.

[24] M. Wagner, F. Fischer, R. Luh, A. Haberson, A. Rind, D. A. Keim,and W. Aigner. A survey of visualization systems for malware analysis.In Eurographics Conference on Visualization (EuroVis), pages 105–125,2015.

[25] G.-G. Wang, X. Cai, Z. Cui, G. Min, and J. Chen. High performancecomputing for cyber physical social systems by using evolutionarymulti-objective optimization algorithm. IEEE Transactions on EmergingTopics in Computing, 2017.

[26] L. Wang, H. Geng, P. Liu, K. Lu, J. Kolodziej, R. Ranjan, and A. Y.Zomaya. Particle swarm optimization based dictionary learning forremote sensing big data. Knowledge-Based Systems, 79:43–50, 2015.

[27] X.-S. Yang. A new metaheuristic bat-inspired algorithm. Nature inspiredcooperative strategies for optimization (NICSO 2010), pages 65–74,2010.

[28] Y. Ye, T. Li, D. Adjeroh, and S. S. Iyengar. A survey on malwaredetection using data mining techniques. ACM Computing Surveys(CSUR), 50(3):41, 2017.

[29] I. Yoo. Visualizing windows executable viruses using self-organizingmaps. In Proceedings of the 2004 ACM workshop on Visualization anddata mining for computer security, pages 82–89. ACM, 2004.

[30] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue. Droid-sec: deep learning in an-droid malware detection. In ACM SIGCOMM Computer CommunicationReview, volume 44, pages 371–372. ACM, 2014.

Zhihua Cui is a Professor of School of ComputerScience and Technology at Taiyuan University ofScience and Technology, China. He received hisPh.D. in Control Theory and Engineering fromXi’an Jiaotong University, China in 2008. He isthe Editor-in-Chief of International Journal of Bio-inspired Computation. His research interest includescomputational intelligence, stochastic algorithm, andcombinatorial optimization.

Fei Xue is a Lecturer at the School of Informa-tion, Beijing Wuzi University. He received the M.S.degree in Computer Application Technology fromTaiyuan University of Science and Technology, Chi-na in 2011. He received the P.D. degree in ComputerScience and Technology from Beijing University ofTechnology, China in 2016. His current researchinterests include swarm intelligence and networksecurity.

Xingjuan Cai is a Associate Professor of Schoolof Computer Science and Technology at TaiyuanUniversity of Science and Technology, China. Shereceived her Ph.D. in Control Theory and Engi-neering from Tongji University, China in 2017. Herinteresting includes bio-inspired computation andapplications.

Yang Cao is a Ph.D. candidate of College ofComputer Science, Beijing University of Technol-ogy. He received his B.S. degree (2011) and M.S.degree (2015) in Computer Science and Technologyfrom Taiyuan University of Science and Technology,China. His main research interests include swarmintelligence, network security, and big data analysis.

Gai-ge Wang is an associate professor in OceanUniversity of China, China. He served as AssociateEditor of IJCISIM, and an Editorial Board Mem-ber of IJBIC from 2016. His research interests areswarm intelligence, evolutionary computation, andbig data optimization. He just proposed five bio-inspired algorithms (monarch butterfly optimization,earthworm optimization algorithm, elephant herd-ing optimization, moth search algorithm, and rhinoherd).

Jinjun Chen received the Ph.D. degree in computerscience and software engineering from SwinburneUniversity of Technology, Australia. He is an asso-ciate professor in the Faculty of Engineering and IT,University of Technology Sydney (UTS), Australia.He is the director of Lab of Cloud Computing andDistributed Systems. His research interests includecloud computing, big data, workflow management,privacy and security, and related various researchtopics. He is a senior member of the IEEE.

ieee transactions on industrial informatics,...

Documents