

Research Article
AppFA: A Novel Approach to Detect Malicious Android Applications on the Network

Gaofeng He,1 Bingfeng Xu,2 and Haiting Zhu1

1School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
2College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China

Correspondence should be addressed to Bingfeng Xu; bingfengxu@njfu.edu.cn

Received 21 October 2017; Revised 29 January 2018; Accepted 13 March 2018; Published 17 April 2018

Academic Editor: Georgios Kambourakis

Copyright © 2018 Gaofeng He et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose AppFA, an Application Flow Analysis approach, to detect malicious Android applications (simply apps) on the network. Unlike most existing work, AppFA does not need to install programs on mobile devices or modify mobile operating systems to extract detection features. It is also able to handle encrypted network traffic. Specifically, we propose a constrained clustering algorithm to classify apps' network traffic and use Kernel Principal Component Analysis to build their network behavior profiles. After that, peer group analysis is used to detect malicious apps by comparing apps' network behavior profiles with the historical data and with the profiles of their selected peer groups. These steps can be repeated every several minutes to meet the requirement of online detection. We have implemented AppFA and tested it with a public dataset. The experimental results show that AppFA can cluster apps' network traffic efficiently and detect malicious Android apps with high accuracy and a low false positive rate. We have also evaluated the computational time of AppFA.

1. Introduction

Recently, the mobile platform has gained more and more popularity, and there is a large number and wide variety of feature-rich mobile applications (or apps) that users can install and experience. As an example, the Google Play store had more than 3 million apps by Sep. 2017 [1]. With these feature-rich apps, mobile users can work, play games, and communicate with each other anytime and anywhere. Meanwhile, since a majority of these apps need Internet access, they bring challenges to network security and management. For instance, malicious apps can compromise mobile users' privacy and steal users' confidential data [2, 3]. As indicated in [4], more than 70% of known malicious mobile apps (also known as mobile malware) steal user credentials and information. Mobile malware also poses new challenges to the security protection of enterprise networks. Malicious apps may act as bots and cause a dramatic increase in network traffic (DoS attack) [5].

Researchers have done extensive work to detect malicious mobile apps, and several methods have been proposed and evaluated [6–12]. Commonly, detecting malicious mobile apps involves several steps. First, detection features such as the user's operating behavior, API usage, and application network behavior are defined and extracted. Then detection models are constructed and, finally, new apps are compared with the constructed models for mobile malware detection. Depending on where these steps are performed, current approaches can be generally categorized into two main groups: client-side and server-side detection. For client-side detection, these steps are all performed on mobile devices, while, for server-side detection, the main steps are carried out on remote servers.

Even though server-side detection approaches are conducted remotely, their detection features are mainly collected and processed on mobile devices and later sent to remote servers for modeling and detection [13]. Therefore, current approaches are required to install some kind of program on mobile devices or modify operating systems (e.g., modify Android source code) to collect detection feature information. This increases the energy consumption of mobile devices. These methods are also difficult to apply to mobile device protection in large organizations.

Hindawi, Security and Communication Networks, Volume 2018, Article ID 2854728, 15 pages, https://doi.org/10.1155/2018/2854728


As illustrated in [14], it is hard to ensure that all mobile devices have installed information collection programs, and it is impractical to manually audit every employee's personal device because of privacy issues and the large number of mobile devices.

In this paper, we propose AppFA (App Flow Analysis), a novel approach to detect malicious Android apps from network traffic. Contrary to client-side and server-side methods, AppFA is implemented at the network level, and thus it is a new kind of network-side detection approach. Notably, AppFA does not need to install programs or modify operating systems to extract detection features; therefore, it is lightweight and easy to deploy. We note that the approach in [15] is also network-side, but it only analyzes HTTP traffic and cannot be applied to encrypted network traffic. Previous works [6, 7] have also proposed methods to detect mobile malware from network traffic. These methods need offline training and install programs (such as a VPN proxy) on mobile devices to learn exactly which flows come from which app for traffic labelling. Thus, they still belong to client-side or server-side approaches.

In this work, AppFA can analyze encrypted network traffic through network behavior profile construction. The apps' traffic is clustered by constrained clustering; thus, we do not need to install programs on mobile devices to determine the origin of app traffic. We also use peer group analysis to avoid offline model training. Our main contributions are the following:

(i) Providing a lightweight and efficient framework for detecting malicious Android apps on the network

(ii) Proposing an efficient algorithm for clustering mobile apps' network traffic

(iii) Outlining a method for detecting malicious apps by constructing network behavior profiles and using peer group analysis

(iv) Carrying out extensive experiments with a public dataset

The rest of this work is structured as follows. The motivation of our work is presented in Section 2. In Section 3, we discuss relevant related work in detail. Section 4 introduces the architecture and main components of AppFA. The methodology is presented in Section 5. Experimental evaluation and discussion of the proposed methods are presented in Section 6. Finally, in Section 7, we conclude the paper with a discussion of potential future work.

2. Motivation and Observations

As investigated by Statista, Android accounted for more than 86% of the global mobile OS market up to the 1st quarter of 2017 [16]. The popularity of Android devices makes the platform a desirable target. For example, the top 20 mobile malware programs are all related to Android [17]. One of the reasons for the prevalence of Android malware may be that Android app package elements can easily be modified by third parties [18]. With open-source tools such as apktool (https://github.com/iBotPeaches/Apktool) and jadx (https://github.com/skylot/jadx), malware writers can easily graft malicious code onto popular apps to ensure a wide diffusion of their malicious code. As evidence, in MalGenome [19], a reference dataset in the Android security community that is also used in our experiments, 80% of the malicious samples are known to be built by repackaging other apps. Therefore, in this work, we mainly focus on the detection of malicious repackaged Android apps.

It is popular to detect Android malware by code and resource file analysis [20], while very few studies consider malicious Android app detection at the network level. In order to detect Android malware on the network, we have analyzed the network traffic of Android apps carefully and obtained several observations. The first observation is that the network behaviors of repackaged apps are significantly different from those of their original versions. This observation is also validated by Shabtai et al. [7]. Taking the Android malware AnserverBot [21] as an example, the network behaviors of the repackaged app and the original app (com.camelgames.mxmotor) are compared in Table 1.

For comparison, each of the two apps was run on a real phone, and the network traffic was collected during the first 5 minutes. After that, the total packet sizes and the number of TCP connections were calculated as network behaviors. As shown in Table 1, there is a clear difference between the network behavior of the repackaged and original apps; in particular, the total packet sizes of the repackaged app are significantly larger than those of the original app (23Mb versus 1932 kb).

We further compared apps' network behaviors with those of their similar apps. Again taking com.camelgames.mxmotor as the example, we selected 3 similar apps from the Google Play store. The selection strategy is as follows: we first searched the keyword "moto" in the Google Play store; then we selected the top 3 game apps from the search results as the similar ones. The network behaviors of these apps are compared in Table 2. Their network behaviors are close to each other. This phenomenon of similar apps having similar network behaviors is also validated in [7]. These preliminary observations of the network behaviors of Android apps give us clues for the detection of Android repackaged malware.

We also observed that more and more Android apps adopt encrypted network connections to transmit data between smart devices and remote servers. For instance, com.baidu.BaiduMap generates several SSL flows at the startup stage, as shown in Figure 1. Therefore, methods for detecting Android repackaged malware should handle encrypted network traffic.

Based on the observations above, we propose to detect malicious Android apps by comparing apps' network behaviors with their historical data and with those of their similar apps. Comparison with the historical data helps find self-updating malicious apps, and comparison with the behaviors of similar apps can detect other types of repackaged malware. Self-updating is a new technique for repackaging apps, and it cannot be detected by regular static or dynamic analysis methods [7]. The details of network behavior construction and similar app selection are illustrated in Section 5. Meanwhile, a novel constrained clustering algorithm is elaborated for app traffic clustering. Thus, our method can be applied on the network straightforwardly and does not need to install programs on mobile devices to collect flow information.

Figure 1: Example of network flows of the Android app com.baidu.BaiduMap. The traffic is captured and analyzed by NetworkMiner. Note that there exist several encrypted SSL flows, and they outnumber the HTTP flows.

Table 1: Comparison of network behaviors of the original app com.camelgames.mxmotor and its repackaged version (repackaged by AnserverBot).

                  Total packet sizes    Number of TCP connections
Repackaged app    23Mb                  18
Original app      1932 kb               11

Table 2: Comparison of network behaviors of similar apps.

                                      Total packet sizes    Number of TCP connections
com.camelgames.mxmotor                1932 kb               11
air.com.aceviral.motox3m              1828 kb               14
com.skgames.trafficrider              1751 kb               13
com.topfreegames.bikeracefreeworld    1327 kb               9

3. Related Work

There has been extensive work on detecting malicious mobile apps. The surveys in [4, 5, 22, 23] cover mobile malware in the wild and the techniques proposed for detecting it. In this section, we mainly focus on behavior-based malware detection methods and only review the most closely related ones.

Generally, current behavior-based mobile malware detection approaches can be categorized into two main groups: client-side and server-side detection. Client-side detection approaches run locally and apply anomaly detection methods to the set of features that indicate the state of the app. The pBMDS [8] is based on correlating user inputs with system calls to detect anomalous activities. A Hidden Markov Model (HMM) is used to learn application and user behaviors from two major aspects: process state transitions and user operational patterns. Built upon these two aspects, the pBMDS identifies behavioral differences between user-initiated applications and malware-compromised ones. Zhang et al. [11] combined dynamic tracing of the permission requests for resource usage by applications with tracking of sensitive operations on the granted resources (using taint tracking). This combination enabled them to understand how applications utilize the permissions to access sensitive system resources. Dai et al. [24] presented a malware detection system for the Windows Mobile platform. They used API interception techniques for monitoring and analyzing the application's behavior and compared it to the patterns within a predefined library of malicious behavior characteristics. Shabtai et al. [7] presented a behavior-based anomaly detection system for detecting meaningful deviations in a mobile application's network behavior. A semisupervised C4.5 Decision Tree algorithm was used for learning the normal behavioral patterns and for detecting deviations from the application's expected behavior. Their methods were implemented and evaluated on Android devices. Damopoulos et al. [25] proposed a fully fledged tool able to dynamically analyze any iOS software in terms of method invocation, which can be used to trace the software's behavior and decide whether it contains malicious code.

Server-side detection approaches are carried out on remote servers, mainly motivated by the limited computational resources of the mobile device [26]. Burguera et al. [9] developed an Android framework named "Crowdroid" that includes a client application installed on the device. The application monitors Linux kernel system calls and sends them to a centralized server after preprocessing. On the server, a dataset is built from the list of system calls, the list of running applications, and the device information. The $K$-means algorithm is then used for clustering the applications into two groups, that is, benign and malware applications. Shamili et al. [13] utilize a distributed Support Vector Machine algorithm for malware detection on a network of mobile devices. Features related to phone calls, SMSs, and data communication are used for detection. During the training phase, support vectors (SVs) are learned locally on each device and then sent to the server, where SVs from all of the client devices are aggregated. Finally, the server distributes the whole set of SVs to all the clients, and each client updates its own SVs.

Unlike the work mentioned above, our proposed system runs at the network level directly, without necessarily having access to the mobile devices. The most similar work was carried out by Chen et al. [15]. Their method was also implemented at the network level and identified abnormal network behaviors by conducting a 3-step check, including (1) identifying HTTP POST and HTTP GET packets, (2) checking whether the device was exposing unique device identifiers such as IMEI and IMSI, and (3) determining the legitimacy of the remote server by querying the domain name server. Their method can therefore only be applied to HTTP traffic. Contrary to the work of Chen et al. [15], we use the combination of signature matching and constrained mobile network traffic clustering for app identification, and thus our method is suitable for both plaintext and ciphertext traffic such as HTTPS. The work introduced in [6] is also intended to detect Android malware from network traffic. In Garg et al. [6], detection features were first extracted from DNS, HTTP, and TCP traffic, and then machine learning algorithms (such as Decision Trees, Bayesian Networks, and Random Forests) were used to detect malicious mobile apps. However, their detection features were obtained on the mobile device, and they needed to train classification models offline. In AppFA, we extract detection features directly on the network to build apps' network behavior profiles and use peer group analysis to avoid offline model training.

Figure 2: Main components of AppFA (Packet Filter, Session Builder, Packet Content Extractor, Basic Feature Extractor, Identification Feature Generator, App Traffic Clustering, Profile Feature Generator, App Behavior Profile Constructor, and Malicious App Detection, connected through flow sessions and app sessions).

Recently, an increasing number of research works analyze network traffic to identify apps, such as [27–32]. However, most of them focused on plaintext flows (e.g., HTTP) and tried to collect identification features from HTTP headers. These methods may fail due to the emergence of encrypted network traffic. In this paper, we investigate a new constrained clustering method to cluster the network traffic generated by the same app.

4. System Design

The system architecture of AppFA is shown in Figure 2. It is of modular design, and each module has a specific function. The Packet Filter module captures link-layer frames or reads them from a file and filters them according to configurable rules. In order to detect malicious apps online and provide support for network management, AppFA is designed to analyze the first $n$ nonzero packets (packets that contain application data) of each network flow. The parameter $n$ is configurable for different network management purposes. For example, the value of $n$ can be set to a small positive number such as 50 for real-time malicious app detection and to −1 for full analysis, namely, considering all nonzero packets in flows. In AppFA, the Packet Filter module can be implemented on top of a well-known library such as libpcap (http://www.tcpdump.org/release).
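The role of the parameter $n$ can be illustrated with a minimal sketch. The packet representation (a timestamp paired with a payload length) and the function name `first_nonzero_packets` are assumptions made for this illustration; they are not taken from the AppFA implementation.

```python
from typing import Iterable, List, Tuple

# Hypothetical packet model: (timestamp_seconds, payload_length_bytes).
Packet = Tuple[float, int]


def first_nonzero_packets(packets: Iterable[Packet], n: int) -> List[Packet]:
    """Keep the first n packets of a flow that carry application data.

    n == -1 selects full analysis, i.e. every nonzero packet is kept.
    """
    if n == 0 or n < -1:
        raise ValueError("n must be a positive integer or -1")
    kept: List[Packet] = []
    for pkt in packets:
        if pkt[1] == 0:
            # Skip packets without payload (e.g. bare ACKs).
            continue
        kept.append(pkt)
        if n != -1 and len(kept) == n:
            break
    return kept
```

With a small $n$ the loop stops early, which is what makes the real-time setting cheap; with −1 the whole flow is scanned.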

The Session Builder module organizes network traffic into sessions. For app identification and malicious behavior detection, we define two types of sessions: flow session and app session. The flow session is defined by the {source IP, source port, destination IP, destination port, transport protocol} tuple, where source and destination can be swapped and the transport protocol is mainly considered to be TCP or UDP in this work. In deployment, AppFA determines that a flow session is completed when one of the following three conditions holds: (1) $n$ nonzero packets have been received; (2) an RST/FIN packet is detected; (3) a timeout occurs, for example, no packet is exchanged for 3 minutes. With flow sessions, one can extract basic features and packet contents efficiently for app identification and malicious behavior detection. The app session is defined as the set of all flow sessions generated by the same app: {flow session 1, flow session 2, ..., flow session $m$}. The app sessions constructed by the Session Builder module are used by the Profile Feature Generator module to construct app network behavior profiles, as depicted on the right of Figure 2.
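A direction-independent flow key and the three completion conditions can be sketched as follows. This is only an illustration under stated assumptions: the names `FlowKey`, `flow_key`, and `session_complete` are hypothetical, and sorting the two endpoints is one common way to make source and destination interchangeable, not necessarily the one used in AppFA.

```python
from typing import NamedTuple, Tuple


class FlowKey(NamedTuple):
    endpoint_a: Tuple[str, int]
    endpoint_b: Tuple[str, int]
    protocol: str


def flow_key(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
             protocol: str) -> FlowKey:
    """Build a key that is identical for both directions of a flow."""
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return FlowKey(a, b, protocol.upper())


def session_complete(nonzero_count: int, n: int, saw_rst_or_fin: bool,
                     idle_seconds: float, timeout: float = 180.0) -> bool:
    """Return True when any of the three completion conditions holds."""
    reached_n = n != -1 and nonzero_count >= n   # (1) n nonzero packets seen
    timed_out = idle_seconds >= timeout          # (3) e.g. 3 minutes idle
    return reached_n or saw_rst_or_fin or timed_out  # (2) RST/FIN observed
```

The default timeout of 180 seconds mirrors the 3-minute example in the text; in practice it would be a configuration value.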

The Basic Feature Extractor module extracts basic packet features from flow sessions, which include packet size, packet interarrival time, packet order, and packet direction. The packet direction feature distinguishes outgoing from incoming packets. Other advanced features, such as flow duration, total packets sent and received, and burst sizes, commonly used in traffic analysis [33–35], can be calculated from these basic features. So we extract these basic features first and then generate appropriate advanced features (identification or detection features) for different traffic analysis purposes (clustering or peer group comparison). After the basic feature extraction, a flow may look like +50 *30 +100 *500 −1300 *20 −400, where outgoing and incoming packet sizes are denoted by positive and negative signs and the packet interarrival times are labelled by an asterisk. The packet order is reflected inherently by the number sequence. For example, for the flow +50 *30 +100 *500 −1300 *20 −400, the first packet is an outgoing packet whose size is 50 bytes, the second packet is also an outgoing packet (of 100 bytes), and the time interval between them is 30 ms. The third packet is an incoming packet whose size is 1300 bytes. The time interval between the second and the third packet is 500 ms, and so forth.

Figure 3: Example of key-value pairs.
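The signed-size and asterisk notation can be reproduced with a short sketch. The input representation (timestamp in milliseconds, size in bytes, direction flag) and the function name `encode_flow` are assumptions chosen for the illustration.

```python
from typing import List, Tuple

# Hypothetical packet model: (timestamp_ms, size_bytes, is_outgoing).
PacketRecord = Tuple[int, int, bool]


def encode_flow(packets: List[PacketRecord]) -> List[str]:
    """Encode a flow as signed sizes interleaved with interarrival times."""
    tokens: List[str] = []
    previous_ts = None
    for ts, size, outgoing in packets:
        if previous_ts is not None:
            # Interarrival time between this packet and the previous one.
            tokens.append(f"*{ts - previous_ts}")
        # '+' marks an outgoing packet, '-' an incoming one.
        tokens.append(f"+{size}" if outgoing else f"-{size}")
        previous_ts = ts
    return tokens


example = [(0, 50, True), (30, 100, True), (530, 1300, False), (550, 400, False)]
print(" ".join(encode_flow(example)))
```

For the four packets in `example`, the encoding matches the flow discussed in the text: two outgoing packets 30 ms apart, followed by two incoming packets after 500 ms and 20 ms.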

Note that the packet contents are also saved by the Packet Content Extractor module for app identification. Similar to [30], we focus on key-value pairs in HTTP headers. In detail, justniffer (http://justniffer.sourceforge.net) is first used to transform the raw packet traffic into HTTP messages. Then the HTTP messages are tokenized by several tokenizers, such as space, "\r\n", "", and "&", and each HTTP request is broken into various parts, including method, page, and query. Finally, queries are divided into key-value pairs. Figure 3 is an example of key-value pairs used in our experiments.
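Splitting a request into method, page, and query key-value pairs can be sketched with Python's standard `urllib.parse` module. This stands in for the justniffer-based pipeline only for illustration; the function name and the sample request line (including its parameter names) are hypothetical.

```python
from typing import List, Tuple
from urllib.parse import parse_qsl, urlsplit


def request_key_values(request_line: str) -> Tuple[str, str, List[Tuple[str, str]]]:
    """Split an HTTP request line into method, page, and query key-value pairs."""
    method, target, _version = request_line.split(" ", 2)
    parts = urlsplit(target)
    # parse_qsl splits the query on '&' and each pair on '='.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return method, parts.path, pairs


# Hypothetical request used only to show the shape of the output.
line = "GET /ads/fetch?app_id=com.example.game&sdk=1.2 HTTP/1.1"
print(request_key_values(line))
```

Each resulting pair, such as the hypothetical `app_id` value above, is the kind of token that can serve as an app signature.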

After obtaining the basic features from flows, we generate identification features in the Identification Feature Generator module. After that, we identify the app for each flow through the App Traffic Clustering module, which returns the clustering results to the Session Builder module for forming app sessions. With the app session information, detection features are created in the Profile Feature Generator module. Finally, app network behavior profiles are constructed in the App Behavior Profile Constructor module, and malicious apps are detected in the Malicious App Detection module. The independent treatment of different functional modules makes the AppFA architecture clear and scalable.

5. Methodology

As depicted in Figure 2, app traffic clustering and malicious app detection (including identification/detection feature generation) are the core components of AppFA, and their technical details are clarified in the following subsections.

5.1. App Traffic Clustering. The basic idea of our app traffic clustering method is illustrated in Figure 4. In the figure, there are two apps, and their flow sessions are represented by solid and dotted lines, respectively. For each flow, signature matching is first used to identify HTTP flows that have recognized signatures. Note that, in this paper, we use the term signature for plaintext matching and feature for traffic analysis. With this step, two flows are identified (labelled as red and green), as shown in the left dashed box of Figure 4. After the signature matching, a constrained clustering algorithm (the second dashed box in Figure 4) is used to cluster all flows. Compared to ordinary clustering algorithms, a constrained clustering algorithm uses background information (i.e., identified flows) to improve clustering accuracy. By constrained clustering, flows that cannot be identified by signatures, such as encrypted traffic, are classified into appropriate clusters, that is, apps. Finally, we obtain app sessions that are used for creating apps' network behavior profiles, as depicted in the last step in Figure 4.

Formally, the entire app identification procedure is described in Algorithm 1. In the while loop, the clustering signal can be a timeout for cyclic identification (e.g., every 1 hour) or the completion of an app session for real-time identification. For the details of selecting the initial signature seeds $S_d$, one can refer to [30].
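The control flow of Algorithm 1 can be sketched as a loop over clustering signals. This is a skeleton under stated assumptions: the three callables (`match_signatures`, `constrained_cluster`, `learn_signatures`) are hypothetical placeholders for the steps named in the algorithm, and each clustering signal is modelled simply as the batch of flows available when it arrives.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set


def app_traffic_clustering(
    signals: Iterable[List[Hashable]],
    seed_signatures: Set[str],
    match_signatures: Callable[[List[Hashable], Set[str]], Dict],
    constrained_cluster: Callable[[List[Hashable], Dict], Dict],
    learn_signatures: Callable[[Dict], Set[str]],
) -> List[Dict]:
    """Run one matching and clustering round per clustering signal."""
    signatures: Set[str] = set()            # S in Algorithm 1
    results: List[Dict] = []
    for flows in signals:                   # "while the clustering signal is received"
        if not signatures:                  # S is empty: bootstrap from the seeds S_d
            signatures = set(seed_signatures)
        labelled = match_signatures(flows, signatures)
        clusters = constrained_cluster(flows, labelled)
        signatures |= learn_signatures(clusters)   # update S with the clustering results
        results.append(clusters)
    return results
```

Passing the steps in as callables keeps the loop independent of how matching and clustering are actually implemented.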

Algorithm 1 uses a method similar to the FLOWR system [30] to carry out signature matching (line (6)) and takes advantage of the constrained $K$-means clustering algorithm [36] for flow clustering (line (7)). For signature matching, the key-value pairs in the HTTP header are considered as the app's signatures, and an initial set of seeding app signatures is set up to bootstrap the learning of new ones. Compared to FLOWR, we refine the process of counting co-occurrence of app signatures with the constrained clustering results. In FLOWR, if the start time of $flow_2$ is less than $T$ seconds after the start time of $flow_1$, their signatures are considered a co-occurrence instance. However, as noted in [30], if $T$ is overestimated, FLOWR is more likely to mix flows from different apps, thus inducing noise and overutilizing system resources. To overcome this problem, in AppFA, we consider the clustering results in addition to temporal information when counting co-occurrence of app signatures. That is, only flows that occur within $T$ seconds of each other and in the same cluster are counted as a co-occurrence. This reduces noise, since the flows are filtered by constrained clustering.
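The refined co-occurrence rule can be sketched as follows. The flow representation (start time, cluster id, one signature per flow) and the function name `count_cooccurrences` are simplifying assumptions; FLOWR's actual bookkeeping is more elaborate.

```python
from collections import Counter
from itertools import combinations
from typing import FrozenSet, List, Tuple

# Hypothetical flow model: (start_time_seconds, cluster_id, signature).
Flow = Tuple[float, int, str]


def count_cooccurrences(flows: List[Flow], T: float) -> Counter:
    """Count signature pairs whose flows start within T seconds in the same cluster."""
    counts: Counter = Counter()
    ordered = sorted(flows, key=lambda f: f[0])
    for (t1, c1, s1), (t2, c2, s2) in combinations(ordered, 2):
        close_in_time = t2 - t1 < T     # FLOWR's temporal condition
        same_cluster = c1 == c2         # the additional cluster filter
        if s1 != s2 and close_in_time and same_cluster:
            pair: FrozenSet[str] = frozenset((s1, s2))
            counts[pair] += 1
    return counts
```

Dropping the `same_cluster` condition gives the purely temporal rule, which makes it easy to see how the cluster filter removes pairs from different apps that merely start close together.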

After signature matching, constrained $K$-means clustering is carried out to identify the remaining unknown flows. Algorithm 2 shows the constrained clustering algorithm used in AppFA. In the constrained flow clustering, the flows identified by signature matching that belong to the same app must be clustered into one cluster (must-link constraints, lines (4)–(8) in Algorithm 2), and those generated by different apps must be clustered into different clusters (cannot-link constraints, lines (9)–(13) in Algorithm 2). The clustering features are listed in Table 3 (11 features in total). We select the time of the first packet sent as one of the features because flows observed within short time intervals are likely to come from the same app [32]. The other features are chosen because they have proved to be efficient in clustering network traffic [36].
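How must-link and cannot-link constraints restrict cluster assignment can be sketched in the style of the well-known COP-KMeans assignment step. This is an assumption about the general technique and may differ from Algorithm 2 in its details; the function names and the use of squared Euclidean distance are choices made for the illustration.

```python
from typing import Dict, Optional, Sequence, Set, Tuple

Pair = Tuple[str, str]


def violates(flow: str, cluster: int, assignment: Dict[str, int],
             must_link: Set[Pair], cannot_link: Set[Pair]) -> bool:
    """Check whether putting `flow` into `cluster` breaks a constraint."""
    for a, b in must_link:
        if flow in (a, b):
            other = b if flow == a else a
            # A must-link partner that is already placed elsewhere is a violation.
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if flow in (a, b):
            other = b if flow == a else a
            # A cannot-link partner in the same cluster is a violation.
            if assignment.get(other) == cluster:
                return True
    return False


def assign(flow: str, features: Sequence[float],
           centroids: Sequence[Sequence[float]], assignment: Dict[str, int],
           must_link: Set[Pair], cannot_link: Set[Pair]) -> Optional[int]:
    """Return the nearest centroid index that violates no constraint, or None."""
    def sqdist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(u, v))

    by_distance = sorted(range(len(centroids)),
                         key=lambda c: sqdist(features, centroids[c]))
    for c in by_distance:
        if not violates(flow, c, assignment, must_link, cannot_link):
            return c
    return None
```

In a full clustering loop, this assignment step would alternate with recomputing the centroids until the assignment stops changing.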

In Algorithm 2 we set the value of cluster number 119901 to beequal to the number of apps identified by signaturematchingThis is because the accuracy of mobile app identification isalready higher than 95 [29 32] and it is proper to assumethat the popular apps (malware writers usually graft somemalicious code on popular apps to ensure a wide diffusion

6 Security and Communication Networks

Flow sessionspacket featuresand contents

SignatureMatching

ConstrainedClustering

App sessions

App 1

App 2

Figure 4 Illustration of the basic idea of app identification Solid and dotted lines represent different apps and black color indicates that theflows have not been identified

(1) let 119878 denotes signature set and 119878119889 stands for the signatureseeds

(2) while the clustering signal is received do(3) if 119878 is empty then(4) 119878 = 119878119889(5) end if(6) carry out signature matching(7) carry out constrained flow clustering(8) update 119878 with the clustering results(9) end while

Algorithm 1 App traffic clustering procedure

Table 3 Identification feature set

Time of the first packet sentNumber of packetsVolume of bytesMin max mean and std dev of packet sizeMin max mean and std dev of interpacket time

Table 4 Network behavior profile feature set

Number of flowsNumber of packets of outgoing and incoming flowsVolume of bytes of outgoing and incoming flowsFirst 1198961 results of KPCA transformations of outgoing packet sizesFirst 1198962 results of KPCA transformations of incoming packet sizesFirst 1198961 results of KPCA transformations of outgoing packetintervalsFirst 1198962 results of KPCA transformations of incoming packetintervals

of their malicious code) can all be recognized. Therefore, each cluster corresponds to one app when the constrained $K$-means clustering finishes, which suits the subsequent network behavior profile construction and malicious behavior detection. In practice, if the actual number of apps is greater than $p$, namely, some apps cannot be identified by signature matching, the corresponding flows will be misclustered. This may change the observed network behaviors of apps, since unrelated flows will be included and traffic features such as the number of packets and the volume of bytes will be inflated. In this work, we use Kernel Principal Component Analysis to mitigate this problem, as described in the following subsection.
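The constrained clustering idea of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration and not the CCCG-based prototype: the function names, the tuple-based flow representation, and the toy data are hypothetical. It also assumes that, when the closest cluster violates a constraint, the next-closest feasible cluster is tried (in the spirit of COP-KMeans), which is one possible reading of the "goto" steps in Algorithm 2.

```python
import math


def violates(i, cluster, assign, must, cannot):
    """Return True if putting flow i into `cluster` breaks a constraint."""
    for a, b in must:
        if i in (a, b):
            other = b if i == a else a
            # A must-link partner that is already placed elsewhere is a violation.
            if assign.get(other, cluster) != cluster:
                return True
    for a, b in cannot:
        if i in (a, b):
            other = b if i == a else a
            if assign.get(other) == cluster:
                return True
    return False


def constrained_kmeans(flows, centers, must=(), cannot=(), rounds=20):
    """Cluster feature vectors `flows` around the given initial centers."""
    centers = [list(c) for c in centers]
    assign = {}
    for _ in range(rounds):
        assign = {}
        for i, f in enumerate(flows):
            # Try clusters from nearest to farthest until one is feasible.
            order = sorted(range(len(centers)),
                           key=lambda c: math.dist(f, centers[c]))
            for c in order:
                if not violates(i, c, assign, must, cannot):
                    assign[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for flow {i}")
        new_centers = []
        for c, old in enumerate(centers):
            members = [flows[i] for i in assign if assign[i] == c]
            new_centers.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else old)
        if new_centers == centers:          # centers stopped moving
            break
        centers = new_centers
    return assign, centers


# Toy data: flow 4 lies slightly closer to the first center, but a
# must-link constraint (e.g., derived from a shared signature) ties it to flow 2.
flows = [(0, 0), (0, 1), (5, 5), (5, 6), (2.4, 2.4)]
free, _ = constrained_kmeans(flows, [(0, 0), (5, 5)])
tied, _ = constrained_kmeans(flows, [(0, 0), (5, 5)], must=[(4, 2)])
```

In this toy run, the unconstrained call groups flow 4 with flows 0 and 1, whereas the must-link constraint moves it into the cluster of flow 2.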

5.2. Network Behavior Profile Construction. After app identification, the network behavior profiles of apps are constructed from app sessions. In AppFA, we define an app's network behavior profile as a set of chosen network traffic features, as listed in Table 4. Formally, the network behavior profile is defined as follows:

$$\mathit{netBehaviorProfile} := \{\mathit{feature}_1, \ldots, \mathit{feature}_n\}, \quad \mathit{feature}_i \in \{\text{connection features}, \text{data features}\}. \tag{1}$$

Since malicious apps need to establish network connections to transmit confidential data or carry out defined attack steps [18], in this work we mainly choose connection features and data features for constructing apps' network behavior profiles, as shown in (1). The connection features describe how many network connections have been established, and the data features represent characteristics of packets. All features selected for apps' network behavior profiles are listed in Table 4. In the table, the first two lines are connection features and the rest are data features.

Furthermore, to overcome the misclassification problem (as illustrated in Section 5.1) and to distinguish minor traffic variations from significant differences, we do not use these features directly. Instead, KPCA (Kernel Principal Component Analysis) is applied to transform basic features such as packet size and packet time interval. Basically, KPCA


Input: Data set $F$, cluster number $p$, must-link constraints $L_m \subseteq F \times F$, cannot-link constraints $L_c \subseteq F \times F$
Output: Flow clusters
(1) Let $c_1, \cdots, c_p$ be the initial cluster centers
(2) for each flow $f_i$ in $F$ do
(3)     select the closest cluster $c$
(4)     for each $(f_i, f) \in L_m$ do
(5)         if $f \notin c$ then
(6)             goto step (2)
(7)         end if
(8)     end for
(9)     for each $(f_i, f) \in L_c$ do
(10)         if $f \in c$ then
(11)             goto step (2)
(12)         end if
(13)     end for
(14)     assign $f_i$ to the cluster $c$
(15) end for
(16) for each cluster $c_i$ do
(17)     update its center by averaging all of the flows $f_j$ that have been assigned to it
(18) end for
(19) iterate between step (2) and step (18) until convergence
(20) return $c_1, \cdots, c_p$

Algorithm 2: Constrained flow clustering algorithm.

is one approach to generalizing linear PCA to the nonlinear case using the kernel method. KPCA has been reported to perform very well in feature extraction and to be robust to noise [37]. The details of KPCA as used in AppFA are as follows.

First, the Gaussian function defined in (2) is selected as the kernel function:

$$g(x, y) = \exp\left(-\gamma \|x - y\|^2\right), \tag{2}$$

where the value of $\gamma$ is set to 0.001, as indicated in [37]. Then, we compute a Gram (kernel) matrix $K$ with

$$K_{ij} = g\left(x^{(i)}, x^{(j)}\right). \tag{3}$$

Next, the kernel matrix $K$ is centered via the following function:

$$K_{\text{centered}} = K - 1_N K - K 1_N + 1_N K 1_N = (I - 1_N) K (I - 1_N), \tag{4}$$

where $1_N$ is an $N \times N$ matrix with all elements equal to $1/N$ and $N$ is the number of data points.
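To make the centering step in (4) concrete, the sketch below builds the Gram matrix of (3) with the Gaussian kernel of (2) and centers it element-wise. It is illustrative only: the function names are hypothetical and the three one-dimensional packet sizes are made-up inputs.

```python
import math


def gaussian_kernel(x, y, gamma=0.001):
    """Equation (2): g(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram_matrix(points, gamma=0.001):
    """Equation (3): K_ij = g(x_i, x_j)."""
    return [[gaussian_kernel(p, q, gamma) for q in points] for p in points]


def center(K):
    """Equation (4), written element-wise.

    Multiplying by 1_N (all entries 1/N) averages over rows or columns, so
    K_centered[i][j] = K[i][j] - row_mean[i] - col_mean[j] + total_mean.
    """
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


# Hypothetical packet sizes (bytes), one feature per data point.
sizes = [[60.0], [90.0], [120.0]]
K = gram_matrix(sizes)
K_centered = center(K)
```

After centering, every row and column of the matrix sums to zero (up to rounding), which is the property the subsequent eigen-decomposition relies on.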

After that, the nonzero eigenvalues $\lambda_i$ and the eigenvectors $\alpha_i$ of the centered kernel matrix $K_{\text{centered}}$ are calculated as follows:

$$K_{\text{centered}} \alpha_i = \lambda_i \alpha_i. \tag{5}$$

Also, the eigenvectors $\alpha_i$ are normalized as

$$\alpha_i = \frac{1}{\sqrt{\lambda_i N}} \alpha_i. \tag{6}$$

Finally, we sort the eigenvectors in descending order of the corresponding eigenvalues and perform projections onto the given subset of eigenvectors. This step can be represented as follows:

$$d_j = \sum_{i=1}^{N} \alpha_{ji}\, g(x, x_i), \quad j = 1, 2, \ldots, k, \tag{7}$$

where $k$ is the dimension of the new data.

For each app, the length of its network behavior profile is $1 + 2(f_o + f_i) + 2(k_1 + k_2)$, where $f_o$ is the number of outgoing flows and $f_i$ is the number of incoming flows.
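The KPCA steps (2)-(7) can be put together as in the following NumPy sketch. It is not the scikit-learn-based code used in the prototype; the function name and the sample packet sizes are hypothetical, and the rescaling by $1/\sqrt{\lambda_i N}$ follows our reading of (6). The sketch assumes that $k$ is smaller than the number of distinct training points, so that the selected eigenvalues are positive.

```python
import numpy as np


def kpca_first_k(train, new_point, k, gamma=0.001):
    """Project `new_point` onto the top-k kernel principal components."""
    X = np.asarray(train, dtype=float)
    x = np.asarray(new_point, dtype=float)
    n = len(X)
    # Equations (2)-(3): Gaussian Gram matrix of the training points.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq_dist)
    # Equation (4): centering with the constant matrix 1_N (entries 1/N).
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Equation (5): eigen-decomposition, kept in descending eigenvalue order.
    lam, alpha = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1][:k]
    lam, alpha = lam[order], alpha[:, order]
    # Equation (6), as read here: rescale each eigenvector by 1/sqrt(lambda_i * N).
    alpha = alpha / np.sqrt(lam * n)
    # Equation (7): d_j = sum_i alpha_ji * g(x, x_i).
    g = np.exp(-gamma * ((X - x) ** 2).sum(axis=1))
    return g @ alpha


# Hypothetical outgoing packet sizes (bytes) used as training points.
train_sizes = [[60.0], [200.0], [590.0], [1400.0], [1500.0]]
projection = kpca_first_k(train_sizes, [580.0], k=2)
```

With $k_1 = k_2 = 4$, for instance, each of the four KPCA rows of Table 4 would contribute four such values to the profile.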

5.3. Malicious App Detection. With apps' network behavior profiles, we propose to use peer group analysis [38] to detect malicious apps. The main idea of the detection method is illustrated in Figure 5. An app's network behavior profile is compared with both its historical data and the profiles of its peer group for malware detection. Comparison with the historical profiles helps find self-updating malicious apps [7], and comparison with the profiles of the peer group can detect repackaged malicious apps, as observed in Section 2.

For AppFA, an app's peer group is defined as the set of its behavior-similar apps. Apps are considered behavior-similar if they satisfy the following two conditions: (A) their main functionality is similar, for example, all are mail apps; (B) their network behavior profiles are similar. If either of these two conditions is not met, the peer group will be empty and we will only compare apps' profiles with their historical data.


Figure 5: Illustration of the procedure of malicious app detection. The profile of App1, for example <12, 100, 5000>, is compared with its historical data (<11, 110, 5020>, <10, 130, 4600>, <12, 100, 5000>, <12, 101, 5010>) and with the profiles of its peer group (App2 <20, 400, 6000>, App3 <30, 500, 7000>, App4 <25, 450, 6500>, App5 <19, 370, 5580>). If the profile is similar to the history, App1 has not been injected with malicious code; if it is significantly different from its peer group (behavior-similar apps), App1 may be a fake app. The numbers in angle brackets are examples of the features listed in Table 4.

Figure 6: Example of similar apps recommended by Google Play.

In practice, one can resort to app stores such as Google Play to find candidates that meet condition A. Figure 6 shows that, when viewing the details of an app, Google Play recommends similar ones. These similar apps are determined by several features such as the category of the apps, keywords in the title and description, and the size of the apk (https://www.quora.com/How-does-the-Google-Play-Store-determine-similar-apps). Generally, these recommended apps belong to the same category and have similar functionality. Therefore, to determine the peer group of an app, we first use the similar apps recommended by app stores such as Google Play as candidates and later filter them by condition B.

For condition B, the Euclidean distance is used to measure the similarity of network behavior profiles. Suppose $A$ is the app to be analyzed and $B$ is an app satisfying condition A for $A$. The similarity of network behavior profiles between $A$ and $B$ is calculated as

$$d(A, B) = \sqrt{\left(nBP(A) - nBP(B)\right)\left(nBP(A) - nBP(B)\right)^{T}}, \tag{8}$$

where $nBP$ is the abbreviation of $\mathit{netBehaviorProfile}$ defined in (1).

Once the similarities between $A$ and the recommended apps are calculated by (8), the results are sorted in order of increasing distance (i.e., decreasing similarity). The first $g$ apps are selected as the peer group members of $A$, as shown in (9). The optimal value of $g$ can be determined experimentally, as discussed in the next section. Note that the peer group members can be updated. For example, if one of the peer group members has been flagged as malicious, it is removed from the peer group and a new one is added. AppFA can also reselect the peer groups every time interval $T$.

$$\mathrm{peerGroup}(A) := \{\mathrm{app}_1, \ldots, \mathrm{app}_g\}, \quad \text{where } d(A, \mathrm{app}_i) \le d(A, \mathrm{app}_j) \text{ if } i < j. \tag{9}$$
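The peer group selection of (8) and (9) amounts to ranking candidates by Euclidean distance and keeping the $g$ closest ones. The sketch below shows this with the example profiles of Figure 5; the function name and the dictionary layout are hypothetical, and all profiles are assumed to have already been padded to the same length.

```python
import math


def peer_group(target_profile, candidates, g):
    """Return the names of the g candidates closest to the target profile.

    `candidates` maps an app name to its profile vector.
    """
    ranked = sorted(
        candidates,
        key=lambda name: math.dist(target_profile, candidates[name]))
    return ranked[:g]


# Example profiles taken from the illustration in Figure 5.
candidates = {
    "App2": [20, 400, 6000],
    "App3": [30, 500, 7000],
    "App4": [25, 450, 6500],
    "App5": [19, 370, 5580],
}
peers = peer_group([12, 100, 5000], candidates, g=2)
```

Here App5 and App2 are the two nearest candidates, so they form the peer group for $g = 2$. Because the raw features have very different scales, the byte-volume feature dominates the distance in this toy example.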

As illustrated in Figures 2 and 5, for an identified app, AppFA first constructs its network behavior profile and then compares the profile with both its historical profiles and the profiles of its peer group to detect malicious apps. Denote $o$ as the profile of the identified app $A$ and $G$ as the matrix of the compared profiles (historical profiles or peer group profiles). Each column of $G$ is one profile, and the column length is $l$ ($l = 1 + 2(f_o + f_i) + 2(k_1 + k_2)$). AppFA makes the feature vectors the same length by padding zeros.

For peer group analysis, $o$ and $G$ are first normalized by the min-max normalization procedure defined in (10), which scales each feature by its maximum over $o$ and the compared profiles:

$$o_i = \frac{o_i}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \quad G_{ij} = \frac{G_{ij}}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \quad i = 1, 2, \ldots, l, \; j = 1, 2, \ldots, g. \tag{10}$$

Then, the distance between $o$ and $G$ is calculated by the Mahalanobis distance:

$$\tau = \sqrt{(o - m)^{T} S^{-1} (o - m)}, \tag{11}$$

where $m$ is the weighted mean vector of profiles,

$$m_i = \sum_{j=1}^{g} w_{Aj} G_{ij}, \quad 1 \le i \le l, \tag{12}$$

and $S$ is the covariance matrix, calculated as

$$S = E\left[(G - E[G])(G - E[G])^{T}\right]. \tag{13}$$

In (12), $w_{Aj}$ is the weight of the $j$th closest peer group member of the analyzed app $A$. The weights of peer group members are obtained from their proximity to $A$. In detail, we define the proximity of the $j$th closest peer group member of the target app $A$ as

$$\mathrm{prox}_{Aj} = \exp(-d(A, j)), \tag{14}$$

where $d(A, j)$ is the Euclidean distance between app $A$ and its $j$th closest peer group member, defined in (8). Based on this proximity measure, the weight of the $j$th closest peer group member of app $A$ is defined as

$$w_{Aj} = \frac{\mathrm{prox}_{Aj}}{\sum_{j'=1}^{g} \mathrm{prox}_{Aj'}}. \tag{15}$$
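The weights in (14) and (15) can be computed as in the short sketch below. The function name and the sample distances are hypothetical. One implementation detail is an assumption on our side: the smallest distance is subtracted before exponentiating. This leaves the weights of (15) mathematically unchanged, because the common factor cancels in the ratio, but it avoids a division by zero when all raw distances are so large that every exponential underflows.

```python
import math


def peer_weights(distances):
    """Equations (14)-(15): proximity exp(-d), normalised to sum to 1."""
    d_min = min(distances)
    # Shifting by d_min scales every proximity by the same factor exp(d_min).
    prox = [math.exp(-(d - d_min)) for d in distances]
    total = sum(prox)
    return [p / total for p in prox]


# Hypothetical distances to the g = 5 closest peers, in increasing order.
weights = peer_weights([0.1, 0.2, 0.4, 0.8, 1.6])
# Large unnormalised distances would underflow without the shift.
extreme = peer_weights([640.0, 1044.0, 1540.0])
```

The closest peer always receives the largest weight, and the weights decay quickly with distance.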

Finally, the state of an app is judged as follows:

$$\begin{cases} \tau > t_s, & \text{malicious app}, \\ \tau \le t_s, & \text{normal app}, \end{cases} \tag{16}$$

where $t_s$ is the threshold, which can be set to different values by network administrators based on the actual network conditions. We evaluated different values of $t_s$ in our experiments.
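The decision procedure of (10)-(13) and (16) can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the prototype's code: the function name is hypothetical, the covariance is estimated with `np.cov` (which divides by $g - 1$), and a pseudo-inverse replaces $S^{-1}$ because $S$ is singular whenever the number of compared profiles does not exceed the profile length. At least two compared profiles are assumed.

```python
import numpy as np


def is_malicious(o, G, weights, t_s):
    """Compare profile `o` with the l x g matrix `G` of compared profiles."""
    o = np.asarray(o, dtype=float)
    G = np.asarray(G, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Equation (10): scale every feature by its maximum over o and G.
    scale = np.maximum(o, G.max(axis=1))
    scale[scale == 0] = 1.0                # assumption: keep all-zero features as is
    o, G = o / scale, G / scale[:, None]
    m = G @ w                              # equation (12): weighted mean profile
    S = np.cov(G)                          # equation (13): rows are features
    diff = o - m
    # Equation (11), with a pseudo-inverse in case S is singular (assumption).
    quad = float(diff @ np.linalg.pinv(S) @ diff)
    tau = float(np.sqrt(max(quad, 0.0)))
    return tau > t_s, tau                  # equation (16)


# Historical profiles of App1 from Figure 5, one profile per column.
history = np.array([[11, 110, 5020],
                    [10, 130, 4600],
                    [12, 100, 5000],
                    [12, 101, 5010]], dtype=float).T
uniform = [0.25] * 4                       # hypothetical equal weights

flag_same, tau_same = is_malicious([12, 100, 5000], history, uniform, t_s=2.0)
flag_far, tau_far = is_malicious([40, 900, 9000], history, uniform, t_s=2.0)
```

A profile that matches the history yields a small $\tau$ and is judged normal, while the hypothetical outlier `[40, 900, 9000]` lies far from the historical spread and is flagged.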

Figure 7: Experimental setup for the detection of mobile malicious apps. Testing mobile devices (HUAWEI and MI phones) running the sample apps connect to the Internet through an Ubuntu 16.04 server, which functions as (i) access point, (ii) traffic collection, and (iii) data analysis.

6. Evaluation

6.1. Data Collection

Experimental Setup. The experimental setup used for AppFA is shown in Figure 7. We have implemented a prototype of AppFA with the help of ourmon [39] and CCCG [40]. Ourmon is an open-source network monitoring and anomaly detection system, and CCCG is a general framework for constrained clustering. An Ubuntu 16.04 computer is configured as the access point. For packet capturing, the smartphone is connected to the Internet by Wi-Fi, and network traffic is collected at the access point by tcpdump (http://www.tcpdump.org). Each pcap file is limited to 100 MB. When packet capture is complete, the TCP and UDP flows are split from the pcap files by SplitCap (https://www.netresec.com/?page=SplitCap). For apps' traffic clustering and network behavior profile construction, we use tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to extract IP addresses, packet sizes, and packet interval times from network flows. After that, a Python program calculates the statistical features listed in Tables 3 and 4. The KPCA transformation is accomplished with the help of scikit-learn (http://scikit-learn.org). The data analysis is also performed on the Ubuntu 16.04 computer (with 4 GB memory and a Pentium Dual-Core T4500 CPU).
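As an illustration of the feature calculation step, the statistical features of Table 3 can be derived from per-packet timestamps and sizes as sketched below. This is not the program used in our prototype: the function name and the input layout (a list of `(timestamp, size)` pairs per flow) are hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute the Table 3 features for one flow.

    `packets` is a list of (timestamp_seconds, size_bytes) pairs.
    """
    packets = sorted(packets)                  # order packets by timestamp
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Interpacket times are the gaps between consecutive packets;
    # a single-packet flow gets one zero gap so the statistics are defined.
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]

    def summary(values):
        return (min(values), max(values), mean(values), pstdev(values))

    return {
        "first_packet_time": times[0],
        "num_packets": len(packets),
        "num_bytes": sum(sizes),
        "size_min_max_mean_std": summary(sizes),
        "gap_min_max_mean_std": summary(gaps),
    }


example = flow_features([(0.0, 100), (0.5, 300), (1.5, 200)])
```

The resulting dictionary holds one value (or one four-tuple) per row of Table 3 and can be flattened into the feature vector used for clustering.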

Traffic Generation. We use the public MalGenome [19] dataset in our experiments. Since we mainly focus on repackaging and updating malware, 93 typical information collection malware samples, including repackaging and updating attack types, are selected from MalGenome to test the detection rate of AppFA (the malicious apps dataset was downloaded from http://www.malgenomeproject.org in 2015; the selected samples are the ones that run without any errors). We installed all these malicious apps on a HUAWEI Honor 8 and a MI Note 4 phone and ran them one by one 50 times to collect network traffic. In detail, these malware samples are analyzed and run by GroddDroid [41] to make sure that the malicious code is triggered. We also use GroddDroid to run the other malware samples besides the selected ones; their traffic is mainly used in the local detection described in Section 6.3. In particular, background traffic including weather forecasting, email checking, QQ, and Webchat (QQ and Webchat are the most famous apps in China) is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

To test the false positive rates of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware, the benign apps are run by droidbot [42] to make sure that enough traffic is generated. From the captured traffic, we take all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the packet filter model is set to $-1$.

The data used in the experiments is summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware samples; thus it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is used to get the ground truth of the originator of network traffic. Recall that, in Algorithm 1, app traffic clustering must be started with signature seeds. To obtain apps' signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of the cluster number $p$ in Algorithm 2 is set to 97 (97 = 93 + 4), since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and Webchat. The experimental results are shown in Table 6.

As shown in Table 6, when 5 HTTP flows are chosen for each app to generate signature seeds, only 601 of the 223220 flows (about 0.27%) are misclustered. This demonstrates the effectiveness of our proposed method for app traffic clustering.
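The misclustering rates implied by Table 6 can be checked with a few lines of arithmetic. The snippet below is only a convenience for the reader; the variable and function names are hypothetical, and the counts are those reported in Table 6.

```python
# (seed flows per app, correctly clustered flows, misclustered flows), from Table 6
TABLE_6 = [
    (2, 201357, 21863),
    (3, 218946, 4274),
    (4, 221384, 1836),
    (5, 222619, 601),
]
TOTAL_FLOWS = 223220  # number of flows in TrafficSet 1


def misclustered_rates(rows, total):
    """Return {seed flows: misclustered fraction} after a consistency check."""
    rates = {}
    for seeds, correct, wrong in rows:
        # Every flow is either correctly clustered or misclustered.
        assert correct + wrong == total
        rates[seeds] = wrong / total
    return rates


rates = misclustered_rates(TABLE_6, TOTAL_FLOWS)
for seeds, rate in rates.items():
    print(f"{seeds} seed flows per app: {100 * rate:.2f}% misclustered")
```

The rate drops from roughly one flow in ten with 2 seed flows to well under 1% with 4 or 5 seed flows.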

Experimental Results of Malicious App Detection. Similar to previous work, to measure the effectiveness of malicious app detection, the accuracy (detection rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively:

$$\text{Accuracy} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{17}$$

$$\text{False positive rate} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}. \tag{18}$$
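For reference, (17) and (18) translate directly into code. The function names are hypothetical; in the example, 82 detected samples out of 93 corresponds to the case $p = 99$ discussed later in this section, while the false positive counts are made-up numbers chosen only to illustrate a rate of 0.4%.

```python
def detection_rate(tp, fn):
    """Equation (17): TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Equation (18): FP / (FP + TN)."""
    return fp / (fp + tn)


rate = detection_rate(tp=82, fn=93 - 82)       # 82 of 93 malicious apps detected
fpr = false_positive_rate(fp=2, tn=498)        # hypothetical counts
```

Note that (17) is computed over malicious samples only, so it coincides with what is often called recall or true positive rate.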

We first examine the detection rates and false positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). For the selected malicious apps, the apps with the same functionality (satisfying condition A) are determined by Google Play. We first analyze the malicious apps' functionalities manually and choose a keyword for each app. Then

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). Curves: $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$; x-axis: $t_s$ (0.4 to 2); y-axis: detection rate (0.5 to 1).

we search the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of the member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8, and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are good enough for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, the detection rates and the false positive rates change only slightly as $g$ increases. This is because the Mahalanobis distance $\tau$ defined in (11) is calculated from the weighted mean vector of profiles: the most similar app has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.

Further, AppFA is also tested with different values of the cluster number $p$. Note that the actual cluster number is 97 for TrafficSet 1. To evaluate how app traffic clustering impacts the detection of repackaged malware, we slightly change the value of $p$; the experimental results are shown in Figure 12. In the experiments, we set $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. As shown in Figure 12, the cluster number indeed impacts the detection rate of AppFA. In particular, when $p = 99$, the detection rate is only 88%; namely, only 82


Table 5: Data summary.

| Traffic set | Apps | Quantity | Number of collected flows (TCP and UDP) | Amount of data (bytes) |
| --- | --- | --- | --- | --- |
| TrafficSet 1 | Selected malicious apps (for testing accuracy of traffic clustering and detection rates) | 93 | 223220 | 538G |
| TrafficSet 2 | Benign apps (for testing false positive rates of AppFA) | 100 | 305103 | 640G |
| TrafficSet 3 | All malicious apps (for malicious apps detection in local networks) | 1260 | ≈3.1 million | 1872G |

Table 6: Experimental results of traffic clustering.

| Number of HTTP flows for generating signature seeds | Number of flows correctly clustered | Number of misclustered flows |
| --- | --- | --- |
| 2 | 201357 | 21863 |
| 3 | 218946 | 4274 |
| 4 | 221384 | 1836 |
| 5 | 222619 | 601 |

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). Curves: $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$; x-axis: $t_s$ (0.4 to 2); y-axis: false positive rate (0.0005 to 0.004).

malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and 96 in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA can tolerate a few unknown apps when carrying out app traffic clustering.

Finally, we evaluate AppFA with the remaining malicious apps besides the selected 93 samples. The rest of the malicious apps contain repackaged and other types of malware. These samples were found to have some errors when run on our phones; however, they also produced considerable network traffic. We test AppFA with these apps to examine its general capability. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results

Figure 10: Detection rates of AppFA with different values of $g$. x-axis: $g$ (1, 3, 5, 7, 10); y-axis: detection rate (0 to 1).

Figure 11: False positive rates of AppFA with different values of $g$. x-axis: $g$ (1, 3, 5, 7, 10); y-axis: false positive rate ($\times 10^{-3}$, 0 to 7).

displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons will be discussed in Section 6.4.

Note that AppFA is designed to perform nearly real-time malicious app detection; thus efficiency is also a big concern. Table 8 presents the computational performance for major


Figure 12: Detection rates of AppFA with different values of $p$. x-axis: value of $p$ (99, 98, 96, 95); y-axis: detection rate (0 to 1).

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

| Detection rate | False positive rate |
| --- | --- |
| 73.4% | 3.5% |

Table 8: The computational performance.

| Procedures | Time |
| --- | --- |
| Constrained flow clustering (Algorithm 2) | 3 min |
| App identification (Algorithm 1) | 5.2 min |
| Network behavior profile construction | 1.7 min |
| Peer group analysis | 45 sec |

steps in terms of average running time on the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, for example on cloud resources, can further be used to speed up malicious app detection and to scale up to more Internet traffic data.

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses statistical traffic features to detect malicious Android apps. Thus, AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we have created a new Android repackaged malware sample that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot's source code so that its network connections are encrypted by the TLS protocol. Then we graft it into a popular game app, aircomaceviralmotox3. Finally, we collect network traffic of the repackaged aircomaceviralmotox3 and its peer group. In the experiments

Table 9: Experimental results when searching similar apps locally.

| Detection rate | False positive rate |
| --- | --- |
| 69.2% | 4.7% |

the selected peer group of the repackaged aircomaceviralmotox3 is comwordmobilesbikeRacing, comtopfreegamesbikeracefreeworld, comskgamestrafficrider, comtomicowheeliechallenge, and madmotoracingbike. We repeat the detection of the repackaged aircomaceviralmotox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps' peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed and shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities, so their context (keywords) may be similar. Therefore, we use web-page searching technologies to match similar apps. The $\mathit{Matching}$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) from the results returned by Algorithm 3.
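One possible realization of the matching step is sketched below in plain Python. It is an illustration under assumptions, not the implementation evaluated in this paper: the function names, the smoothed IDF formula, the use of cosine similarity as the matching score, and the keyword lists are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each keyword list to a {keyword: TF-IDF weight} dictionary."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF keeps weights positive for very common keywords.
            word: (count / len(doc)) * (math.log((1 + n) / (1 + df[word])) + 1)
            for word, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def similar_apps(target_keywords, identified, m):
    """Return the m identified apps whose keywords best match the target."""
    names = list(identified)
    vectors = tfidf_vectors([target_keywords] + [identified[n] for n in names])
    scores = [(cosine(vectors[0], vec), name)
              for vec, name in zip(vectors[1:], names)]
    scores.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scores[:m]]


# Hypothetical keywords extracted from HTTP headers.
identified = {
    "mailA": ["mail", "smtp", "imap"],
    "game": ["score", "level", "ads"],
    "mailB": ["mail", "inbox"],
}
matches = similar_apps(["mail", "inbox", "smtp"], identified, m=2)
```

In this toy example, the two mail apps share keywords with the target and are returned, while the game app has no keyword in common and scores zero.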

We have implemented the local similar app searching and evaluated AppFA's performance in local networks. All apps, including all malicious and benign apps, are considered identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that the choice of proper peer groups affects the final detection significantly.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be because other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to the packet sizes and the number of flows. Therefore, the detection rate may decline if the apps' network behaviors change only slightly.

Compared to the latest work in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9%, as reported in [6]). However, Garg et al.'s method [6] considered only 18 malware apps and 14 genuine apps and assumed


(1) extract all plaintext keywords from the HTTP headers of the tested app $A$; denote the set of keywords as $W(A)$
(2) extract all plaintext keywords from the HTTP headers of the other identified apps; denote the set of keywords of identified app $i$ as $W(i)$, $i = 0, 1, \ldots, n - 1$
(3) $i = 0$; while $i < n$ do
(4)     $\mathit{Scores}[i].\mathit{value} = \mathit{Matching}(W(A), W(i))$
(5)     $\mathit{Scores}[i].\mathit{app} = i$
(6)     $i{+}{+}$
(7) end while
(8) sort $\mathit{Scores}[\,]$ by $\mathit{Scores}[i].\mathit{value}$ in decreasing order
(9) return the first $m$ entries $\mathit{Scores}[i].\mathit{app}$, $i = 0, 1, \ldots, m - 1$

Algorithm 3: Searching similar apps locally.

Table 10: Experimental results when it is exactly known which traffic is generated by which app. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

| | Selected malicious apps | Remaining malicious apps |
| --- | --- | --- |
| Detection rate | 98.3% | 85% |
| False positive rate | 0.03% | 2.7% |

that the traffic generated by each app was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, a much larger sample set is taken into account, and constrained clustering is used to determine to which app the flows belong. The accuracy of clustering is therefore also a factor affecting the detection rate. To confirm this, we further make the same assumption as Garg et al.'s method [6]; namely, we assume the accuracy of Algorithm 1 is 100% and know exactly which traffic is generated by which app (in our experiments, we use the app Packet Capture to obtain ground truth of which flows came from which app). The experimental results under this assumption are given in Table 10. The detection rates are clearly improved.

In practice, one can deploy both our method and other methods such as [6, 15] simultaneously. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps. In fact, [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. Therefore, our work can complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features (refer to Table 4) of network traffic to detect Android malicious repackaged applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. However, this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus the attackers must remove some normal network connections to keep the network behaviors the same and evade detection. However, the removal of normal network connections will impact the functionalities of apps and may cause errors, which gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may be abnormal as well. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then, Kernel Principal Component Analysis is employed to construct the app network behavior profile and to distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps, which avoids time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus, it is convenient to use and its cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The apps' network behaviors may be significantly changed by version updates and thus cause false positives. This needs further investigation. In future work, AppFA will be extended to include more network traffic features and to detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61702282 and 61502250, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016, CX2016026).


References

[1] Android statistics: Number of android applications, September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, “Assessing Privacy Risks in Android: A User-Centric Approach,” in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] Does your mobile anti-virus app protect or infect you? The truth behind du antivirus security, September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, “A survey of mobile malware in the wild,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, “The evolution of android malware and android analysis techniques,” ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, “Network-based detection of Android malicious apps,” International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, “Mobile malware detection through analysis of deviations in application network behavior,” Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, “PBMDS: a behavior-based malware detection system for cellphone devices,” in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec ’10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, “Crowdroid: behavior-based malware detection system for android,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, “MADAM: a multi-level anomaly detector for android malware,” in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., “Vetting undesirable behaviors in android apps with permission use analysis,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, “Android Malware Clustering Analysis on Network-Level Behavior,” in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, “Malware detection on mobile devices using distributed machine learning,” in Proceedings of the 20th International Conference on Pattern Recognition (ICPR ’10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, “Bring your own device security issues and challenges,” in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, “Simple and effective method for detecting abnormal internet behaviors of mobile devices,” Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] Global mobile OS market share in sales to end users from 1st quarter 2009 to 1st quarter 2017, 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] “IT threat evolution Q3 2017 statistics,” November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., “Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, “Dissecting android malware: characterization and evolution,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, “AndroSimilar: Robust signature for detecting variants of Android malware,” Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the anserverbot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., “Android security: a survey of issues, malware penetration, and defenses,” IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, “A survey on dynamic mobile malware detection,” Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, “Behavior-based malware detection on mobile phone,” in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, “Exposing mobile malware from the inside (or what is your mobile app really doing?),” Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, “The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones,” in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, “Identifying diverse usage behaviors of smartphone apps,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC ’11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, “NetworkProfiler: Towards automatic fingerprinting of Android apps,” in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., “Flowr: A self-learning system for classifying mobile application traffic,” in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., “Automatic generation of mobile app signatures from traffic observations,” in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., “Automatically identifying apps in mobile traffic,” Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, “Identifying Mobile Applications for Encrypted Network Traffic,” in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, “A novel active website fingerprinting attack against Tor anonymous system,” in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, “A novel application classification attack against Tor,” Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, “I know why you went to the clinic: Risks and realization of HTTPS traffic analysis,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, “Internet traffic classification using constrained clustering,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, “A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine,” Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, “Stock fraud detection using peer group analysis,” Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] “Ourmon network monitoring and anomaly detection system,” April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, “Constrained Clustering Using Column Generation,” in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, “GroddDroid: A gorilla for triggering malicious behaviors,” in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI-guided test input generator for android,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.


As illustrated in [14], it is hard to ensure that all mobile devices have installed information collection programs, and it is impractical to manually audit every employee’s personal device due to privacy concerns and the large number of mobile devices.

In this paper, we propose AppFA (App Flow Analysis), a novel approach to detect malicious Android apps from network traffic. In contrast to client-side and server-side methods, AppFA is implemented at the network level, and thus it is a new kind of network-side detection approach. Notably, AppFA does not need to install programs or modify operating systems to extract detection features; therefore, it is lightweight and easy to deploy. We note that the method in [15] is also a network-side approach, but it only analyzes HTTP traffic and cannot be applied to encrypted network traffic. Previous works [6, 7] have also proposed methods to detect mobile malware from network traffic. These methods need offline training and must install programs (such as a VPN proxy) on mobile devices to learn exactly which flows come from which app for traffic labelling. Thus, they still belong to the client-side or server-side approaches.

In this work, AppFA can analyze encrypted network traffic through network behavior profile construction. The apps’ traffic is clustered by constrained clustering; thus, we do not need to install programs on mobile devices to determine the origin of app traffic. We also use peer group analysis to avoid offline model training. Our main contributions are the following:

(i) Providing a lightweight and efficient framework for detecting malicious Android apps on the network.

(ii) Proposing an efficient algorithm for clustering mobile apps’ network traffic.

(iii) Outlining a method for detecting malicious apps by constructing network behavior profiles and using peer group analysis.

(iv) Carrying out extensive experiments with a public dataset.

The rest of this work is structured as follows. The motivation of our work is presented in Section 2. In Section 3, we discuss relevant related work in detail. Section 4 introduces the architecture and main components of AppFA. The methodology is presented in Section 5. Experimental evaluation and discussion of the proposed methods are presented in Section 6. Finally, in Section 7, we conclude the paper with a discussion of potential future work.

2. Motivation and Observations

As investigated by Statista, Android accounted for more than 86% of the global mobile OS market up to the 1st quarter of 2017 [16]. The popularity of Android devices makes the platform a desirable target. For example, the top 20 mobile malware programs are all related to Android [17]. One of the reasons for the prevalence of Android malware may be that Android app package elements can easily be modified by third parties [18]. With open-source tools such as apktool (https://github.com/iBotPeaches/Apktool) and jadx (https://github.com/skylot/jadx), malware writers can easily graft malicious code onto popular apps to ensure a wide diffusion of their malicious code. As evidence, in MalGenome [19], a reference dataset in the Android security community that is also used in our experiments, 80% of the malicious samples are known to be built by repackaging other apps. Therefore, in this work we mainly focus on the detection of malicious repackaged Android apps.

It is popular to detect Android malware by code and resource file analysis [20], while very few studies consider malicious Android app detection at the network level. In order to detect Android malware on the network, we have analyzed the network traffic of Android apps carefully and obtained several observations. The first observation is that the network behaviors of repackaged apps are significantly different from those of their original versions. This observation is also validated by Shabtai et al. [7]. Taking the Android malware AnserverBot [21] as an example, the network behaviors of the repackaged app and the original app (com.camelgames.mxmotor) are compared in Table 1.

For comparison, the two apps were each run on a real phone, and the network traffic of the first 5 minutes was collected. After that, the total packet sizes and the number of TCP connections were calculated as network behaviors. As shown in Table 1, there is a clear difference between the network behaviors of the repackaged and original apps; in particular, the total packet sizes of the repackaged app are significantly larger than those of the original app (23Mb versus 1932 kb).

We further compared apps’ network behaviors with those of similar apps. Again taking com.camelgames.mxmotor as the example, we selected 3 similar apps from the Google Play store. The selection strategy is as follows: we first searched for the keyword “moto” in the Google Play store and then selected the top 3 game apps from the search results as the similar ones. The network behaviors of these apps are compared in Table 2. Their network behaviors are close to each other. This phenomenon of similar apps having similar network behaviors is also validated in [7]. These preliminary observations of the network behaviors of Android apps give us clues for the detection of repackaged Android malware.

We also observed that more and more Android apps adopt encrypted network connections to transmit data between smart devices and remote servers. For instance, com.baidu.BaiduMap generates several SSL flows at the startup stage, as shown in Figure 1. Therefore, methods for detecting repackaged Android malware should handle encrypted network traffic.

Based on the observations above, we propose to detect malicious Android apps by comparing apps’ network behaviors with their historical data and with those of their similar apps. Comparison with the historical data can help us find self-updating malicious apps, and comparison with the behaviors of similar apps can detect other types of repackaged malware. Self-updating is a new technique for repackaging apps, and it cannot be detected by applying regular static or dynamic analysis methods [7]. The details of network behavior construction and similar app selection are illustrated in Section 5. Meanwhile, a novel constrained clustering algorithm is elaborated for app traffic clustering. Thus, our method can be applied on the network straightforwardly and does not need to install programs on mobile devices to collect flow information.

Figure 1: Example of network flows of the Android app com.baidu.BaiduMap. The traffic is captured and analyzed by NetworkMiner. Note that there are several encrypted SSL flows and that they outnumber the HTTP flows.

Table 1: Comparison of network behaviors of the original app com.camelgames.mxmotor and its repackaged version (repackaged by AnserverBot).

App               Total packet sizes    Number of TCP connections
Repackaged app    23Mb                  18
Original app      1932 kb               11

Table 2: Comparison of network behaviors of similar apps.

App                                   Total packet sizes    Number of TCP connections
com.camelgames.mxmotor                1932 kb               11
air.com.aceviral.motox3m              1828 kb               14
com.skgames.trafficrider              1751 kb               13
com.topfreegames.bikeracefreeworld    1327 kb               9

3. Related Work

There has been extensive work on detecting malicious mobile apps. The works in [4, 5, 22, 23] survey mobile malware in the wild and the techniques proposed for detecting it. In this section, we mainly focus on behavior-based malware detection methods and only review the most closely related ones.

Generally, current behavior-based mobile malware detection approaches can be categorized into two main groups: client-side and server-side detection. Client-side detection approaches run locally and apply anomaly detection methods to a set of features that indicate the state of the app. pBMDS [8] is based on correlating user inputs with system calls to detect anomalous activities. A Hidden Markov Model (HMM) is used to learn application and user behaviors from two major aspects: process state transitions and user operational patterns. Built upon these two aspects, pBMDS identifies behavioral differences between user-initiated applications and malware-compromised ones. Zhang et al. [11] combined dynamic tracing of the permission requests for resource usage by applications with tracking of sensitive operations on the granted resources (using taint tracking). This combination enabled them to understand how applications utilize permissions to access sensitive system resources. Dai et al. [24] presented a malware detection system for the Windows Mobile platform. They used API interception techniques to monitor and analyze an application’s behavior and compared it to the patterns within a predefined library of malicious behavior characteristics. Shabtai et al. [7] presented a behavior-based anomaly detection system for detecting meaningful deviations in a mobile application’s network behavior. A semisupervised C4.5 Decision Tree algorithm was used for learning the normal behavioral patterns and for detecting deviations from the application’s expected behavior. Their methods were implemented and evaluated on Android devices. Damopoulos et al. [25] proposed a fully fledged tool able to dynamically analyze any iOS software in terms of method invocation, which can be used to trace the software’s behavior and decide whether it contains malicious code.

Server-side detection approaches are carried out on remote servers, mainly motivated by the limited computational resources of mobile devices [26]. Burguera et al. [9] developed an Android framework named “Crowdroid” that includes a client application installed on the device. The application monitors Linux kernel system calls and sends them to a centralized server after preprocessing. On the server, a dataset is built from the list of system calls, the list of running applications, and the device information. The $K$-means algorithm is then used to cluster the applications into two groups, that is, benign and malware applications. Shamili et al. [13] utilize a distributed Support Vector Machine algorithm for malware detection on a network of mobile devices. Features related to phone calls, SMSs, and data communication are used for detection. During the training phase, support vectors (SVs) are learned locally on each device and then sent to the server, where the SVs from all of the client devices are aggregated. Finally, the server distributes the whole set of SVs to all the clients, and each client updates its own SVs.

Unlike the work mentioned above, our proposed system runs at the network level directly, without necessarily having access to the mobile devices. The most similar work was carried out by Chen et al. [15]. Their method was also implemented at the network level and identified abnormal network behaviors through a 3-step check: (1) identifying HTTP POST and HTTP GET packets, (2) checking whether the device was exposing unique device identifiers such as IMEI and IMSI, and (3) determining the legitimacy of the remote server by querying the domain name server. Their method can therefore only be applied to HTTP traffic. In contrast to the work of Chen et al. [15], we use a combination of signature matching and constrained mobile network traffic clustering for app identification, and thus our method is suitable for both plaintext and ciphertext traffic such as HTTPS. The work introduced in [6] is also intended to detect Android malware from network traffic. In Garg et al. [6], detection features were first extracted from DNS, HTTP, and TCP traffic, and then machine learning algorithms (such as Decision Trees, Bayesian Networks, and Random Forests) were used to detect malicious mobile apps. However, their detection features were obtained on the mobile device, and they needed to train classification models offline. In AppFA, we extract detection features directly on the network to build apps’ network behavior profiles and use peer group analysis to avoid offline model training.

Figure 2: Main components of AppFA: Packet Filter, Session Builder (flow sessions and app sessions), Packet Content Extractor, Basic Feature Extractor, Identification Feature Generator, App Traffic Clustering, Profile Feature Generator, App Behavior Profile Constructor, and Malicious App Detection.

Recently, an increasing number of research works have analyzed network traffic to identify apps, such as [27–32]. However, most of them focus on plaintext flows (e.g., HTTP) and try to collect identification features from HTTP headers. These methods may fail due to the emergence of encrypted network traffic. In this paper, we investigate a new constrained clustering method to cluster the network traffic generated by the same app.

4. System Design

The system architecture of AppFA is shown in Figure 2. It has a modular design, and each module has a specific function. The Packet Filter module captures link-layer frames or reads them from a file and filters them according to configurable rules. In order to detect malicious apps online and provide support for network management, AppFA is designed to analyze the first $n$ nonzero packets (packets that contain application data) of each network flow. The parameter $n$ is configurable for different network management purposes. For example, the value of $n$ can be set to a small positive number such as 50 for real-time malicious app detection and to $-1$ for full analysis, namely, considering all nonzero packets in flows. In AppFA, the Packet Filter module can be implemented based on a well-known library such as libpcap (http://www.tcpdump.org/release).
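The role of the parameter $n$ can be illustrated with a minimal Python sketch. It assumes a packet is already available as a (direction, payload length) pair; this shape and the function name are hypothetical and are not taken from the AppFA implementation.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple

# Hypothetical packet shape for illustration: (direction, payload_length).
Packet = Tuple[int, int]

def first_nonzero_packets(packets: Iterable[Packet], n: int) -> Iterator[Packet]:
    """Return an iterator over the first n packets that carry application data.

    n = -1 means "no limit": every nonzero packet of the flow is kept.
    """
    nonzero = (p for p in packets if p[1] > 0)  # drop packets with empty payload
    return nonzero if n == -1 else islice(nonzero, n)

flow = [(+1, 0), (+1, 50), (-1, 0), (+1, 100), (-1, 1300), (-1, 400)]
print(list(first_nonzero_packets(flow, 2)))
print(len(list(first_nonzero_packets(flow, -1))))
```

A small $n$ bounds the work per flow, which is what makes the real-time setting practical, while $n = -1$ trades latency for a complete view of the flow.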

The Session Builder module organizes network traffic into sessions. For app identification and malicious behavior detection, we define two types of sessions: flow session and app session. A flow session is defined by the {source IP, source port, destination IP, destination port, transport protocol} tuple, where source and destination can be swapped and the transport protocol is mainly TCP or UDP in this work. In deployment, AppFA determines that a flow session is completed when one of the following three conditions holds: (1) $n$ nonzero packets have been received; (2) an RST/FIN packet is detected; (3) a timeout occurs, for example, no packet is exchanged for 3 minutes. With flow sessions, one can extract basic features and packet contents efficiently for app identification and malicious behavior detection. An app session is defined as the set of all flow sessions generated by the same app: {flow session 1, flow session 2, ..., flow session $m$}. The app sessions constructed by the Session Builder module are used by the Profile Feature Generator module to construct app network behavior profiles, as depicted on the right of Figure 2.
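The flow session definition and its three completion conditions can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the names `flow_key` and `FlowSession` are hypothetical, timestamps are in seconds, and the caller is assumed to have already parsed the TCP flags.

```python
from dataclasses import dataclass
from typing import Tuple

Endpoint = Tuple[str, int]  # (IP address, port)

def flow_key(src: Endpoint, dst: Endpoint, proto: str) -> Tuple[Endpoint, Endpoint, str]:
    """Return the same key for both directions of a connection."""
    first, second = sorted([src, dst])  # order endpoints so swapping has no effect
    return (first, second, proto.upper())

@dataclass
class FlowSession:
    n: int = 50                  # nonzero-packet budget; -1 disables the limit
    timeout_s: float = 180.0     # idle timeout, 3 minutes as in the text
    nonzero_seen: int = 0
    last_seen: float = 0.0
    saw_rst_or_fin: bool = False

    def observe(self, ts: float, payload_len: int, rst_or_fin: bool = False) -> None:
        """Record one packet of this flow."""
        self.last_seen = ts
        if payload_len > 0:
            self.nonzero_seen += 1
        self.saw_rst_or_fin = self.saw_rst_or_fin or rst_or_fin

    def is_complete(self, now: float) -> bool:
        """True if any of the three completion conditions holds."""
        budget_hit = self.n != -1 and self.nonzero_seen >= self.n
        idle = now - self.last_seen >= self.timeout_s
        return budget_hit or self.saw_rst_or_fin or idle

k1 = flow_key(("10.0.0.5", 40122), ("93.184.216.34", 443), "tcp")
k2 = flow_key(("93.184.216.34", 443), ("10.0.0.5", 40122), "TCP")
print(k1 == k2)
```

Sorting the two endpoints is one simple way to make the key direction-independent; any other canonical ordering would serve the same purpose.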

Figure 3: Example of key-value pairs.

The Basic Feature Extractor module extracts basic packet features from flow sessions, which include packet size, packet interarrival time, packet order, and packet direction. The packet direction feature distinguishes outgoing from incoming packets. Other advanced features such as flow duration, total packets sent and received, and burst sizes, which are commonly used in traffic analysis [33–35], can be calculated from these basic features. We therefore extract these basic features first and then generate appropriate advanced features (identification or detection features) for different traffic analysis purposes (clustering or peer group comparison). After basic feature extraction, a flow may look like {+50, *30, +100, *500, -1300, *20, -400}, where outgoing and incoming packet sizes are denoted by positive and negative signs, respectively, and packet interarrival times are marked by an asterisk. The packet order is reflected inherently by the sequence. For example, in the flow {+50, *30, +100, *500, -1300, *20, -400}, the first packet is an outgoing packet whose size is 50 bytes, the second packet is also an outgoing packet, and the time interval between them is 30 ms. The third packet is an incoming packet whose size is 1300 bytes. The time interval between the second and the third packet is 500 ms, and so forth.
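To make the encoding concrete, the following Python sketch decodes such a sequence into per-packet records. The token format (strings such as "+50" and "*30") and the function name `decode_flow` are assumptions made for illustration; the paper does not specify how the sequence is stored.

```python
from typing import List, Tuple

def decode_flow(tokens: List[str]) -> List[Tuple[str, int, int]]:
    """Decode a basic-feature sequence into (direction, size_bytes, gap_ms).

    gap_ms is the interarrival time since the previous packet (0 for the first).
    """
    packets = []
    gap_ms = 0
    for tok in tokens:
        if tok.startswith("*"):
            gap_ms = int(tok[1:])          # interarrival time before the next packet
        else:
            size = int(tok)                # int() accepts a leading "+" or "-"
            direction = "out" if size > 0 else "in"
            packets.append((direction, abs(size), gap_ms))
            gap_ms = 0
    return packets

example = ["+50", "*30", "+100", "*500", "-1300", "*20", "-400"]
for record in decode_flow(example):
    print(record)
```

The position of each record in the returned list carries the packet order, so no separate order field is needed.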

Note that the packet contents are also saved by the Packet Content Extractor module for app identification. Similar to [30], we focus on key-value pairs in HTTP headers. In detail, justniffer (http://justniffer.sourceforge.net) is first used to transform the raw packet traffic into HTTP messages. Then the HTTP messages are tokenized by several tokenizers, such as space, “\r\n”, “?”, and “&”, and each HTTP request is broken into various parts including method, page, and query. Finally, queries are divided into key-value pairs. Figure 3 is an example of the key-value pairs used in our experiments.
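The splitting of a request into method, page, and key-value pairs can be sketched with the Python standard library. This is not the justniffer-based pipeline described above: it assumes the HTTP request line has already been recovered, and the request shown is a made-up example.

```python
from urllib.parse import parse_qsl, urlsplit

def parse_request_line(line: str) -> dict:
    """Split an HTTP request line into method, page, and query key-value pairs."""
    method, target, _version = line.split(" ", 2)   # tokenize on spaces
    parts = urlsplit(target)                        # separates page and query at "?"
    pairs = parse_qsl(parts.query, keep_blank_values=True)  # splits on "&" and "="
    return {"method": method, "page": parts.path, "pairs": pairs}

# Hypothetical request, for illustration only.
req = "GET /ads/fetch?app_id=com.example.game&sdk=2.1&fmt=json HTTP/1.1"
parsed = parse_request_line(req)
print(parsed["method"], parsed["page"])
print(parsed["pairs"])
```

Pairs such as an app identifier in an advertising request are the kind of plaintext token that can serve as an app signature.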

After obtaining the basic features from flows, we generate identification features in the Identification Feature Generator module. We then identify the app for each flow through the App Traffic Clustering module, which returns the clustering results to the Session Builder module for forming app sessions. With the app session information, detection features are created in the Profile Feature Generator module. Finally, app network behavior profiles are constructed in the App Behavior Profile Constructor module, and malicious apps are detected in the Malicious App Detection module. The independent treatment of the different functional modules keeps the AppFA architecture clear and scalable.

5. Methodology

As depicted in Figure 2, app traffic clustering and malicious app detection (including identification/detection feature generation) are the core components of AppFA; their technical details are clarified in the following subsections.

5.1. App Traffic Clustering. The basic idea of our app traffic clustering method is illustrated in Figure 4. In the figure, there are two apps, whose flow sessions are represented by solid and dotted lines, respectively. For each flow, signature matching is first used to identify HTTP flows that have recognized signatures. Note that in this paper we use the term signature for plaintext matching and feature for traffic analysis. With this step, two flows are identified (labelled red and green), as shown in the left dashed box of Figure 4. After signature matching, a constrained clustering algorithm (the second dashed box in Figure 4) is used to cluster all flows. Compared to ordinary clustering algorithms, a constrained clustering algorithm uses background information (i.e., the identified flows) to improve clustering accuracy. By constrained clustering, flows such as encrypted traffic that cannot be identified by signatures are assigned to the appropriate clusters, that is, apps. Finally, we obtain app sessions that are used to create apps’ network behavior profiles, as depicted in the last step of Figure 4.

Formally, the entire app identification procedure is described in Algorithm 1. In the while loop, the clustering signal can be a timeout for cyclic identification (e.g., every 1 hour) or the completion of an app session for real-time identification. For the details of selecting the initial signature seeds $S_d$, one can refer to [30].

Algorithm 1 uses a method similar to the FLOWR system [30] to carry out signature matching (line (6)) and takes advantage of the constrained $K$-means clustering algorithm [36] for flow clustering (line (7)). For signature matching, the key-value pairs in HTTP headers are considered as an app’s signatures, and an initial set of seeding app signatures is set up to bootstrap the learning of new ones. Compared to FLOWR, we refine the process of counting cooccurrences of app signatures using the constrained clustering results. In FLOWR, if the start time of $\mathit{flow}_2$ is less than $T$ seconds after the start time of $\mathit{flow}_1$, their signatures are considered a cooccurrence instance. However, as noted in [30], if $T$ is overestimated, FLOWR is more likely to mix flows from different apps, thus inducing noise and overutilizing system resources. To overcome this problem, AppFA considers the clustering results in addition to temporal information when counting cooccurrences of app signatures. That is, only flows that occur within $T$ seconds of each other and belong to the same cluster are counted as a cooccurrence. This reduces noise, since the flows are filtered by constrained clustering.
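The refined cooccurrence rule can be sketched in a few lines of Python. The flow representation (a dict with `start`, `cluster`, and `signature` fields) and the pairwise counting are assumptions made for illustration; neither FLOWR's nor AppFA's actual data structures are reproduced here.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List

def count_cooccurrences(flows: List[Dict], T: float) -> Counter:
    """Count signature pairs whose flows start within T seconds of each other
    AND were placed in the same cluster by constrained clustering."""
    counts: Counter = Counter()
    ordered = sorted(flows, key=lambda f: f["start"])
    for f1, f2 in combinations(ordered, 2):        # f1 starts no later than f2
        close_in_time = f2["start"] - f1["start"] < T
        same_cluster = f1["cluster"] == f2["cluster"]
        if close_in_time and same_cluster and f1["signature"] != f2["signature"]:
            counts[frozenset((f1["signature"], f2["signature"]))] += 1
    return counts

flows = [
    {"start": 0.0, "cluster": 0, "signature": "k1=v1"},
    {"start": 5.0, "cluster": 0, "signature": "k2=v2"},
    {"start": 6.0, "cluster": 1, "signature": "k3=v3"},  # close in time, other cluster
]
print(count_cooccurrences(flows, T=10.0))
```

With the time window alone, all three signature pairs in this toy input would be counted; the cluster check keeps only the pair from cluster 0.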

After signature matching, constrained $K$-means clustering is carried out to identify the remaining unknown flows. Algorithm 2 shows the constrained clustering algorithm used in AppFA. In constrained flow clustering, the flows identified by signature matching that belong to the same app must be clustered into one cluster (must-link constraints, lines (4)–(8) in Algorithm 2), and those generated by different apps must be clustered into different clusters (cannot-link constraints, lines (9)–(13) in Algorithm 2). The clustering features are listed in Table 3 (11 features in total). We select the time of the first packet sent as one of the features because flows observed within short time intervals are likely to come from the same app [32]. The other features are chosen because they have proved effective in clustering network traffic [36].
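As an illustration of the 11 identification features, the following Python sketch builds the feature vector of one flow from its packet timestamps and sizes. The function names are hypothetical, and the use of the population standard deviation and of zeros for a flow with no interpacket gaps are assumptions; the paper does not state these details.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def _stats(values: Sequence[float]) -> List[float]:
    """Min, max, mean, and (population) std. dev.; zeros if there are no values."""
    if not values:
        return [0.0, 0.0, 0.0, 0.0]
    return [float(min(values)), float(max(values)),
            float(mean(values)), float(pstdev(values))]

def identification_features(timestamps: Sequence[float], sizes: Sequence[int]) -> List[float]:
    """Return the 11-element identification feature vector of one flow."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return ([float(timestamps[0]),      # time of the first packet sent
             float(len(sizes)),         # number of packets
             float(sum(sizes))]         # volume of bytes
            + _stats(sizes)             # packet size statistics
            + _stats(gaps))             # interpacket time statistics

# Timestamps in ms, sizes in bytes (illustrative values).
vec = identification_features([0, 30, 530, 550], [50, 100, 1300, 400])
print(len(vec), vec[:3])
```

In practice the features would likely need rescaling before clustering, since the first-packet time and the byte volume live on very different scales.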

In Algorithm 2 we set the value of cluster number 119901 to beequal to the number of apps identified by signaturematchingThis is because the accuracy of mobile app identification isalready higher than 95 [29 32] and it is proper to assumethat the popular apps (malware writers usually graft somemalicious code on popular apps to ensure a wide diffusion

6 Security and Communication Networks

Flow sessionspacket featuresand contents

SignatureMatching

ConstrainedClustering

App sessions

App 1

App 2

Figure 4 Illustration of the basic idea of app identification Solid and dotted lines represent different apps and black color indicates that theflows have not been identified

(1) let 119878 denotes signature set and 119878119889 stands for the signatureseeds

(2) while the clustering signal is received do(3) if 119878 is empty then(4) 119878 = 119878119889(5) end if(6) carry out signature matching(7) carry out constrained flow clustering(8) update 119878 with the clustering results(9) end while

Algorithm 1 App traffic clustering procedure

Table 3 Identification feature set

Time of the first packet sentNumber of packetsVolume of bytesMin max mean and std dev of packet sizeMin max mean and std dev of interpacket time

Table 4 Network behavior profile feature set

Number of flowsNumber of packets of outgoing and incoming flowsVolume of bytes of outgoing and incoming flowsFirst 1198961 results of KPCA transformations of outgoing packet sizesFirst 1198962 results of KPCA transformations of incoming packet sizesFirst 1198961 results of KPCA transformations of outgoing packetintervalsFirst 1198962 results of KPCA transformations of incoming packetintervals

of their malicious code) can all be recognized. Therefore, each cluster will correspond to one app when the constrained $K$-means clustering is finished. This is appropriate for the subsequent network behavior profile construction and malicious behavior detection. In practice, if the actual number of apps is greater than $p$, namely, some apps cannot be identified by signature matching, the corresponding flows will be misclustered. This may change the observed network behaviors of apps, since unrelated flows will be included and traffic features such as the number of packets and the volume of bytes will be inflated. In this work, we use Kernel Principal Component Analysis to mitigate this problem, as described in the following subsection.
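The constrained clustering of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration in the spirit of constrained $K$-means, not the authors' implementation (which relies on CCCG [40]). In particular, where Algorithm 2 says "goto step (2)" on a constraint violation, this sketch assumes that the next closest feasible cluster is tried instead; the names `violates` and `constrained_kmeans` and the toy data are hypothetical.

```python
import math

def violates(i, cluster, assign, must_link, cannot_link):
    """True if putting flow i into `cluster` breaks a constraint with an assigned flow."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-link partner already placed elsewhere is a violation.
        if other is not None and assign.get(other, cluster) != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-link partner already in this cluster is a violation.
        if other is not None and assign.get(other) == cluster:
            return True
    return False

def constrained_kmeans(flows, centers, must_link, cannot_link, max_iter=100):
    centers = [list(c) for c in centers]
    assign = {}
    for _ in range(max_iter):
        new_assign = {}
        for i, f in enumerate(flows):
            # Assumption: try clusters from nearest to farthest, keep the first feasible one.
            order = sorted(range(len(centers)), key=lambda c: math.dist(f, centers[c]))
            for c in order:
                if not violates(i, c, new_assign, must_link, cannot_link):
                    new_assign[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for flow {i}")
        # Update each center by averaging the flows assigned to it.
        for c in range(len(centers)):
            members = [flows[i] for i, k in new_assign.items() if k == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assign == assign:  # converged: assignments did not change
            break
        assign = new_assign
    return [assign[i] for i in range(len(flows))], centers

# Toy 2-D "flows"; flows 0 and 1 share an app, flows 0 and 4 come from different apps.
flows = [[0, 0], [0, 1], [10, 10], [10, 11], [4, 4]]
labels, centers = constrained_kmeans(
    flows, centers=[[0, 0], [10, 10]], must_link=[(0, 1)], cannot_link=[(0, 4)])
print(labels)
```

In the toy run, flow 4 is geometrically closer to the first cluster, but the cannot-link constraint with flow 0 forces it into the second one.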

5.2. Network Behavior Profile Construction. After app identification, the network behavior profiles of apps are constructed from app sessions. In AppFA, we define an app's network behavior profile as a set of chosen network traffic features, as listed in Table 4. Formally, the network behavior profile is defined as follows:

$$\mathit{netBehaviorProfile} := \{\mathit{feature}_1, \ldots, \mathit{feature}_n\}, \quad \mathit{feature}_i \in \{\text{connection features}, \text{data features}\}. \tag{1}$$

Since malicious apps need to establish network connections to transmit confidential data or carry out defined attack steps [18], in this work we mainly choose connection features and data features for constructing apps' network behavior profiles, as shown in (1). The connection features describe how many network connections have been established, and the data features represent characteristics of packets. The full set of selected features is listed in Table 4. In the table, the first two lines are connection features and the rest are data features.

Furthermore, to overcome the misclassification problem (as illustrated in Section 5.1) and to distinguish minor traffic variations from significant differences, we do not use these features directly. Instead, KPCA (Kernel Principal Component Analysis) is applied to transform basic features such as packet size and packet time interval. Basically, KPCA


Input: data set $F$, cluster number $p$, must-link constraints $L_m \subseteq F \times F$, cannot-link constraints $L_c \subseteq F \times F$
Output: flow clusters
(1) let $c_1, \ldots, c_p$ be the initial cluster centers
(2) for each flow $f_i$ in $F$ do
(3)     select the closest cluster $c$
(4)     for each $(f_i, f) \in L_m$ do
(5)         if $f \notin c$ then
(6)             goto step (2)
(7)         end if
(8)     end for
(9)     for each $(f_i, f) \in L_c$ do
(10)        if $f \in c$ then
(11)            goto step (2)
(12)        end if
(13)    end for
(14)    assign $f_i$ to the cluster $c$
(15) end for
(16) for each cluster $c_i$ do
(17)    update its center by averaging all of the flows $f_j$ that have been assigned to it
(18) end for
(19) iterate between step (2) and step (18) until convergence
(20) return $c_1, \ldots, c_p$

Algorithm 2: Constrained flow clustering algorithm.

is one approach to generalizing linear PCA to the nonlinear case using the kernel method. KPCA has been shown to perform well in feature extraction and to be robust to noise [37]. The details of KPCA as used in AppFA are described as follows.

First, the Gaussian function defined in (2) is selected as the kernel function:

$$g(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right), \tag{2}$$

where the value of $\gamma$ is set to 0.001, as indicated in [37]. Then we compute the Gram (kernel) matrix $K$ with

$$K_{ij} = g\left(x^{(i)}, x^{(j)}\right). \tag{3}$$

Next, the kernel matrix $K$ is centered via

$$K_{\text{centered}} = K - 1_N K - K 1_N + 1_N K 1_N = (I - 1_N) K (I - 1_N), \tag{4}$$

where $1_N$ is an $N \times N$ matrix with all elements equal to $1/N$ and $N$ is the number of data points.

After that, the nonzero eigenvalues $\lambda_i$ and the eigenvectors $\alpha_i$ of the centered kernel matrix $K_{\text{centered}}$ are calculated as follows:

$$K_{\text{centered}}\, \alpha_i = \lambda_i \alpha_i. \tag{5}$$

The eigenvectors $\alpha_i$ are then normalized as

$$\alpha_i = \frac{1}{\sqrt{\lambda_i N}}\, \alpha_i. \tag{6}$$

Finally, we sort the eigenvectors in descending order of the corresponding eigenvalues and project onto the given subset of eigenvectors. This step can be represented as follows:

$$d_j = \sum_{i=1}^{N} \alpha_{ji}\, g(x, x_i), \quad j = 1, 2, \ldots, k, \tag{7}$$

where $k$ is the dimension of the new data.

For each app, the length of its network behavior profile is $1 + 2(f_o + f_i) + 2(k_1 + k_2)$, where $f_o$ is the number of outgoing flows and $f_i$ is the number of incoming flows.
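The KPCA steps in (2)–(7) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code, which uses scikit-learn (whose `sklearn.decomposition.KernelPCA` class offers an RBF kernel with a `gamma` parameter). The function names and the sample packet sizes are hypothetical, and the eigenvectors are scaled by $1/\sqrt{\lambda_i}$ for eigenvalues of the centered kernel matrix itself, which may differ from (6) by a constant factor.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.001):
    # Eqs. (2)-(3): K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kpca_fit_transform(X, k, gamma=0.001):
    n = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)
    one_n = np.full((n, n), 1.0 / n)
    # Eq. (4): center the kernel matrix in feature space.
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eq. (5): eigen-decomposition of the symmetric centered matrix.
    lam, alpha = np.linalg.eigh(Kc)
    # Keep the k largest eigenvalues (k must not exceed the number of positive ones).
    order = np.argsort(lam)[::-1][:k]
    lam, alpha = lam[order], alpha[:, order]
    # Normalization (assumed convention, see lead-in): scale by 1/sqrt(lambda).
    alpha = alpha / np.sqrt(lam)
    # Eq. (7) evaluated on the training points themselves.
    return Kc @ alpha

# Hypothetical packet sizes (bytes) of one app's outgoing traffic, one feature per sample.
sizes = np.array([[60.0], [1500.0], [52.0], [1400.0], [600.0], [80.0]])
Z = kpca_fit_transform(sizes, k=2)
print(Z.shape)
```

The first $k_1$ (or $k_2$) columns of such a projection are what Table 4 refers to as KPCA transformation results.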

5.3. Malicious App Detection. With apps' network behavior profiles, we propose to use peer group analysis [38] to detect malicious apps. The main idea of the detection method is illustrated in Figure 5. An app's network behavior profile is compared with both its historical data and the profiles of its peer group for malware detection. Comparison with the historical profiles can help us find self-updating malicious apps [7], and comparison with the profiles of its peer group can detect repackaged malicious apps, as observed in Section 2.

For AppFA, an app's peer group is defined as the set of its behavior-similar apps. Apps are considered behavior-similar if they satisfy the following two conditions: (A) their main functionality is similar, for example, all for mailing; (B) their network behavior profiles are similar. If either of the two conditions is not met, the peer group will be empty and we will only compare apps' profiles with their historical data.


Figure 5: Illustration of the procedure of malicious app detection. The numbers in angle brackets are examples of features listed in Table 4. (Diagram content: App1 ⟨12, 100, 5000⟩ is compared with its historical data ⟨11, 110, 5020⟩, ⟨10, 130, 4600⟩, ⟨12, 100, 5000⟩, ⟨12, 101, 5010⟩ and with its peer group App2 ⟨20, 400, 6000⟩, App3 ⟨30, 500, 7000⟩, App4 ⟨25, 450, 6500⟩, App5 ⟨19, 370, 5580⟩. The profile is similar to the history, so App1 has not been injected with malicious code. The profile is significantly different from its peer group (behavior-similar apps), so App1 may be a fake app. The peer group of App1 is then updated.)

Figure 6: Example of similar apps recommended by Google Play.

In practice, one can resort to app stores such as Google Play to find candidates that meet condition (A). Figure 6 shows that, when viewing the details of an app, Google Play recommends similar ones. These similar apps are determined by several features, such as the category of the apps, keywords in the title and description, and the size of the apk (https://www.quora.com/How-does-the-Google-Play-Store-determine-similar-apps). Generally, these recommended apps belong to the same category and have similar functionality. Therefore, in order to determine the peer group of an app, we first use the similar apps recommended by app stores such as Google Play as candidates and later filter them by condition (B).

For condition (B), the Euclidean distance is used to measure the similarity of network behavior profiles. Suppose $A$ is the app to be analyzed and $B$ is an app satisfying condition (A) for $A$. The similarity of network behavior profiles between $A$ and $B$ is calculated as

$$d(A, B) = \sqrt{(\mathit{nBP}(A) - \mathit{nBP}(B)) (\mathit{nBP}(A) - \mathit{nBP}(B))^{T}}, \tag{8}$$

where $\mathit{nBP}$ is the abbreviation of $\mathit{netBehaviorProfile}$ defined in (1).

Once the distances between $A$ and the recommended apps are calculated by (8), the results are sorted in order of increasing distance (i.e., decreasing similarity). The first $g$ apps are selected as the peer group members of $A$, as shown in (9). The optimal value of $g$ can be determined experimentally, as discussed in the next section. Note that the peer group members can be updated. For example, if one of the peer group members has been flagged as malicious, it will be removed from the peer group and a new one will be added. AppFA can also reselect the peer groups every time interval $T$.

$$\mathit{peerGroup}(A) := \{\mathit{app}_1, \ldots, \mathit{app}_g\}, \quad \text{where } d(A, \mathit{app}_i) \le d(A, \mathit{app}_j) \text{ if } i < j. \tag{9}$$
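The peer group selection of (8) and (9) amounts to ranking candidate apps by Euclidean distance and keeping the $g$ closest. A minimal Python sketch follows; the function name `select_peer_group` is hypothetical, and the profiles are the illustrative three-feature vectors shown in Figure 5, not measured data.

```python
import math

def select_peer_group(profile, candidates, g):
    """Return the names of the g candidates closest to `profile`.

    candidates: dict mapping app name -> profile of the same length.
    """
    # Eq. (8): Euclidean distance between network behavior profiles.
    ranked = sorted(candidates, key=lambda name: math.dist(profile, candidates[name]))
    # Eq. (9): keep the first g apps in order of increasing distance.
    return ranked[:g]

app1 = [12, 100, 5000]
candidates = {
    "App2": [20, 400, 6000],
    "App3": [30, 500, 7000],
    "App4": [25, 450, 6500],
    "App5": [19, 370, 5580],
}
peers = select_peer_group(app1, candidates, g=2)
print(peers)  # ['App5', 'App2']
```

Because raw features have very different scales, the byte-volume feature dominates the ranking in this toy example.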

As illustrated in Figures 2 and 5, for an identified app, AppFA first constructs its network behavior profile and then


compares the profile with both its historical profiles and the profiles of its peer group for malicious app detection. Denote by $o$ the profile of the identified app $A$ and by $G$ the matrix of the compared profiles (historical profiles or peer group profiles). Each column of $G$ is one profile, and the column length is $l$, where $l = 1 + 2(f_o + f_i) + 2(k_1 + k_2)$. AppFA makes the feature vectors the same length by padding zeros.

For peer group analysis, $o$ and $G$ are first normalized by the min-max normalization procedure defined in

$$o_i = \frac{o_i}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \quad G_{ij} = \frac{G_{ij}}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \quad i = 1, 2, \ldots, l, \; j = 1, 2, \ldots, g. \tag{10}$$

Then the distance between $o$ and $G$ is calculated by the Mahalanobis distance

$$\tau = \sqrt{(o - m)^{T} S^{-1} (o - m)}, \tag{11}$$

where $m$ is the weighted mean vector of profiles,

$$m_{i1} = \sum_{j=1}^{g} w_{Aj} G_{ij}, \quad 1 \le i \le l, \tag{12}$$

and $S$ is the covariance matrix, calculated as

$$S = E\left[(G - E[G]) (G - E[G])^{T}\right]. \tag{13}$$

In (12), $w_{Aj}$ is the weight of the $j$th closest peer group member of the analyzed app $A$. The weights of peer group members are obtained from their proximity to $A$. In detail, we define the proximity of the $j$th closest peer group member of the target app $A$ as

$$\mathit{prox}_{Aj} = \exp(-d(A, j)), \tag{14}$$

where $d(A, j)$ is the Euclidean distance, defined in (8), between app $A$ and its $j$th closest peer group member. Based on this proximity measure, the weight of the $j$th closest peer group member of app $A$ is defined as

$$w_{Aj} = \frac{\mathit{prox}_{Aj}}{\sum_{j=1}^{g} \mathit{prox}_{Aj}}. \tag{15}$$

Finally, the state of an app is judged as follows:

$$\begin{cases} \tau > t_s, & \text{malicious app}, \\ \tau \le t_s, & \text{normal app}, \end{cases} \tag{16}$$

where $t_s$ is the threshold, which can be set to different values by network administrators based on the actual network conditions. We evaluated different values of $t_s$ in our experiments.
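The detection steps in (10)–(16) can be put together in a short NumPy sketch. It is an illustration under assumptions rather than the authors' implementation: the function names and toy profiles are hypothetical, the pseudo-inverse replaces $S^{-1}$ because the covariance of a handful of peer profiles is often singular, and the smallest distance is subtracted before exponentiation, which leaves the weights in (15) unchanged but avoids numerical underflow.

```python
import numpy as np

def peer_group_score(o, G, dists):
    """Mahalanobis-style score tau of profile o against peer profiles G.

    o: profile of length l; G: l x g matrix, one peer profile per column;
    dists: Euclidean distances d(A, j) from eq. (8), one per column of G.
    """
    o = np.asarray(o, dtype=float)
    G = np.asarray(G, dtype=float)
    d = np.asarray(dists, dtype=float)
    # Eq. (10): divide each feature by its maximum over o and the peers.
    scale = np.maximum(o, G.max(axis=1))
    scale[scale == 0] = 1.0  # guard against all-zero (padded) features
    o_n, G_n = o / scale, G / scale[:, None]
    # Eqs. (14)-(15): proximity weights (shifted by d.min() for stability).
    prox = np.exp(-(d - d.min()))
    w = prox / prox.sum()
    m = G_n @ w  # Eq. (12): weighted mean profile
    # Eq. (13): covariance of the peer profiles.
    centered = G_n - G_n.mean(axis=1, keepdims=True)
    S = centered @ centered.T / G_n.shape[1]
    diff = o_n - m
    # Eq. (11), with the pseudo-inverse as an assumption for singular S.
    tau2 = diff @ np.linalg.pinv(S) @ diff
    return float(np.sqrt(max(tau2, 0.0)))

def is_malicious(tau, t_s):
    # Eq. (16): flag the app when tau exceeds the threshold.
    return tau > t_s

G = [[1, 2, 3, 4], [2, 1, 4, 3]]  # toy data: 2 features, 4 peers
tau_near = peer_group_score([2.5, 2.5], G, dists=[1, 1, 1, 1])
tau_far = peer_group_score([40, 30], G, dists=[1, 1, 1, 1])
print(tau_near, tau_far, is_malicious(tau_far, t_s=2.0))
```

A profile equal to the peers' mean scores close to zero, while a profile far outside the peers' range scores well above the threshold.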

Figure 7: Experimental setup for the detection of mobile malicious apps. (Diagram labels: samples (apps); testing mobile devices (HUAWEI and MI phones); server running Ubuntu 16.04 with functions (i) access point, (ii) traffic collection, and (iii) data analysis; Internet.)

6. Evaluation

6.1. Data Collection

Experimental Setup. The experimental setup used for AppFA is shown in Figure 7. We have implemented a prototype of AppFA with the help of ourmon [39] and CCCG [40]. Ourmon is an open-source network monitoring and anomaly detection system, and CCCG is a general framework for constrained clustering. An Ubuntu 16.04 computer has been configured as the access point. For packet capturing, the smart phone is connected to the Internet by WiFi, and network traffic is collected at the access point by tcpdump (http://www.tcpdump.org). Each pcap file is limited to 100 MB. When packet capture is complete, the TCP and UDP flows are split from the pcap files by SplitCap (https://www.netresec.com/?page=SplitCap). For apps' traffic clustering and network behavior profile construction, we use tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to extract IP addresses, packet sizes, and packet interval times from network flows. After that, a Python program calculates the statistical features listed in Tables 3 and 4. The KPCA transformation is accomplished with the help of scikit-learn (http://scikit-learn.org). The data analysis is also completed on the Ubuntu 16.04 computer (with 4 GB memory and a Pentium Dual-Core T4500 CPU).

Traffic Generation. We use the public MalGenome [19] dataset in our experiments. Since we mainly focus on repackaging and updating malware, 93 typical information collection malware samples, including repackaging and updating attack types, are selected from MalGenome to test the detection rate of AppFA (the malicious apps dataset was downloaded from http://www.malgenomeproject.org in 2015; the selected samples are the ones that run without any errors). We have installed all these malicious apps on a HUAWEI Honor 8 and a MI Note 4 phone and run them one by one 50 times to collect network traffic. In detail, these malware samples are analyzed and run by GroddDroid [41] to make sure that malicious code will be triggered. We also use GroddDroid to run other malware besides the selected ones, and their traffic will be mainly used in the local detection as described in


Section 6.3. Particularly, background traffic, including weather forecasting, email checking, QQ, and Webchat (QQ and Webchat are the most famous apps in China), is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

For testing the false positive rates of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware, the benign apps are run by droidbot [42] to make sure that enough traffic is generated. From the captured traffic, we take all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the packet filter model is set to $-1$.

The data used in the experiments are summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware; thus it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is used to get the ground truth of the originator of network traffic. Recall that, in Algorithm 1, app traffic clustering must be started with signature seeds. In order to obtain apps' signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of cluster number $p$ in Algorithm 2 is set to 97 (97 = 93 + 4), since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and Webchat. The experimental results are shown in Table 6.

As shown in Table 6, when 5 HTTP flows are chosen for each app to generate signature seeds, only 601 of the 223,220 flows (about 0.27%) are misclustered. This indicates the effectiveness of our proposed method for app traffic clustering.

Experimental Results of Malicious App Detection. Similar to previous work, in order to measure the effectiveness of malicious app detection, the accuracy (detection rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively:

$$\text{Accuracy} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{17}$$

$$\text{False positive rate} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}. \tag{18}$$
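As a quick worked example of (17) and (18), the two metrics can be computed as below. The function names are hypothetical. The first call uses the counts reported in the text for the $p = 99$ case (82 of the 93 selected malicious apps detected); the false positive counts are made-up numbers for illustration only.

```python
def detection_rate(tp, fn):
    """Eq. (17): share of malicious apps that are flagged."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Eq. (18): share of benign samples that are wrongly flagged."""
    return fp / (fp + tn)

# 82 of the 93 selected malicious apps are detected, so 11 are missed.
rate = detection_rate(tp=82, fn=93 - 82)
print(f"{rate:.1%}")  # 88.2%

# Hypothetical counts, for illustration only: 2 of 1000 benign runs flagged.
fpr = false_positive_rate(fp=2, tn=998)
print(f"{fpr:.1%}")  # 0.2%
```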

We first examine the detection rates and false positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). For the selected malicious apps, the apps with the same functionality (satisfying condition (A)) are determined by Google Play. We first analyze the malicious apps' functionalities manually and choose a keyword for each app. Then

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). (Plot: detection rate, from 0.5 to 1, versus $t_s$, from 0.4 to 2, with curves for $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.)

we search the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8 and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are good enough for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, the detection rates and the false positive rates change only slightly as $g$ increases. This can be explained by the fact that the Mahalanobis distance $\tau$ defined in (11) is calculated using the weighted mean vector of profiles: the most similar app has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.

Further, AppFA is also tested with different values of the cluster number $p$. Note that the actual cluster number is 97 for TrafficSet 1. In order to evaluate how app traffic clustering impacts the detection of repackaged malware, we slightly change the value of $p$; the experimental results are shown in Figure 12. In the experiments, we set $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. As shown in Figure 12, the cluster number indeed impacts the detection rate of AppFA. Particularly, when $p = 99$, the detection rate is only 88%, namely, only 82


Table 5: Data summary.

Traffic set | Apps | Quantity | Number of collected flows (TCP and UDP) | Amount of data (bytes)
TrafficSet 1 | Selected malicious apps (for testing accuracy of traffic clustering and detection rates) | 93 | 223220 | 538G
TrafficSet 2 | Benign apps (for testing false positive rates of AppFA) | 100 | 305103 | 640G
TrafficSet 3 | All malicious apps (for malicious apps detection in local networks) | 1260 | ≈3.1 million | 1872G

Table 6: Experimental results of traffic clustering.

Number of HTTP flows for generating signature seeds | Number of flows correctly clustered | Number of misclustered flows
2 | 201357 | 21863
3 | 218946 | 4274
4 | 221384 | 1836
5 | 222619 | 601

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). (Plot: false positive rate, from 0.0005 to 0.004, versus $t_s$, from 0.4 to 2, with curves for $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.)

malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and $96$ in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA remains effective when a few apps are unknown during app traffic clustering.

Finally, we evaluate AppFA with the remaining malicious apps besides the selected 93 samples. The rest of the malicious apps contain repackaged and other types of malware. These samples are found to have some errors when run on our phones; however, they also produced considerable network traffic. We test AppFA with these apps to examine the general capability of AppFA. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results

Figure 10: Detection rates of AppFA with different values of $g$. (Plot: detection rate, from 0 to 1, for $g$ = 1, 3, 5, 7, and 10.)

Figure 11: False positive rates of AppFA with different values of $g$. (Plot: false positive rate, from 0 to $7 \times 10^{-3}$, for $g$ = 1, 3, 5, 7, and 10.)

displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons are discussed in Section 6.4.

Note that AppFA is designed to perform nearly real-time malicious app detection; thus efficiency is also a major concern. Table 8 presents the computational performance for major


Figure 12: Detection rates of AppFA with different values of $p$. (Plot: detection rate, from 0 to 1, for $p$ = 95, 96, 98, and 99.)

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

Detection rate | False positive rate
73.4% | 3.5%

Table 8: The computational performance.

Procedures | Time
Constrained flow clustering (Algorithm 2) | 3 min
App identification (Algorithm 1) | 5.2 min
Network behavior profile construction | 1.7 min
Peer group analysis | 45 sec

steps in terms of average running time using the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, such as cloud resources, can further be used to speed up malicious app detection and scale to more Internet traffic data.

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses statistical traffic features to detect malicious Android apps. Thus, AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we have created a new Android repackaged malware that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot's source code and make sure its network connections are encrypted by the TLS protocol. Then we graft it into a popular game app, air.com.aceviral.motox3. Finally, we collect network traffic of the repackaged air.com.aceviral.motox3 and its peer group. In the experiments,

Table 9: Experimental results when searching similar apps locally.

Detection rate | False positive rate
69.2% | 4.7%

the selected peer group of the repackaged air.com.aceviral.motox3 is com.wordmobiles.bikeRacing, com.topfreegames.bikeracefreeworld, com.skgames.trafficrider, com.tomico.wheeliechallenge, and madmotoracingbike. We repeat the detection of the repackaged air.com.aceviral.motox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps' peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed and is shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities, and thus their context (keywords) may be similar. Therefore, we use web-page searching technologies to match similar apps. The $\mathit{Matching}$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) with the results returned by Algorithm 3.
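One possible realization of the keyword matching in Algorithm 3 is sketched below in plain Python: keywords are weighted by TF-IDF and apps are ranked by cosine similarity. The smoothed IDF formula, the use of cosine similarity as the matching score, the function names, and the keyword lists are all assumptions for illustration; the paper only states that TF-IDF can be used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of keyword lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
        vectors.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similar_apps(target_keywords, identified, m):
    """identified: dict app name -> keyword list. Returns the m best matches."""
    names = list(identified)
    vecs = tfidf_vectors([target_keywords] + [identified[name] for name in names])
    # Lines (3)-(7) of Algorithm 3: score every identified app against the target.
    scores = [(cosine(vecs[0], v), name) for v, name in zip(vecs[1:], names)]
    # Lines (8)-(9): sort by decreasing score and return the first m apps.
    scores.sort(key=lambda s: s[0], reverse=True)
    return [name for _, name in scores[:m]]

# Hypothetical keyword sets extracted from HTTP headers.
identified = {
    "racerA": ["bike", "race", "track"],
    "mailB": ["mail", "inbox", "smtp"],
    "gameC": ["game", "score", "level", "bike"],
}
top = similar_apps(["bike", "race", "game", "score"], identified, m=2)
print(top)  # ['gameC', 'racerA']
```

The returned candidates would then be filtered by the profile distance in (8), as for candidates obtained from an app store.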

We have implemented the local similar app searching and evaluated AppFA's performance in local networks. All apps, including all malicious and benign apps, are considered as identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that the choice of proper peer groups significantly affects the final detection.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be due to the fact that other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to the packet sizes and the number of flows, so the detection rate may decline if the apps' network behaviors are only slightly changed.

Compared to the latest work in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9%, as reported in [6]). However, Garg et al.'s method [6] considered only 18 malware apps and 14 genuine apps and assumed


(1) extract all plaintext keywords from HTTP headers of the tested app $A$; denote the set of keywords as $W(A)$
(2) extract all plaintext keywords from HTTP headers of the other identified apps; denote the set of keywords of identified app $i$ as $W(i)$, $i = 0, 1, \ldots, n - 1$
(3) while $i < n$ do (starting from $i = 0$)
(4)     $\mathit{Scores}[i].\mathit{value} = \mathit{Matching}(W(A), W(i))$
(5)     $\mathit{Scores}[i].\mathit{app} = i$
(6)     $i{+}{+}$
(7) end while
(8) sort $\mathit{Scores}[\,]$ by $\mathit{Scores}[i].\mathit{value}$ in decreasing order
(9) return the first $m$ entries $\mathit{Scores}[i].\mathit{app}$, $i = 0, 1, \ldots, m - 1$

Algorithm 3: Searching similar apps locally.

Table 10: Experimental results when it is exactly known which traffic is generated by which app. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

Metric | Selected malicious apps | Remaining malicious apps
Detection rate | 98.3% | 85%
False positive rate | 0.03% | 2.7%

that the traffic generated by apps was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, a much larger sample set is taken into account, and constrained clustering is exploited to determine to which app the flows belong. The accuracy of clustering is also a factor affecting the detection rate. To confirm this, we further make the same assumption as Garg et al.'s method [6]; namely, we assume that the accuracy of Algorithm 1 is 100% and that we know exactly which traffic is generated by which app (in our experiments, we use the app Packet Capture to obtain perfect ground truth of which flows came from which app). Under this assumption, the experimental results are given in Table 10. The detection rates are clearly improved.

In practice, one can deploy both our method and other methods such as [6, 15] simultaneously. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps, whereas [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. Our work can thus complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features of network traffic (refer to Table 4) to detect Android malicious repackaged applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. However, this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus the attackers must remove some normal network connections to keep the network behaviors the same and evade detection. However, the removal of normal network connections will impact the functionalities of apps and may cause errors, which gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may itself be abnormal. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then Kernel Principal Component Analysis is employed to construct app network behavior profiles and distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps and avoid time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus, it is convenient to use and its cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

Apps' network behaviors may be significantly changed by version updates and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61702282 and 61502250, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016, CX2016026).


References

[1] Android statistics, "Number of android applications," September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, "Assessing Privacy Risks in Android: A User-Centric Approach," in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] "Does your mobile anti-virus app protect or infect you? The truth behind DU Antivirus Security," September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, "A survey of mobile malware in the wild," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, "The evolution of android malware and android analysis techniques," ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, "Network-based detection of Android malicious apps," International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, "Mobile malware detection through analysis of deviations in application network behavior," Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, "PBMDS: a behavior-based malware detection system for cellphone devices," in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec '10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: behavior-based malware detection system for android," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, "MADAM: a multi-level anomaly detector for android malware," in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., "Vetting undesirable behaviors in android apps with permission use analysis," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, "Android Malware Clustering Analysis on Network-Level Behavior," in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, "Malware detection on mobile devices using distributed machine learning," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, "Bring your own device security issues and challenges," in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, "Simple and effective method for detecting abnormal internet behaviors of mobile devices," Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] "Global mobile os market share in sales to end users from 1st quarter 2009 to 1st quarter 2017," 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] "It threat evolution q3 2017 statistics," November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., "Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting," IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, "Dissecting android malware: characterization and evolution," in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, "AndroSimilar: Robust signature for detecting variants of Android malware," Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the anserverbot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., "Android security: a survey of issues, malware penetration, and defenses," IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, "A survey on dynamic mobile malware detection," Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, "Behavior-based malware detection on mobile phone," in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, "Exposing mobile malware from the inside (or what is your mobile app really doing?)," Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, "The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones," in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, "Identifying diverse usage behaviors of smartphone apps," in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC '11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, "NetworkProfiler: Towards automatic fingerprinting of Android apps," in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., "Flowr: A self-learning system for classifying mobile application traffic," in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., "Automatic generation of mobile app signatures from traffic observations," in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., "Automatically identifying apps in mobile traffic," Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, "Identifying Mobile Applications for Encrypted Network Traffic," in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, "A novel active website fingerprinting attack against Tor anonymous system," in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, "A novel application classification attack against Tor," Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, "I know why you went to the clinic: Risks and realization of HTTPS traffic analysis," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, "Internet traffic classification using constrained clustering," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, "A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine," Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, "Stock fraud detection using peer group analysis," Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] "Ourmon: network monitoring and anomaly detection system," April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, "Constrained Clustering Using Column Generation," in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, "GroddDroid: A gorilla for triggering malicious behaviors," in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, "DroidBot: A lightweight UI-guided test input generator for android," in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, "An information-theoretic perspective of tf-idf measures," Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.


Figure 1: Example of network flows of Android app com.baidu.BaiduMap. The traffic is captured and analyzed by NetworkMiner. Note that there exist several encrypted SSL flows and that their number is larger than that of the HTTP flows.

Table 1: Comparison of network behaviors of original app com.camelgames.mxmotor and its repackaged version (repackaged by AnserverBot).

App               Total packet sizes    Number of TCP connections
Repackaged app    23Mb                  18
Original app      1932 kb               11

Table 2: Comparison of network behaviors of similar apps.

App                                   Total packet sizes    Number of TCP connections
com.camelgames.mxmotor                1932 kb               11
air.com.aceviral.motox3m              1828 kb               14
com.skgames.trafficrider              1751 kb               13
com.topfreegames.bikeracefreeworld    1327 kb               9

in Section 5. Meanwhile, a novel constrained clustering algorithm is elaborated for app traffic clustering. Thus, our method can be applied on the network straightforwardly and does not need to install programs on mobile devices to collect flow information.

3. Related Work

There has been extensive work on detecting malicious mobile apps. The literature [4, 5, 22, 23] surveys mobile malware in the wild and the techniques proposed for detecting it. In this section, we mainly focus on behavior-based malware detection methods and only review the most related ones.

Generally, current behavior-based mobile malware detection approaches can be categorized into two main groups: client-side and server-side detection. Client-side detection approaches run locally and apply anomaly methods on the set of features which indicate the state of the app. The pBMDS [8] is based on correlating user inputs with system calls to detect anomalous activities. A Hidden Markov Model (HMM) is used to learn application and user behaviors from two major aspects: process state transitions and user operational patterns. Built upon these two aspects, the pBMDS identifies behavioral differences between user-initiated applications and malware-compromised ones. Zhang et al. [11] combined dynamic tracing of the permission requests for resource usage by applications with tracking of sensitive operations on the granted resources (using taint tracking). This combination enabled them to understand how applications utilize the permissions to access sensitive system resources. Dai et al. [24] presented a malware detection system for the Windows Mobile platform. They used API interception techniques for monitoring and analyzing the application's behavior and compared it to the patterns within a predefined library of malicious behavior characteristics. Shabtai et al. [7] presented a behavior-based anomaly detection system for detecting meaningful deviations in a mobile application's network behavior. A semisupervised C4.5 Decision Tree algorithm was used for learning the normal behavioral patterns and for detecting deviations from the application's expected behavior. Their methods were implemented and evaluated on Android devices. Damopoulos et al. [25] proposed a fully fledged tool able to dynamically analyze any iOS software in terms of method invocation, which can be used to trace the software's behavior and decide whether it contains malicious code.

Server-side detection approaches are carried out on remote servers, mainly motivated by the limited computational resources of the mobile device [26]. Burguera et al. [9] developed an Android framework named "Crowdroid" that includes a client application installed on the device. The application monitors Linux kernel system calls and sends them to a centralized server after preprocessing. On the server, a dataset is built from the list of the system calls, the list of running applications, and the device information. The $K$-Means algorithm is then used for clustering the applications into two groups, that is, benign and malware applications. Shamili et al. [13] utilize a distributed Support Vector Machine algorithm for malware detection on a network of mobile devices. Features related to phone calls, SMSs, and data communication are used for detection. During the training phase, support vectors (SVs) are learned locally on each device and then sent to the server, where SVs from all of the client devices are aggregated. Finally, the server distributes the whole set of SVs to all the clients, and each client updates its own SVs.

Unlike the work mentioned above, our proposed system runs at the network level directly, without necessarily having access to the mobile devices. The most similar work was carried out by Chen et al. [15]. Their method was also implemented at the network level and identified abnormal network behaviors by a 3-step check: (1) identifying HTTP POST and HTTP GET packages, (2) checking whether the device was exposing unique device identifiers such as IMEI and IMSI, and (3) determining the legitimacy of the remote server by querying the domain name server. Consequently, their method can only be applied to HTTP traffic. In contrast to the work of Chen et al. [15], we use the combination of signature matching and constrained mobile network traffic clustering for app identification, and thus our method is suitable for both plaintext and ciphertext traffic such as HTTPS. The work introduced in [6] is also intended to detect Android malware from network traffic. In Garg et al. [6], detection features were first extracted from DNS, HTTP, and TCP traffic, and then machine learning algorithms (such as Decision Trees, Bayesian Networks, and Random Forests) were used to detect malicious mobile apps. However, their detection features were obtained in the mobile device, and they needed to train classification models offline. In AppFA, we straightforwardly extract detection features on the network to build apps' network behavior profiles and use peer group analysis to avoid offline model training.

Figure 2: Main components of AppFA. The modules shown are Packet Filter, Session Builder, Packet Content Extractor, Basic Feature Extractor, Identification Feature Generator, App Traffic Clustering, Profile Feature Generator, App Behavior Profile Constructor, and Malicious App Detection; flow sessions and app sessions are exchanged between them.

Recently, an increasing number of research works have analyzed network traffic to identify apps, such as [27–32]. However, most of them focused on plaintext flows (e.g., HTTP) and tried to collect identification features from HTTP headers. These methods may fail due to the emergence of encrypted network traffic. In this paper, we investigate a new constrained clustering method to cluster the network traffic generated by the same app.

4. System Design

The system architecture of AppFA is shown in Figure 2. It is of modular design, and each module has a specific function. The Packet Filter module captures link-layer frames or reads them from a file and filters them according to configurable rules. In order to detect malicious apps online and provide support for network management, AppFA is designed to analyze the first $n$ nonzero packets (packets that contain application data) of each network flow. The parameter $n$ is configurable for different network management purposes. For example, the value of $n$ can be set to a small positive number such as 50 for real-time malicious app detection and to $-1$ for full analysis, namely, considering all nonzero packets in flows. In AppFA, the packet filter module can be implemented based on a well-known library such as libpcap (http://www.tcpdump.org/release/).

The Session Builder module organizes network traffic into sessions. For app identification and malicious behavior detection, we define two types of sessions: flow session and app session. The flow session is defined by the ⟨source IP, source port, destination IP, destination port, transport protocol⟩ tuple, where source and destination can be swapped and the transport protocol is mainly considered as TCP or UDP in this work. In deployment, AppFA determines that a flow session is completed when one of the following three conditions holds: (1) $n$ nonzero packets have been received; (2) an RST/FIN packet is detected; (3) a timeout occurs, for example, no packet is exchanged for 3 minutes. With flow sessions, one can extract basic features and packet contents efficiently for app identification and malicious behavior detection. The app session is defined as the set of all flow sessions generated by the same app: {flow session 1, flow session 2, ..., flow session $m$}. The app sessions constructed by the Session Builder module are used by the Profile Feature Generator module to construct app network behavior profiles, as depicted on the right of Figure 2.
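The flow-session definition and its three completion conditions can be sketched as follows. This is a minimal illustration only, not AppFA's implementation: the names `flow_key`, `FlowSession`, and `IDLE_TIMEOUT_S` are hypothetical, and packets are assumed to arrive already parsed into a timestamp, a payload length, and TCP flags.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_S = 180.0  # hypothetical constant: "no packet for 3 minutes"


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent key: source and destination may be swapped."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)


@dataclass
class FlowSession:
    n: int = 50               # packet budget; -1 means "analyze all packets"
    nonzero_packets: int = 0  # packets that carried application data
    last_seen: float = 0.0
    closed: bool = False

    def add_packet(self, ts, payload_len, rst=False, fin=False):
        """Record one packet and apply completion conditions (1) and (2)."""
        self.last_seen = ts
        if payload_len > 0:
            self.nonzero_packets += 1
        if rst or fin:
            self.closed = True
        elif self.n != -1 and self.nonzero_packets >= self.n:
            self.closed = True

    def is_complete(self, now):
        """Condition (3): a session also completes after an idle timeout."""
        return self.closed or (now - self.last_seen) > IDLE_TIMEOUT_S
```

Sorting the two endpoints inside the key is one simple way to map both directions of a connection to the same session; other canonical orderings would work equally well.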

Figure 3: Example of key-value pairs.

The Basic Feature Extractor module extracts basic packet features from flow sessions, which include packet size, packet interarrival time, packet order, and packet direction. The packet direction feature distinguishes outgoing from incoming packets. Other advanced features, such as flow duration, total packets sent and received, and burst sizes, commonly used in traffic analysis [33–35], can be calculated from these basic features. We therefore extract these basic features first and then generate appropriate advanced features (identification or detection features) for different traffic analysis purposes (clustering or peer group comparison). After the basic feature extraction, a flow may look like +50 *30 +100 *500 -1300 *20 -400, where outgoing and incoming packet sizes are denoted by positive and negative signs, and the packet interarrival times are labelled by an asterisk. The packet order features are also reflected inherently by the number sequence. For example, for the flow +50 *30 +100 *500 -1300 *20 -400, the first packet is an outgoing packet whose size is 50 bytes, and the second packet is also an outgoing packet, sent 30 ms after the first. The third packet is an incoming packet whose size is 1300 bytes. The time interval between the second and the third packet is 500 ms, and so forth.
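As an illustration of how advanced features can be derived from this signed-sequence representation, the sketch below parses a flow string and computes a few aggregate features. The function names and the returned dictionary layout are assumptions made for this example; the paper does not prescribe a data format beyond the sequence itself.

```python
import statistics


def parse_flow(encoded):
    """Split '+50 *30 +100 ...' into signed sizes (bytes) and gaps (ms)."""
    sizes, gaps = [], []
    for token in encoded.split():
        if token.startswith("*"):
            gaps.append(float(token[1:]))
        else:
            sizes.append(int(token))  # int() accepts '+50' and '-1300'
    return sizes, gaps


def summarize(values):
    """Min, max, mean, and population standard deviation of a sequence."""
    if not values:
        return (0, 0, 0.0, 0.0)
    return (min(values), max(values),
            statistics.fmean(values), statistics.pstdev(values))


def advanced_features(encoded):
    """Derive aggregate features from the basic per-packet features."""
    sizes, gaps = parse_flow(encoded)
    abs_sizes = [abs(s) for s in sizes]
    return {
        "packets": len(sizes),
        "bytes": sum(abs_sizes),
        "outgoing_packets": sum(1 for s in sizes if s > 0),
        "incoming_packets": sum(1 for s in sizes if s < 0),
        "duration_ms": sum(gaps),
        "size_stats": summarize(abs_sizes),
        "gap_stats": summarize(gaps),
    }
```

For the example flow in the text, `advanced_features("+50 *30 +100 *500 -1300 *20 -400")` reports 4 packets (2 outgoing, 2 incoming), 1850 bytes, and a duration of 550 ms.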

Note that the packet contents are also saved through the Packet Content Extractor module for app identification. Similar to [30], we focus on key-value pairs in HTTP headers. In detail, justniffer (http://justniffer.sourceforge.net/) is first used to transform the raw packet traffic into HTTP messages. Then, HTTP messages are tokenized by several tokenizers such as space, "\r\n", "?", and "&", and each HTTP request is broken into various parts including method, page, and query. Finally, queries are divided into key-value pairs. Figure 3 is an example of key-value pairs used in our experiments.
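The tokenization step can be approximated with the Python standard library, as in the sketch below. It is not the justniffer-based pipeline used by AppFA: the function name `request_tokens` and the sample request line in the usage note are hypothetical, and only the request line (not the other HTTP headers) is handled.

```python
from urllib.parse import parse_qsl, urlsplit


def request_tokens(request_line):
    """Break 'GET /page?k=v&... HTTP/1.1' into method, page, and key-value pairs."""
    method, target, _version = request_line.strip().split(" ", 2)
    parts = urlsplit(target)
    # keep_blank_values retains keys whose value is empty, e.g. 'uid='
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return method, parts.path, pairs
```

For instance, `request_tokens("GET /ads/fetch?app_id=com.example.game&sdk=2.1 HTTP/1.1")` yields the method `GET`, the page `/ads/fetch`, and the pairs `app_id=com.example.game` and `sdk=2.1`, which are the kind of key-value pairs that can serve as signature candidates.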

After obtaining the basic features from flows, we generate identification features in the Identification Feature Generator module. After that, we identify the app for each flow through the App Traffic Clustering module, which returns the clustering results to the Session Builder module for forming app sessions. With the app session information, detection features are created in the Profile Feature Generator module. Finally, app network behavior profiles are constructed in the App Behavior Profile Constructor module, and malicious apps are detected in the Malicious App Detection module. The independent treatment of different functional modules makes the AppFA architecture clear and scalable.

5. Methodology

As depicted in Figure 2, app traffic clustering and malicious app detection (including identification/detection feature generation) are the core components of AppFA; their technical details are clarified in the following subsections.

5.1. App Traffic Clustering. The basic idea of our app traffic clustering method is illustrated in Figure 4. In the figure, there are two apps, and their flow sessions are represented by solid and dotted lines, respectively. For each flow, signature matching is first used to identify HTTP flows that have recognized signatures. Note that, in this paper, we use the term signature for plaintext matching and feature for traffic analysis. With this step, two flows are identified (labelled as red and green), as shown in the left dashed box of Figure 4. After the signature matching, a constrained clustering algorithm (the second dashed box in Figure 4) is exploited to cluster all flows. Compared to ordinary clustering algorithms, a constrained clustering algorithm adopts background information (i.e., identified flows) to improve clustering accuracy. By constrained clustering, flows that cannot be identified by signatures, such as encrypted traffic, will be classified into appropriate clusters, that is, apps. Finally, we obtain app sessions that will be utilized for creating apps' network behavior profiles, as depicted in the last step in Figure 4.

Formally, the entire app identification procedure is described in Algorithm 1. In the while loop, the clustering signal can be a timeout for cyclic identification (e.g., every 1 hour) or the completion of an app session for real-time identification. For the details of selecting the initial signature seeds $S_d$, one can refer to [30].

Algorithm 1 uses a method similar to the FLOWR system [30] to carry out signature matching (line (6)) and takes advantage of the constrained $K$-means clustering algorithm [36] for flow clustering (line (7)). For signature matching, the key-value pairs in HTTP headers are considered as an app's signatures, and an initial set of seeding app signatures is set up to bootstrap the learning of new ones. Compared to FLOWR, we refine the process of counting cooccurrences of app signatures with the constrained clustering results. In FLOWR, if the start time of $\mathit{flow}_2$ is less than $T$ seconds after the start time of $\mathit{flow}_1$, their signatures are considered as a cooccurrence instance. However, as noted in [30], if $T$ is overestimated, FLOWR is more likely to mix flows from different apps, thus inducing noise and overutilizing system resources. To overcome this problem, in AppFA, we further consider the clustering results, besides temporal information, when counting cooccurrences of app signatures. That is, only flows that start within $T$ seconds of each other and fall in the same cluster are counted as cooccurrences. This reduces noise since the flows are filtered by constrained clustering.
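The refined cooccurrence rule can be sketched as below, assuming each flow has already been reduced to a (start time, cluster id, signature) tuple. The function name, the tuple layout, and the pairwise scan are illustrative assumptions; neither FLOWR's nor AppFA's actual bookkeeping is reproduced here.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrence(flows, window_s):
    """Count signature pairs seen within window_s seconds in one cluster.

    Each flow is a (start_time, cluster_id, signature) tuple.
    """
    counts = Counter()
    ordered = sorted(flows, key=lambda f: f[0])  # earliest flow first
    for (t1, c1, s1), (t2, c2, s2) in combinations(ordered, 2):
        same_cluster = c1 == c2
        close_in_time = (t2 - t1) < window_s
        if same_cluster and close_in_time and s1 != s2:
            counts[frozenset((s1, s2))] += 1
    return counts
```

Dropping the `same_cluster` test gives the purely temporal rule described for FLOWR, which makes the effect of the additional cluster filter easy to compare on the same input.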

After signature matching, constrained $K$-means clustering is carried out to identify the remaining unknown flows. Algorithm 2 shows the constrained clustering algorithm exploited in AppFA. In the constrained flow clustering, the flows identified by signature matching that belong to the same app must be clustered into one cluster (must-link constraints, lines (4)–(8) in Algorithm 2), and those generated by different apps must be clustered into different clusters (cannot-link constraints, lines (9)–(13) in Algorithm 2). The clustering features are listed in Table 3 (11 features in total). We select the time of the first packet sent as one of the features because flows observed within short time intervals are likely to come from the same app [32]. The other features are chosen because they have proved effective in clustering network traffic [36].
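A compact sketch of constrained $K$-means with must-link and cannot-link constraints is given below. It follows the common COP-KMeans idea of trying clusters from nearest to farthest until one satisfies all constraints, which is an assumption of this example and differs in detail from the control flow of Algorithm 2. Flows are represented as plain numeric tuples, constraints as pairs of flow indices, and all names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def violates(i, cluster, assign, must, cannot):
    """True if putting flow i into `cluster` breaks a constraint."""
    for a, b in must:
        other = b if a == i else a if b == i else None
        # An unassigned partner cannot be violated yet.
        if other is not None and assign.get(other, cluster) != cluster:
            return True
    for a, b in cannot:
        other = b if a == i else a if b == i else None
        if other is not None and assign.get(other) == cluster:
            return True
    return False


def constrained_kmeans(flows, centers, must=(), cannot=(), max_iter=50):
    """Cluster feature vectors under must-link / cannot-link index pairs."""
    centers = [list(c) for c in centers]
    assign = {}
    for _ in range(max_iter):
        new_assign = {}
        for i, f in enumerate(flows):
            # Try clusters from nearest to farthest until one is admissible.
            order = sorted(range(len(centers)),
                           key=lambda c: sq_dist(f, centers[c]))
            for c in order:
                if not violates(i, c, new_assign, must, cannot):
                    new_assign[i] = c
                    break
            else:
                raise ValueError(f"no admissible cluster for flow {i}")
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for c in range(len(centers)):
            members = [flows[i] for i, a in assign.items() if a == c]
            if members:  # update the center by averaging its members
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [assign[i] for i in range(len(flows))], centers
```

In an AppFA-like setting, the must-link and cannot-link pairs would come from flows already labelled by signature matching, and the initial centers could be taken from one labelled flow per app.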

In Algorithm 2, we set the value of the cluster number $p$ to be equal to the number of apps identified by signature matching. This is because the accuracy of mobile app identification is already higher than 95% [29, 32], and it is proper to assume that the popular apps (malware writers usually graft malicious code onto popular apps to ensure a wide diffusion of their malicious code) can all be recognized. Therefore, each cluster will correspond to one app when the constrained $K$-means clustering is finished. This is appropriate for the following network behavior profile construction and malicious behavior detection. In practice, if the actual number of apps is greater than $p$, namely, some apps cannot be identified by signature matching, the corresponding flows will be misclustered. This may change the network behaviors of apps, since unrelated flows will be included and traffic features such as the number of packets and the volume of bytes will be enlarged. In this work, we use Kernel Principal Component Analysis to mitigate this problem, as described in the following subsection.

Figure 4: Illustration of the basic idea of app identification. Flow sessions (packet features and contents) pass through signature matching and constrained clustering to form the app sessions of App 1 and App 2. Solid and dotted lines represent different apps, and black color indicates that the flows have not been identified.

Algorithm 1: App traffic clustering procedure.

(1) let $S$ denote the signature set and $S_d$ stand for the signature seeds
(2) while the clustering signal is received do
(3)     if $S$ is empty then
(4)         $S = S_d$
(5)     end if
(6)     carry out signature matching
(7)     carry out constrained flow clustering
(8)     update $S$ with the clustering results
(9) end while

Table 3: Identification feature set.

Time of the first packet sent
Number of packets
Volume of bytes
Min, max, mean, and std. dev. of packet size
Min, max, mean, and std. dev. of interpacket time

Table 4: Network behavior profile feature set.

Number of flows
Number of packets of outgoing and incoming flows
Volume of bytes of outgoing and incoming flows
First $k_1$ results of KPCA transformations of outgoing packet sizes
First $k_2$ results of KPCA transformations of incoming packet sizes
First $k_1$ results of KPCA transformations of outgoing packet intervals
First $k_2$ results of KPCA transformations of incoming packet intervals

5.2. Network Behavior Profile Construction. After the app identification, the network behavior profiles for apps are constructed from app sessions. In AppFA, we define an app's network behavior profile as a set of chosen network traffic features, as listed in Table 4. Formally, the network behavior profile can be defined as follows:

$$\mathit{netBehaviorProfile} := \{\mathit{feature}_1, \ldots, \mathit{feature}_n\}, \quad \mathit{feature}_i \in \{\text{connection features}, \text{data features}\}. \tag{1}$$

Since malicious apps need to establish network connections to transmit confidential data or carry out defined attack steps [18], in this work, we mainly choose connection features and data features for constructing apps' network behavior profiles, as shown in (1). The connection features describe how many network connections have been established, and the data features represent characteristics of packets. The complete set of features constituting apps' network behavior profiles is listed in Table 4. In the table, the first two lines are connection features and the rest are data features.

Algorithm 2: Constrained flow clustering algorithm.

Input: Data set $F$, cluster number $p$, must-link constraints $L_m \subseteq F \times F$, cannot-link constraints $L_c \subseteq F \times F$
Output: Flow clusters
(1) Let $c_1, \ldots, c_p$ be the initial cluster centers
(2) for each flow $f_i$ in $F$ do
(3)     select the closest cluster $c$
(4)     for each $(f_i, f) \in L_m$ do
(5)         if $f \notin c$ then
(6)             goto step (2)
(7)         end if
(8)     end for
(9)     for each $(f_i, f) \in L_c$ do
(10)         if $f \in c$ then
(11)             goto step (2)
(12)         end if
(13)     end for
(14)     assign $f_i$ to the cluster $c$
(15) end for
(16) for each cluster $c_i$ do
(17)     update its center by averaging all of the flows $f_j$ that have been assigned to it
(18) end for
(19) iterate between step (2) and step (18) until convergence
(20) return $c_1, \ldots, c_p$

Furthermore, to overcome the misclassification problem (as illustrated in Section 5.1) and to distinguish minor traffic variations from significant differences, we do not use these features directly. Instead, KPCA (Kernel Principal Component Analysis) is applied to transform basic features such as packet size and packet time interval. Basically, KPCA is one approach to generalizing linear PCA to the nonlinear case using the kernel method. It has been reported that KPCA performs best in feature extraction among the methods compared in [37] and is robust to noise. The details of KPCA used in AppFA are described as follows.

First, the Gaussian function defined in (2) is selected as the kernel function:

$$g(x, y) = \exp\left(-\gamma \|x - y\|^2\right), \tag{2}$$

where the value of $\gamma$ is set to 0.001 as indicated in [37]. Then, we compute a Gram/kernel matrix $K$ with

$$K_{ij} = g\left(x^{(i)}, x^{(j)}\right). \tag{3}$$

Next, the kernel matrix $K$ is centered via the following function:

$$K_{\text{centered}} = K - 1_N K - K 1_N + 1_N K 1_N = (I - 1_N) K (I - 1_N), \tag{4}$$

where $1_N$ is an $N \times N$ matrix with all elements equal to $1/N$ and $N$ is the number of data points.

After that, the nonzero eigenvalues $\lambda_i$ and the eigenvectors $\alpha_i$ of the centered kernel matrix $K_{\text{centered}}$ are calculated as follows:

$$K_{\text{centered}}\, \alpha_i = \lambda_i \alpha_i. \tag{5}$$

Also, the eigenvectors $\alpha_i$ are normalized as

$$\alpha_i = \frac{1}{\sqrt{\lambda_i N}}\, \alpha_i. \tag{6}$$

Finally, we sort the eigenvectors in descending order of the corresponding eigenvalues and perform projections onto the given subset of eigenvectors. This step can be represented as follows:

$$d_j = \sum_{i=1}^{N} \alpha_{ji}\, g(x, x_i), \quad j = 1, 2, \ldots, k, \tag{7}$$

where $\alpha_{ji}$ denotes the $i$th component of the $j$th eigenvector and $k$ is the dimension of the new data.

For each app, the length of its network behavior profile is $(1 + 2 \cdot (f_o + f_i) + 2 \cdot (k_1 + k_2))$, where $f_o$ is the number of outgoing flows and $f_i$ is the number of incoming flows.
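The KPCA steps (2)–(7) can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than AppFA's code: the function names are hypothetical, the eigenvectors are scaled by $1/\sqrt{\lambda_i}$ (a common convention that may differ from the paper's normalization by a constant factor), and the projection is computed for the $N$ training points themselves using the centered kernel matrix.

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """Gaussian kernel g(x, y) = exp(-gamma * ||x - y||^2), as in (2)."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kpca_fit_transform(X, k, gamma=0.001):
    """Project the N rows of X onto the first k kernel principal components."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)                # Gram matrix, as in (3)
    C = np.eye(N) - np.full((N, N), 1.0 / N)   # I - 1_N
    Kc = C @ K @ C                             # centering, as in (4)
    eigvals, eigvecs = np.linalg.eigh(Kc)      # eigenproblem (5), ascending
    order = np.argsort(eigvals)[::-1][:k]      # largest eigenvalues first
    lam, alpha = eigvals[order], eigvecs[:, order]
    alpha = alpha / np.sqrt(lam)               # assumed scaling, cf. (6)
    return Kc @ alpha                          # projections, cf. (7)
```

Here each row of `X` would hold one observation of a basic feature series, such as the packet sizes of a flow padded to a common length (the padding scheme is also an assumption), and the first $k_1$ or $k_2$ columns of the result would fill the KPCA entries of Table 4.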

5.3. Malicious App Detection. With apps' network behavior profiles, we propose to use peer group analysis [38] to detect malicious apps. The main idea of the detection method is illustrated in Figure 5. An app's network behavior profile is compared with both its historical data and the profiles of its peer group for malware detection. Comparison with the historical profiles can help us find self-updating malicious apps [7], and comparison with the profiles of its peer group can detect repackaged malicious apps, as observed in Section 2.

For AppFA, an app's peer group is defined as the set of its behavior-similar apps. Apps are considered behavior-similar if they satisfy the following two conditions: (A) their main functionality is similar, for example, all for mailing; (B) their network behavior profiles are similar. If either of the two conditions is not met, the peer group will be empty and we will only compare apps' profiles with their historical data.


[Figure 5 diagram: the profile of App1, <12, 100, 5000, ...>, is compared with its historical data (<11, 110, 5020, ...>, <10, 130, 4600, ...>, <12, 100, 5000, ...>, <12, 101, 5010, ...>) and with its peer group (App2 <20, 400, 6000, ...>, App3 <30, 500, 7000, ...>, App4 <25, 450, 6500, ...>, App5 <19, 370, 5580, ...>). The profile is similar to the history, so App1 has not been injected with malicious code. The profile is significantly different from its peer group (behavior-similar apps), so App1 may be a fake app. The peer group of App1 is updated after detection.]

Figure 5: Illustration of the procedure of malicious app detection. The numbers in angle brackets are examples of features listed in Table 4.

Figure 6: Example of similar apps recommended by Google Play.

In practice, one can resort to app stores such as Google Play to find candidates that meet condition A. Figure 6 shows that, when viewing the details of an app, Google Play will recommend similar ones. These similar apps are determined by several features, such as the category of the apps, keywords in the title and description, and the size of the apk (https://www.quora.com/How-does-the-Google-Play-Store-determine-similar-apps). Generally, these recommended apps belong to the same category and have similar functionality. Therefore, in order to determine the peer group of an app, we first use the similar apps recommended by app stores such as Google Play as candidates and later filter them by condition B.

For condition B, the Euclidean distance is used to measure the similarity of network behavior profiles. Suppose $A$ is the app to be analyzed and $B$ is an app satisfying condition A for $A$. The similarity of network behavior profiles between $A$ and $B$ is calculated as

$$d(A, B) = \sqrt{\left(nBP(A) - nBP(B)\right)\left(nBP(A) - nBP(B)\right)^{T}} \tag{8}$$

where $nBP$ is the abbreviation of $netBehaviorProfile$ defined in (1).

Once the similarities between $A$ and the recommended apps are calculated by (8), the results are sorted in order of increasing distance (i.e., decreasing similarity). The first $g$ apps will be selected as the peer group members of $A$, as shown in (9). The optimal value of $g$ can be determined by experiments, which is discussed in the next section. Note that the peer group members can be updated. For example, if one of the peer group members has been flagged as malicious, it will be removed from the peer group and a new one will be added. AppFA can also reselect the peer groups every $T$ time interval.

$$\text{peerGroup}(A) := \{\text{app}_1, \ldots, \text{app}_g\}, \quad \text{where } d(A, \text{app}_i) \le d(A, \text{app}_j) \text{ if } i < j \tag{9}$$
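The selection in (8) and (9) amounts to ranking the candidate apps by Euclidean distance and keeping the $g$ nearest. A minimal sketch follows; the helper names `profile_distance` and `select_peer_group` are hypothetical, and the three-element profiles simply echo the illustrative numbers shown in Figure 5 rather than real traffic features.

```python
import math

def profile_distance(p, q):
    """Euclidean distance between two equal-length profiles, as in (8)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def select_peer_group(target, candidates, g):
    """Return the names of the g candidates closest to target, nearest first (9).

    candidates maps an app name to its network behavior profile and is assumed
    to contain only apps that already satisfy condition A.
    """
    ranked = sorted(candidates, key=lambda name: profile_distance(target, candidates[name]))
    return ranked[:g]

if __name__ == "__main__":
    app1 = [12, 100, 5000]
    candidates = {
        "App2": [20, 400, 6000],
        "App3": [30, 500, 7000],
        "App4": [25, 450, 6500],
        "App5": [19, 370, 5580],
    }
    print(select_peer_group(app1, candidates, g=2))  # -> ['App5', 'App2']
```

Removing a member that has been flagged as malicious and adding a new one then reduces to deleting its entry from `candidates` and calling the function again.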

As illustrated in Figures 2 and 5, for an identified app, AppFA first constructs its network behavior profile and then compares the profile with both its historical profiles and the profiles of its peer group for malicious app detection. Denote $o$ as the profile of the identified app $A$ and $G$ as the matrix of the compared profiles (historical profiles or peer group profiles). Each column of $G$ is one profile and the column length is $l$ ($l = (1 + 2 * (f_o + f_i) + 2 * (k_1 + k_2))$). AppFA makes the feature vectors the same length by padding zeros.
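As a small illustration of the profile length formula and of the zero padding mentioned above, the following sketch builds $G$ with one profile per column. The function names are hypothetical and the input numbers are made up for the example.

```python
def profile_length(f_o, f_i, k1, k2):
    """Length l of a network behavior profile: 1 + 2*(f_o + f_i) + 2*(k1 + k2)."""
    return 1 + 2 * (f_o + f_i) + 2 * (k1 + k2)

def pad_profiles(profiles):
    """Zero-pad profiles to a common length and return G as a list of rows,
    so that column j of G holds profile j."""
    length = max(len(p) for p in profiles)
    padded = [list(p) + [0.0] * (length - len(p)) for p in profiles]
    return [[p[i] for p in padded] for i in range(length)]

if __name__ == "__main__":
    print(profile_length(f_o=3, f_i=2, k1=4, k2=4))   # -> 27
    print(pad_profiles([[1, 2, 3], [4, 5]]))          # -> [[1, 4], [2, 5], [3, 0.0]]
```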

For peer group analysis, firstly, $o$ and $G$ are normalized by the min-max normalization procedure defined in

$$o_i = \frac{o_i}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \qquad G_{ij} = \frac{G_{ij}}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \qquad i = 1, 2, \ldots, l, \; j = 1, 2, \ldots, g \tag{10}$$

Then the distance between $o$ and $G$ is calculated by the Mahalanobis distance

$$\tau = \sqrt{(o - m)^{T} S^{-1} (o - m)} \tag{11}$$

where $m$ is the weighted mean vector of profiles,

$$m_{i1} = \sum_{j=1}^{g} w_{Aj} G_{ij}, \quad 1 \le i \le l \tag{12}$$

and $S$ is the covariance matrix, which is calculated in

$$S = E\left[(G - E[G])(G - E[G])^{T}\right] \tag{13}$$

In (12), $w_{Aj}$ is the weight of the $j$th closest peer group member of the analyzed app $A$. The weights of peer group members are obtained from their proximity to $A$. In detail, we define the proximity of the $j$th closest peer group member of the target app $A$ in

$$\text{prox}_{Aj} = \exp(-d(A, j)) \tag{14}$$

where $d(A, j)$ is the Euclidean distance between app $A$ and its $j$th closest peer group member, defined in (8). Based on the proximity measure defined above, the weight of the $j$th closest peer group member of app $A$ is defined in

$$w_{Aj} = \frac{\text{prox}_{Aj}}{\sum_{j=1}^{g} \text{prox}_{Aj}} \tag{15}$$

Finally, the state of apps is judged as follows:

$$\tau > t_s: \text{malicious app}; \qquad \tau \le t_s: \text{normal app} \tag{16}$$

where $t_s$ is the threshold, which can be set to different values by network administrators based on the actual network conditions. We evaluated different values of $t_s$ in our experiments.
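Steps (10) to (16) can be put together as in the sketch below. It is a hedged illustration rather than the authors' implementation, and it makes several assumptions that the text does not state: features are non-negative, the sample covariance (`np.cov`) stands in for the expectation in (13), a small ridge term is added because $S$ is singular whenever the profile length exceeds the number of compared profiles, and the proximities are shifted by the smallest distance before exponentiation, which leaves the weights in (15) unchanged but avoids underflow. The names `peer_group_tau` and `judge` are hypothetical.

```python
import numpy as np

def peer_group_tau(o, G, dists, ridge=1e-6):
    """Mahalanobis distance tau of (11) between profile o and profiles G.

    o: profile of the analyzed app, shape (l,)
    G: compared profiles, shape (l, g), one profile per column
    dists: Euclidean distances d(A, j) from (8), one per column of G
    ridge: regularization added to S (an assumption, not from the paper)
    """
    o = np.asarray(o, dtype=float)
    G = np.asarray(G, dtype=float)
    d = np.asarray(dists, dtype=float)
    scale = np.maximum(o, G.max(axis=1))       # row-wise maximum, as in (10)
    scale[scale == 0] = 1.0                    # avoid dividing by zero
    o_n, G_n = o / scale, G / scale[:, None]
    prox = np.exp(-(d - d.min()))              # (14), shifted for stability
    w = prox / prox.sum()                      # (15)
    m = G_n @ w                                # weighted mean vector, (12)
    S = np.cov(G_n) + ridge * np.eye(len(o))   # (13) plus ridge
    diff = o_n - m
    tau_sq = diff @ np.linalg.solve(S, diff)   # (11) without forming S^-1
    return float(np.sqrt(max(tau_sq, 0.0)))

def judge(tau, t_s):
    """Decision rule (16)."""
    return "malicious" if tau > t_s else "normal"

if __name__ == "__main__":
    # Columns are App2..App5 from Figure 5; the distances are made-up values.
    G = np.array([[20, 30, 25, 19], [400, 500, 450, 370], [6000, 7000, 6500, 5580]])
    tau = peer_group_tau([12, 100, 5000], G, dists=[1.0, 2.0, 1.5, 0.5])
    print(judge(tau, t_s=2.0))
```

Solving the linear system instead of inverting $S$ explicitly is a numerical convenience; the value of $\tau$ is the same.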

[Figure 7 diagram: the testing mobile devices (HUAWEI and MI phones) running the samples (apps) reach the Internet through a server (Ubuntu 16.04) whose functions are (i) access point, (ii) traffic collection, and (iii) data analysis.]

Figure 7: Experimental setup for the detection of mobile malicious apps.

6. Evaluation

6.1. Data Collection

Experimental Setup. The experimental setup used for AppFA is shown in Figure 7. We have implemented a prototype of AppFA with the help of ourmon [39] and CCCG [40]. Ourmon is an open-source network monitoring and anomaly detection system, and CCCG is a general framework for constrained clustering. An Ubuntu 16.04 computer has been configured as the access point. For packet capturing, the smartphone is connected to the Internet by Wi-Fi and network traffic is collected at the access point by tcpdump (http://www.tcpdump.org/). Each pcap file is limited to 100 MB. When packet capture is complete, the TCP and UDP flows are split from the pcap files by SplitCap (https://www.netresec.com/?page=SplitCap). For apps' traffic clustering and network behavior profile construction, we use tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to extract IP addresses, packet sizes, and packet interval times from network flows. After that, a Python program is written to calculate the statistical features listed in Tables 3 and 4. The KPCA transformation is accomplished with the help of scikit-learn (http://scikit-learn.org/). The data analysis is also completed on the Ubuntu 16.04 computer (with 4 GB memory and a Pentium Dual-Core CPU T4500).

Traffic Generation. We use the public MalGenome [19] dataset in our experiments. Since we mainly focus on repackaging and updating malware, 93 typical information-collection malware samples, including repackaging and updating attack types, are selected from MalGenome to test the detection rate of AppFA (the malicious apps dataset was downloaded from http://www.malgenomeproject.org/ in 2015; the selected samples are the ones that run without any errors). We have installed all these malicious apps on HUAWEI Honor 8 and MI Note 4 phones and run these apps one by one 50 times to collect network traffic. In detail, these malware samples are analyzed and run by GroddDroid [41] to make sure that malicious code will be triggered. We also use GroddDroid to run other malware besides the selected samples, and their traffic will be mainly used in the local detection as described in Section 6.3. Particularly, background traffic, including weather forecasting, email checking, QQ, and WeChat (QQ and WeChat are the most famous apps in China), is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

For testing the false positive rates of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware samples, the benign apps are run by droidbot [42] to make sure that enough traffic is generated. With the captured traffic, we use all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the Packet filter model is set to $-1$.

The data used in the experiments is summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware samples; thus it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is exploited to get the ground truth of the originator of network traffic. Recall that in Algorithm 1 app traffic clustering must be started with signature seeds. In order to obtain apps' signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of cluster number $p$ in Algorithm 2 is set to 97 (97 = 93 + 4), since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and WeChat. The experimental results are shown in Table 6.

As shown in Table 6, when 5 HTTP flows are chosen for each app to generate signature seeds, only 601 flows (out of 223220 flows in total, about 0.27%) are misclustered. This demonstrates the effectiveness of our proposed method for app traffic clustering.
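The share of misclustered flows for each row of Table 6 can be checked with a few lines of arithmetic. The sketch below uses the counts reported in Table 6; the helper name `misclustered_share` is hypothetical.

```python
# Rows of Table 6: seeds per app -> (flows correctly clustered, flows misclustered)
TABLE_6 = {
    2: (201357, 21863),
    3: (218946, 4274),
    4: (221384, 1836),
    5: (222619, 601),
}

def misclustered_share(correct, wrong):
    """Fraction of flows assigned to the wrong app."""
    return wrong / (correct + wrong)

if __name__ == "__main__":
    for seeds, (correct, wrong) in TABLE_6.items():
        share = 100 * misclustered_share(correct, wrong)
        print(f"{seeds} seed flows per app: {share:.2f}% misclustered")
```

Every row sums to the 223220 flows of TrafficSet 1, and the share drops from roughly 9.8% with 2 seed flows to roughly 0.27% with 5.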

Experimental Results of Malicious App Detection. Similar to previous work, in order to measure the effectiveness of malicious app detection, the accuracy (detection rate, i.e., the true positive rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively.

$$\text{Accuracy} = \frac{TP}{TP + FN} \tag{17}$$

$$\text{False positive rate} = \frac{FP}{FP + TN} \tag{18}$$
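As a quick worked example of (17) and (18), the sketch below evaluates both metrics. The first call uses the case reported later for $p = 99$ (82 of the 93 selected malicious apps detected); the false positive counts are purely hypothetical and only show how (18) is applied.

```python
def detection_rate(tp, fn):
    """Accuracy (detection rate) as defined in (17)."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """False positive rate as defined in (18)."""
    return fp / (fp + tn)

if __name__ == "__main__":
    print(round(detection_rate(tp=82, fn=11), 2))        # -> 0.88
    # Hypothetical counts: 4 benign runs flagged out of 1000.
    print(round(false_positive_rate(fp=4, tn=996), 3))   # -> 0.004
```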

We first examine the detection rates and false positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). For the selected malicious apps, the apps with the same functionality (satisfying condition A) are determined by Google Play. We first analyze the malicious apps' functionalities manually and choose a keyword for each app. Then we search the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8 and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are good enough for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). [Plot: detection rate (0.5 to 1) versus $t_s$ (0.4 to 2), with curves for $k_1 = k_2 = 1$, 4, and 7.]

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, when $g$ increases, the detection rates and the false positive rates change only slightly. This can be explained by the fact that the Mahalanobis distance $\tau$ defined in (11) is calculated from the weighted mean vector of profiles. The most similar app has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.
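The argument that additional peers contribute little can be made concrete with the weights of (14) and (15). In the sketch below the distances are made-up values chosen only to show the trend, and `peer_weights` is a hypothetical helper name.

```python
import math

def peer_weights(distances):
    """Weights w_Aj from (14) and (15): normalized exp(-d(A, j))."""
    prox = [math.exp(-d) for d in distances]
    total = sum(prox)
    return [p / total for p in prox]

if __name__ == "__main__":
    five = peer_weights([0.5, 1.0, 1.5, 2.0, 2.5])
    six = peer_weights([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
    print([round(w, 3) for w in five])
    print([round(w, 3) for w in six])
```

Because the proximity decays exponentially with distance, the sixth, farther peer receives only a few percent of the total weight in this example, so the weighted mean in (12) barely moves.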

Further, AppFA with different values of the cluster number $p$ is also tested. Note that the actual cluster number is 97 for TrafficSet 1. In order to evaluate how app traffic clustering impacts the detection of repackaged malware, we slightly change the value of $p$, and the experimental results are shown in Figure 12. In the experiments, we set $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. As shown in Figure 12, the cluster number indeed impacts the detection rate of AppFA. Particularly, when $p = 99$, the detection rate is only 88%; namely, only 82 malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and 96 in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA can cope with a few unknown apps when carrying out app traffic clustering.

Table 5: Data summary.

| Traffic set | Apps | Quantity | Number of collected flows (TCP and UDP) | Amount of data (bytes) |
|---|---|---|---|---|
| TrafficSet 1 | Selected malicious apps (for testing accuracy of traffic clustering and detection rates) | 93 | 223220 | 5.38 G |
| TrafficSet 2 | Benign apps (for testing false positive rates of AppFA) | 100 | 305103 | 6.40 G |
| TrafficSet 3 | All malicious apps (for malicious apps detection in local networks) | 1260 | ≈3.1 million | 18.72 G |

Table 6: Experimental results of traffic clustering.

| Number of HTTP flows for generating signature seeds | Number of flows correctly clustered | Number of misclustered flows |
|---|---|---|
| 2 | 201357 | 21863 |
| 3 | 218946 | 4274 |
| 4 | 221384 | 1836 |
| 5 | 222619 | 601 |

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). [Plot: false positive rate (0.0005 to 0.004) versus $t_s$ (0.4 to 2), with curves for $k_1 = k_2 = 1$, 4, and 7.]

Finally, we evaluate AppFA with the remaining malicious apps besides the selected 93 samples. The rest of the malicious apps contain repackaged and other types of malware. These samples are found to have some errors when run on our phones. However, they also produced considerable network traffic. We test AppFA with these apps to examine the general capability of AppFA. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons will be discussed in Section 6.4.

Figure 10: Detection rates of AppFA with different values of $g$. [Plot: detection rate (0 to 1) for $g$ = 1, 3, 5, 7, and 10.]

Figure 11: False positive rates of AppFA with different values of $g$. [Plot: false positive rate (0 to 7, in units of $10^{-3}$) for $g$ = 1, 3, 5, 7, and 10.]

Note that AppFA is designed to perform nearly real-time malicious app detection; thus efficiency is also a big concern. Table 8 presents the computational performance of the major steps in terms of average running time on the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because the signature matching and constrained flow clustering are carried out on each flow. Parallel computing, such as cloud resources, can be further used to speed up malicious app detection and scale up to more Internet traffic data.

Figure 12: Detection rates of AppFA with different values of $p$. [Plot: detection rate (0 to 1) for $p$ = 99, 98, 96, and 95.]

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

| Detection rate | False positive rate |
|---|---|
| 73.4% | 3.5% |

Table 8: The computational performance.

| Procedures | Time |
|---|---|
| Constrained flow clustering (Algorithm 1) | 3 min |
| App identification (Algorithm 2) | 5.2 min |
| Network behavior profile construction | 1.7 min |
| Peer group analysis | 45 sec |

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses traffic statistical features to detect malicious Android apps. Thus, AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we have created a new Android repackaged malware sample that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot's source code and make sure its network connections will be encrypted by the TLS protocol. Then we graft it into a popular game app, air.com.aceviral.motox3. Finally, we collect network traffic of the repackaged air.com.aceviral.motox3 and its peer group. In the experiments, the selected peer group of the repackaged air.com.aceviral.motox3 is com.wordmobiles.bikeRacing, com.topfreegames.bikeracefreeworld, com.skgames.trafficrider, com.tomico.wheeliechallenge, and madmotoracingbike. We repeat the detection of the repackaged air.com.aceviral.motox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

Table 9: Experimental results when searching similar apps locally.

| Detection rate | False positive rate |
|---|---|
| 69.2% | 4.7% |

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps' peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed and is shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities and their context (keywords) may be similar. Therefore, we use web-page searching technologies to match similar apps. The $Matching$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) from the results returned by Algorithm 3.
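One possible realization of Algorithm 3 with a TF-IDF based $Matching$ function is sketched below. The paper does not specify the TF-IDF variant, so the smoothed inverse document frequency and the cosine similarity used here are assumptions, and the names `tfidf_vectors`, `cosine`, and `similar_apps` as well as the keyword lists are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs maps a name to a list of keywords; returns name -> {term: weight}."""
    n = len(docs)
    df = Counter(term for words in docs.values() for term in set(words))
    vectors = {}
    for name, words in docs.items():
        tf = Counter(words)
        size = max(len(words), 1)
        # Smoothed IDF (one common variant, assumed here).
        vectors[name] = {t: (c / size) * (math.log((1 + n) / (1 + df[t])) + 1.0)
                         for t, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similar_apps(target_keywords, identified, m):
    """Return the m identified apps whose keywords best match the tested app."""
    docs = dict(identified)
    docs["__target__"] = target_keywords       # reserved key for the tested app
    vectors = tfidf_vectors(docs)
    target_vec = vectors.pop("__target__")
    scores = sorted(((cosine(target_vec, vec), name) for name, vec in vectors.items()),
                    reverse=True)              # decreasing score, as in line (8)
    return [name for _, name in scores[:m]]

if __name__ == "__main__":
    identified = {
        "mailA": ["mail", "smtp", "imap"],
        "gameB": ["game", "score", "level"],
        "mailC": ["mail", "calendar"],
    }
    print(similar_apps(["mail", "inbox", "smtp"], identified, m=2))  # -> ['mailA', 'mailC']
```

The returned candidates would then be filtered by condition B, that is, ranked by the profile distance of (8).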

We have implemented the local similar app searching and evaluated AppFA's performance in local networks. All apps, including all malicious and benign apps, are considered as identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that the choice of proper peer groups significantly affects the final detection.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be due to the fact that other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to the packet sizes and the number of flows. So the detection rate may decline if the apps' network behaviors are only slightly changed.

Compared to the latest work done in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9% as reported in [6]). However, Garg et al.'s method [6] considered only 18 malware apps and 14 genuine apps and assumed that the traffic generated by apps was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account and constrained clustering is exploited to determine to which app the flows belong. The accuracy of clustering is also a factor affecting the detection rate. To confirm this, we further make the same assumption as Garg et al.'s method [6]; namely, we assume the accuracy of Algorithm 1 is 100% and know exactly which traffic is generated by which app (in our experiments, we use the app Packet Capture to obtain perfect ground truth of which flows came from which app). Under this assumption, the experimental results are given in Table 10. Apparently, the detection rates are much improved.

(1) extract all plaintext keywords from HTTP headers of the tested app $A$. Denote the set of keywords as $W(A)$
(2) extract all plaintext keywords from HTTP headers of other identified apps. Denote the set of keywords of identified app $i$ as $W(i)$, $i = 0, 1, \ldots, n - 1$
(3) $i = 0$; while $i < n$ do
(4)     $Scores[i].value = Matching(W(A), W(i))$
(5)     $Scores[i].app = i$
(6)     $i$++
(7) end while
(8) sort $Scores[]$ by $Scores[i].value$ in decreasing order
(9) return the first $m$ $Scores[i].app$, $i = 0, 1, \ldots, m - 1$

Algorithm 3: Searching similar apps locally.

Table 10: Experimental results when it is exactly known which traffic is generated by which app. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

| | Selected malicious apps | Remaining malicious apps |
|---|---|---|
| Detection rate | 98.3% | 85% |
| False positive rate | 0.03% | 2.7% |

In practice, one can deploy both our method and other methods such as [6, 15] simultaneously. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps. In fact, [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. So our work can be a complement to existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features (refer to Table 4) of network traffic to detect Android malicious repackaged applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. But this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus the attackers must remove some normal network connections to keep the network behaviors the same and evade detection. However, the removal of normal network connections will impact the functionalities of apps and may cause errors. This gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may be abnormal as well. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then Kernel Principal Component Analysis is employed to construct the app network behavior profile and distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps and avoid time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus, it is convenient to use and its cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The apps' network behaviors may be significantly changed by a version update and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61702282 and 61502250, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016, CX2016026).


References

[1] Android statistics, Number of android applications, September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, "Assessing Privacy Risks in Android: A User-Centric Approach," in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] Does your mobile anti-virus app protect or infect you? The truth behind DU Antivirus Security, September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security/.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, "A survey of mobile malware in the wild," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, "The evolution of android malware and android analysis techniques," ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, "Network-based detection of Android malicious apps," International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, "Mobile malware detection through analysis of deviations in application network behavior," Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, "PBMDS: a behavior-based malware detection system for cellphone devices," in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec '10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: behavior-based malware detection system for android," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, "MADAM: a multi-level anomaly detector for android malware," in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., "Vetting undesirable behaviors in android apps with permission use analysis," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, "Android Malware Clustering Analysis on Network-Level Behavior," in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, "Malware detection on mobile devices using distributed machine learning," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, "Bring your own device security issues and challenges," in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, "Simple and effective method for detecting abnormal internet behaviors of mobile devices," Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] Global mobile OS market share in sales to end users from 1st quarter 2009 to 1st quarter 2017, 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems/.

[17] "IT threat evolution Q3 2017. Statistics," November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131/.

[18] L. Li, D. Li, T. F. Bissyande et al., "Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting," IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, "Dissecting android malware: characterization and evolution," in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, "AndroSimilar: Robust signature for detecting variants of Android malware," Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the AnserverBot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., "Android security: a survey of issues, malware penetration, and defenses," IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, "A survey on dynamic mobile malware detection," Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, "Behavior-based malware detection on mobile phone," in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, "Exposing mobile malware from the inside (or what is your mobile app really doing?)," Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, "The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones," in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, "Identifying diverse usage behaviors of smartphone apps," in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC '11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, "NetworkProfiler: Towards automatic fingerprinting of Android apps," in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., "Flowr: A self-learning system for classifying mobile application traffic," in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., "Automatic generation of mobile app signatures from traffic observations," in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J Sun L She H Chen et al ldquoAutomatically identifying appsin mobile trafficrdquo Concurrency Computation vol 28 no 14 pp3927ndash3941 2016

[32] G He B Xu and H Zhu ldquoIdentifying Mobile Applications forEncrypted Network Trafficrdquo in Proceedings of the 5th Interna-tional Conference on Advanced Cloud and Big Data CBD 2017pp 279ndash284 China August 2017

[33] G He M Yang X Gu J Luo and Y Ma ldquoA novel activewebsite fingerprinting attack against Tor anonymous systemrdquoin Proceedings of the 2014 18th IEEE International Conference onComputer Supported CooperativeWork inDesign CSCWD2014pp 112ndash117 Taiwan May 2014

[34] G He M Yang J Luo and X Gu ldquoA novel application classi-fication attack against Torrdquo Concurrency and ComputationPractice and Experience vol 27 no 18 pp 5640ndash5661 2015

[35] B Miller L Huang A D Joseph and J D Tygar ldquoI know whyyou went to the clinic Risks and realization of HTTPS trafficanalysisrdquo Lecture Notes in Computer Science (including subseriesLecture Notes in Artificial Intelligence and Lecture Notes in Bio-informatics) Preface vol 8555 pp 143ndash163 2014

[36] Y Wang Y Xiang J Zhang W Zhou G Wei and L T YangldquoInternet traffic classification using constrained clusteringrdquoIEEE Transactions on Parallel and Distributed Systems vol 25no 11 pp 2932ndash2943 2014

[37] L J Cao K S Chua W K Chong H P Lee and Q M Gu ldquoAcomparison of PCA KPCA and ICA for dimensionality reduc-tion in support vector machinerdquo Neurocomputing vol 55 no1-2 pp 321ndash336 2003

[38] Y Kim and S Y Sohn ldquoStock fraud detection using peer groupanalysisrdquo Expert Systems with Applications vol 39 no 10 pp8986ndash8992 2012

[39] ldquoOurmon network monitoring and anomaly detection systemrdquoApril 2013 httpourmonsourceforgenet

[40] B Babaki T Guns and S Nijssen ldquoConstrained ClusteringUsing Column Generationrdquo in Integration of AI and OR Tech-niques in Constraint Programming vol 8451 of Lecture Notes inComputer Science pp 438ndash454 Springer International Publish-ing Cham 2014

[41] A Abraham R Andriatsimandefitra A Brunelat J-F Lalandeand V Viet Triem Tong ldquoGroddDroid A gorilla for triggeringmalicious behaviorsrdquo in Proceedings of the 10th InternationalConference on Malicious and Unwanted Software MALWARE2015 pp 119ndash127 USA October 2015

[42] Y Li Z Yang Y Guo and X Chen ldquoDroidBot A lightweightUI-guided test input generator for androidrdquo inProceedings of the39th IEEEACM International Conference on Software Engineer-ing Companion ICSE-C 2017 pp 23ndash26 Argentina May 2017

[43] A Aizawa ldquoAn information-theoretic perspective of tf-idfmeasuresrdquo Information Processing amp Management vol 39 no1 pp 45ndash65 2003


4 Security and Communication Networks

Figure 2: Main components of AppFA. [Diagram: Packet Filter, Session Builder (flow sessions and app sessions), Packet Content Extractor, Basic Feature Extractor, Identification Feature Generator, App Traffic Clustering, Profile Feature Generator, App Behavior Profile Constructor, and Malicious App Detection.]

while their detection features were obtained on the mobile device, and they needed to train classification models offline. In AppFA, we extract detection features directly on the network to build apps' network behavior profiles and use peer group analysis to avoid offline model training.

Recently, an increasing number of research works have analyzed network traffic to identify apps, such as [27–32]. However, most of them focused on plaintext flows (e.g., HTTP) and tried to collect identification features from HTTP headers. These methods may fail due to the emergence of encrypted network traffic. In this paper, we investigate a new constrained clustering method to cluster the network traffic generated by the same app.

4. System Design

The system architecture of AppFA is shown in Figure 2. It is of modular design, and each module has a specific function. The Packet Filter module captures link-layer frames or reads them from a file and filters them according to configurable rules. In order to detect malicious apps online and provide support for network management, AppFA is designed to analyze the first $n$ nonzero packets (packets that contain application data) of each network flow. The parameter $n$ is configurable for different network management purposes. For example, the value of $n$ can be set to a small positive number such as 50 for real-time malicious app detection, and to $-1$ for full analysis, namely, considering all nonzero packets in flows. In AppFA, the Packet Filter module can be implemented on top of a well-known library such as libpcap (http://www.tcpdump.org/release).

The Session Builder module organizes network traffic into sessions. For app identification and malicious behavior detection, we define two types of sessions: flow session and app session. The flow session is defined by the {source IP, source port, destination IP, destination port, transport protocol} tuple, where source and destination can be swapped, and the transport protocol is mainly TCP or UDP in this work. In deployment, AppFA considers a flow session complete when one of the following three conditions holds: (1) $n$ nonzero packets have been received; (2) a RST or FIN packet is detected; (3) a timeout occurs, for example, no packet has been exchanged for 3 minutes. With flow sessions, one can efficiently extract basic features and packet contents for app identification and malicious behavior detection. The app session is defined as the set of all flow sessions generated by the same app: {flow session 1, flow session 2, ..., flow session $m$}. The app sessions constructed by the Session Builder module are used by the Profile Feature Generator module to construct app network behavior profiles, as depicted on the right of Figure 2.
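The flow session definition above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the AppFA implementation: the names `flow_key` and `FlowSession` are hypothetical, the default of 180 seconds mirrors the 3-minute example, and the IP addresses in the usage note are documentation addresses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Return the same key for both directions of a flow (endpoints are sorted)."""
    endpoints = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (endpoints[0], endpoints[1], proto)


@dataclass
class FlowSession:
    n: int = 50                  # nonzero-packet budget; -1 means "all packets"
    idle_timeout: float = 180.0  # seconds without traffic before giving up
    payload_sizes: List[int] = field(default_factory=list)
    last_seen: Optional[float] = None
    saw_rst_or_fin: bool = False

    def add_packet(self, ts, payload_len, rst_or_fin=False):
        """Record one packet; only packets carrying application data are counted."""
        self.last_seen = ts
        if payload_len > 0:
            self.payload_sizes.append(payload_len)
        self.saw_rst_or_fin = self.saw_rst_or_fin or rst_or_fin

    def is_complete(self, now):
        """Apply the three completion conditions: budget, RST/FIN, timeout."""
        if self.n > 0 and len(self.payload_sizes) >= self.n:
            return True
        if self.saw_rst_or_fin:
            return True
        return self.last_seen is not None and now - self.last_seen >= self.idle_timeout
```

For example, `flow_key("192.0.2.1", 40000, "198.51.100.7", 443, "TCP")` and the same call with the two endpoints swapped yield one key, so packets of both directions land in the same session.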

The Basic Feature Extractor module extracts basic packet features from flow sessions, which include packet size, packet interarrival time, packet order, and packet direction. The packet direction feature distinguishes outgoing from incoming packets. Other advanced features commonly used in traffic analysis [33–35], such as flow duration, total packets sent and received, and burst sizes, can be calculated from these basic features. So we extract these basic features first and then generate appropriate advanced features (identification or detection features) for different traffic analysis purposes (clustering or peer group comparison). After basic feature extraction, a flow may look like {+50, *30, +100, *500, −1300, *20, −400}, where outgoing and incoming packet sizes are denoted by positive and negative signs, and the packet interarrival times are marked by asterisks. The packet order is reflected inherently by the sequence. For example, for the flow {+50, *30, +100, *500, −1300, *20, −400}, the first packet is an outgoing packet whose size is 50 bytes; the second packet is also an outgoing packet, and the time interval between the first and second packets is 30 ms. The third packet is an incoming packet whose size is 1300 bytes. The time interval between the second and the third packet is 500 ms, and so forth.

Figure 3: Example of key-value pairs.
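To make the encoding concrete, the following sketch parses such a sequence and derives a few of the advanced features mentioned above. It assumes a whitespace-separated string with ASCII signs; the function names `parse_flow` and `summarize` are ours and purely illustrative.

```python
def parse_flow(encoded):
    """Turn '+50 *30 +100 ...' into a list of (direction, size, gap_ms) tuples.

    gap_ms is the interarrival time preceding the packet (0 for the first one).
    """
    packets = []
    gap = 0
    for token in encoded.split():
        if token.startswith("*"):      # interarrival time in milliseconds
            gap = int(token[1:])
        else:                          # signed packet size in bytes
            size = int(token)
            packets.append(("out" if size > 0 else "in", abs(size), gap))
            gap = 0
    return packets


def summarize(packets):
    """Derive simple advanced features from the basic per-packet features."""
    return {
        "packets_out": sum(1 for d, _, _ in packets if d == "out"),
        "packets_in": sum(1 for d, _, _ in packets if d == "in"),
        "bytes_out": sum(s for d, s, _ in packets if d == "out"),
        "bytes_in": sum(s for d, s, _ in packets if d == "in"),
        "duration_ms": sum(g for _, _, g in packets),
    }


packets = parse_flow("+50 *30 +100 *500 -1300 *20 -400")
print(packets)
print(summarize(packets))
```

For the example flow, this yields two outgoing packets (150 bytes), two incoming packets (1700 bytes), and a duration of 550 ms.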

Note that the packet contents are also saved by the Packet Content Extractor module for app identification. Similar to [30], we focus on key-value pairs in HTTP headers. In detail, justniffer (http://justniffer.sourceforge.net) is first used to transform the raw packet traffic into HTTP messages. Then the HTTP messages are tokenized by several delimiters, such as space, "\r\n", and "&", and each HTTP request is broken into various parts, including method, page, and query. Finally, queries are divided into key-value pairs. Figure 3 is an example of key-value pairs used in our experiments.
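As a rough illustration of this tokenization step, the sketch below splits one HTTP request into method, page, and query key-value pairs using only the Python standard library. It is not the justniffer-based pipeline used in AppFA; the function name `request_key_values` and the sample request (host, path, and parameters) are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def request_key_values(http_request):
    """Split a raw HTTP request into (method, page, [(key, value), ...])."""
    request_line = http_request.split("\r\n", 1)[0]        # e.g. "GET /p?a=1 HTTP/1.1"
    method, target, _version = request_line.split(" ", 2)
    parts = urlsplit(target)                                # separates page and query
    pairs = parse_qsl(parts.query, keep_blank_values=True)  # splits on "&" and "="
    return method, parts.path, pairs


raw = ("GET /ads/fetch?app_id=com.example.game&sdk=4.2 HTTP/1.1\r\n"
       "Host: ads.example.com\r\n\r\n")
print(request_key_values(raw))
```

Pairs such as `("app_id", "com.example.game")` are the kind of candidates that could serve as plaintext signatures for an app.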

After obtaining the basic features from flows, we generate identification features in the Identification Feature Generator module. After that, we identify the app for each flow through the App Traffic Clustering module, which returns the clustering results to the Session Builder module for forming app sessions. With the app session information, detection features are created in the Profile Feature Generator module. Finally, app network behavior profiles are constructed in the App Behavior Profile Constructor module, and malicious apps are detected in the Malicious App Detection module. The independent treatment of different functional modules makes the AppFA architecture clear and scalable.

5. Methodology

As depicted in Figure 2, app traffic clustering and malicious app detection (including identification/detection feature generation) are the core components of AppFA, and their technical details are described in the following subsections.

5.1. App Traffic Clustering. The basic idea of our app traffic clustering method is illustrated in Figure 4. In the figure, there are two apps, and their flow sessions are represented by solid and dotted lines, respectively. For each flow, signature matching is first used to identify HTTP flows that have recognized signatures. Note that, in this paper, we use the term signature for plaintext matching and feature for traffic analysis. With this step, two flows are identified (labelled red and green), as shown in the left dashed box of Figure 4. After signature matching, a constrained clustering algorithm (the second dashed box in Figure 4) is used to cluster all flows. Compared with ordinary clustering algorithms, a constrained clustering algorithm uses background information (i.e., identified flows) to improve clustering accuracy. Through constrained clustering, flows that cannot be identified by signatures, such as encrypted traffic, are assigned to appropriate clusters, that is, apps. Finally, we obtain app sessions that are used to create apps' network behavior profiles, as depicted in the last step in Figure 4.

Formally, the entire app identification procedure is described in Algorithm 1. In the while loop, the clustering signal can be a timeout for cyclic identification (e.g., every 1 hour) or the completion of an app session for real-time identification. For details on selecting the initial signature seeds $S_d$, one can refer to [30].

Algorithm 1 uses a method similar to the FLOWR system [30] to carry out signature matching (line (6)) and takes advantage of the constrained $K$-means clustering algorithm [36] for flow clustering (line (7)). For signature matching, the key-value pairs in HTTP headers are considered as apps' signatures, and an initial set of seeding app signatures is set up to bootstrap the learning of new ones. Compared with FLOWR, we refine the process of counting cooccurrences of app signatures using the constrained clustering results. In FLOWR, if the start time of $flow_2$ is less than $T$ seconds after the start time of $flow_1$, their signatures are considered a cooccurrence instance. However, as noted in [30], if $T$ is overestimated, FLOWR is more likely to mix flows from different apps, thus inducing noise and overutilizing system resources. To overcome this problem, AppFA counts cooccurrences of app signatures using the clustering results in addition to temporal information. That is, only flows that start within $T$ seconds of each other and fall into the same cluster are counted as a cooccurrence. This reduces noise, since the flows are filtered by constrained clustering.
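The refined cooccurrence rule can be sketched as follows. This is an illustrative reading of the rule stated above, not code from FLOWR or AppFA; the flow representation (dictionaries with `start`, `cluster`, and `signatures` fields) and the function name `count_cooccurrences` are assumptions made for the example.

```python
from collections import Counter


def count_cooccurrences(flows, T):
    """Count signature pairs seen in flows that start within T seconds of each
    other AND were placed in the same cluster by constrained clustering."""
    counts = Counter()
    ordered = sorted(flows, key=lambda f: f["start"])
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] - first["start"] >= T:
                break                      # later flows are even further away
            if first["cluster"] != second["cluster"]:
                continue                   # temporal match only: skipped in AppFA
            for s1 in first["signatures"]:
                for s2 in second["signatures"]:
                    if s1 != s2:
                        counts[frozenset((s1, s2))] += 1
    return counts


flows = [
    {"start": 0.0, "cluster": 1, "signatures": {"appid=a1"}},
    {"start": 5.0, "cluster": 1, "signatures": {"sdk=x"}},
    {"start": 6.0, "cluster": 2, "signatures": {"uid=u9"}},   # other cluster
    {"start": 100.0, "cluster": 1, "signatures": {"ver=3"}},  # too late
]
print(count_cooccurrences(flows, T=10))
```

With purely temporal counting, the third flow would also be paired with the first two; the cluster check removes those pairs.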

After signature matching, constrained $K$-means clustering is carried out to identify the remaining unknown flows. Algorithm 2 shows the constrained clustering algorithm used in AppFA. In constrained flow clustering, the flows identified by signature matching that belong to the same app must be placed in one cluster (must-link constraints, lines (4)–(8) in Algorithm 2), and those generated by different apps must be placed in different clusters (cannot-link constraints, lines (9)–(13) in Algorithm 2). The clustering features are listed in Table 3 (11 features in total). We select the time of the first packet sent as one of the features because flows observed within short time intervals are likely to come from the same app [32]. The other features are chosen because they have been shown to be effective in clustering network traffic [36].
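A minimal sketch of constrained $K$-means with must-link and cannot-link constraints is given below. It follows the common COP-KMeans-style strategy of trying clusters from nearest to farthest and taking the first one that violates no constraint, which is our interpretation and may differ in detail from Algorithm 2 and from the CCCG framework used in the prototype. Flows are referred to by index, feature vectors are plain tuples, and all names are hypothetical.

```python
import math


def violates(i, c, assign, must_link, cannot_link):
    """True if putting point i into cluster c breaks a constraint, given the
    points that have already been assigned in this pass."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and assign.get(other, c) != c:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and assign.get(other) == c:
            return True
    return False


def constrained_kmeans(points, centers, must_link, cannot_link, iters=20):
    """Return (labels, centers); raises ValueError if a point has no feasible cluster."""
    centers = [list(c) for c in centers]
    assign = {}
    for _ in range(iters):
        new_assign = {}
        for i, x in enumerate(points):
            ranked = sorted(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
            for c in ranked:               # nearest feasible cluster wins
                if not violates(i, c, new_assign, must_link, cannot_link):
                    new_assign[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for point {i}")
        for c in range(len(centers)):      # recompute centers as member means
            members = [points[i] for i, a in new_assign.items() if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assign == assign:           # converged
            break
        assign = new_assign
    return [assign[i] for i in range(len(points))], centers


points = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5.2)]
labels, _ = constrained_kmeans(points, [(0, 0), (10, 10)], must_link=[(4, 0)], cannot_link=[])
print(labels)
```

In this toy example the last point is slightly closer to the second center, but the must-link constraint with point 0 (as if both flows had matched the same app signature) pulls it into the first cluster. In AppFA, the initial centers could plausibly be seeded from the flows identified by signature matching.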

In Algorithm 2, we set the cluster number $p$ equal to the number of apps identified by signature matching. This is because the accuracy of mobile app identification is already higher than 95% [29, 32], and it is reasonable to assume that the popular apps (malware writers usually graft malicious code onto popular apps to ensure a wide diffusion of their malicious code) can all be recognized. Therefore, each cluster corresponds to one app when the constrained $K$-means clustering is finished. This is appropriate for the subsequent network behavior profile construction and malicious behavior detection. In practice, if the actual number of apps is greater than $p$, namely, some apps cannot be identified by signature matching, the corresponding flows will be misclustered. This may change the observed network behaviors of apps, since unrelated flows will be included and traffic features such as the number of packets and the volume of bytes will be inflated. In this work, we use Kernel Principal Component Analysis to mitigate this problem, as described in the following subsection.

Figure 4: Illustration of the basic idea of app identification. Solid and dotted lines represent different apps, and black color indicates that the flows have not been identified. [Diagram: flow sessions (packet features and contents), signature matching, constrained clustering, app sessions (App 1, App 2).]

Algorithm 1: App traffic clustering procedure.
(1) let $S$ denote the signature set and $S_d$ stand for the signature seeds
(2) while the clustering signal is received do
(3)     if $S$ is empty then
(4)         $S = S_d$
(5)     end if
(6)     carry out signature matching
(7)     carry out constrained flow clustering
(8)     update $S$ with the clustering results
(9) end while

Table 3: Identification feature set.
Time of the first packet sent
Number of packets
Volume of bytes
Min, max, mean, and std. dev. of packet size
Min, max, mean, and std. dev. of interpacket time

Table 4: Network behavior profile feature set.
Number of flows
Number of packets of outgoing and incoming flows
Volume of bytes of outgoing and incoming flows
First $k_1$ results of KPCA transformations of outgoing packet sizes
First $k_2$ results of KPCA transformations of incoming packet sizes
First $k_1$ results of KPCA transformations of outgoing packet intervals
First $k_2$ results of KPCA transformations of incoming packet intervals

5.2. Network Behavior Profile Construction. After app identification, the network behavior profiles for apps are constructed from app sessions. In AppFA, we define an app's network behavior profile as a set of chosen network traffic features, as listed in Table 4. Formally, the network behavior profile is defined as follows:

$$\mathit{netBehaviorProfile} := \{\mathit{feature}_1, \ldots, \mathit{feature}_n\}, \quad \mathit{feature}_i \in \{\text{connection features}, \text{data features}\} \tag{1}$$

Since malicious apps need to establish network connections to transmit confidential data or carry out defined attack steps [18], in this work we mainly choose connection features and data features for constructing apps' network behavior profiles, as shown in (1). The connection features describe how many network connections have been established, and the data features represent characteristics of packets. The full set of features that make up apps' network behavior profiles is listed in Table 4. In the table, the first two lines are connection features and the rest are data features.

Algorithm 2: Constrained flow clustering algorithm.
Input: Data set $F$, cluster number $p$, must-link constraints $L_m \subseteq F \times F$, cannot-link constraints $L_c \subseteq F \times F$
Output: Flow clusters
(1) Let $c_1 \cdots c_p$ be the initial cluster centers
(2) for each flow $f_i$ in $F$ do
(3)     select the closest cluster $c$
(4)     for each $(f_i, f) \in L_m$ do
(5)         if $f \notin c$ then
(6)             goto step (2)
(7)         end if
(8)     end for
(9)     for each $(f_i, f) \in L_c$ do
(10)         if $f \in c$ then
(11)             goto step (2)
(12)         end if
(13)     end for
(14)     assign $f_i$ to the cluster $c$
(15) end for
(16) for each cluster $c_i$ do
(17)     update its center by averaging all of the flows $f_j$ that have been assigned to it
(18) end for
(19) iterate between step (2) and step (18) until convergence
(20) return $c_1 \cdots c_p$

Furthermore, to overcome the misclassification problem (as illustrated in Section 5.1) and to distinguish minor traffic variations from significant differences, we do not use these features directly. Instead, KPCA (Kernel Principal Component Analysis) is applied to transform basic features such as packet size and packet time interval. Basically, KPCA is an approach that generalizes linear PCA to the nonlinear case using the kernel method. It has been reported that KPCA has the best performance in feature extraction and is robust to noise [37]. The details of KPCA used in AppFA are described as follows.

First, the Gaussian function defined in (2) is selected as the kernel function:

$$g(x, y) = \exp\left(-\gamma \|x - y\|^{2}\right) \tag{2}$$

where the value of $\gamma$ is set to 0.001, as indicated in [37]. Then we compute a Gram (kernel) matrix $K$ with

$$K_{ij} = g\left(x^{(i)}, x^{(j)}\right) \tag{3}$$

Next, the kernel matrix $K$ is centered via the following function:

$$K_{\text{centered}} = K - 1_N K - K 1_N + 1_N K 1_N = (I - 1_N)\, K\, (I - 1_N) \tag{4}$$

where $1_N$ is an $N \times N$ matrix with all elements equal to $1/N$, and $N$ is the number of data points.

After that, the nonzero eigenvalues $\lambda_i$ and the eigenvectors $\alpha_i$ of the centered kernel matrix $K_{\text{centered}}$ are calculated as follows:

$$K_{\text{centered}}\, \alpha_i = \lambda_i \alpha_i \tag{5}$$

Also, the eigenvectors $\alpha_i$ are normalized as

$$\alpha_i = \frac{1}{\sqrt{\lambda_i N}}\, \alpha_i \tag{6}$$

Finally, we sort the eigenvectors in descending order of the corresponding eigenvalues and perform projections onto the given subset of eigenvectors. This step can be represented as follows:

$$d_j = \sum_{i=1}^{N} \alpha_{ji}\, g(x, x_i), \quad j = 1, 2, \ldots, k \tag{7}$$

where $k$ is the dimension of the new data.

For each app, the length of its network behavior profile is $(1 + 2(f_o + f_i) + 2(k_1 + k_2))$, where $f_o$ is the number of outgoing flows and $f_i$ is the number of incoming flows.
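Steps (2) to (7) can be sketched with NumPy as shown below. This is a hedged illustration, not the scikit-learn-based code used in the experiments: the function name `kpca_project` is ours, the input is assumed to be an $N \times d$ array of data points, the eigenvectors are scaled by $1/\sqrt{\lambda_i}$ (a common convention that may differ from (6) by a constant factor), and the projection is computed for the training points themselves using the centered kernel matrix.

```python
import numpy as np


def kpca_project(X, k, gamma=0.001):
    """Project the N rows of X onto the first k kernel principal components."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    # (2)-(3): Gaussian kernel matrix from pairwise squared distances.
    sq = (X ** 2).sum(axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * sq_dists)
    # (4): center the kernel matrix.
    one_n = np.full((N, N), 1.0 / N)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # (5): eigen-decomposition (eigh returns ascending eigenvalues).
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    lam, alpha = eigvals[order], eigvecs[:, order]
    if np.any(lam <= 1e-12):
        raise ValueError("fewer than k nonzero eigenvalues")
    # (6): scale the unit-norm eigenvectors (assumed convention, see lead-in).
    alpha = alpha / np.sqrt(lam)
    # (7): projections of the training points, one column per component.
    return Kc @ alpha


# Hypothetical (size, interval) observations of six packets.
X = [[50, 30], [100, 500], [1300, 20], [400, 60], [800, 250], [60, 45]]
Z = kpca_project(X, k=2)
print(Z.shape)
```

Keeping only the first $k_1$ or $k_2$ components, as in Table 4, is what yields a fixed-length summary of the packet sizes and intervals.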

5.3. Malicious App Detection. With apps' network behavior profiles, we propose to use peer group analysis [38] to detect malicious apps. The main idea of the detection method is illustrated in Figure 5. An app's network behavior profile is compared with both its historical data and the profiles of its peer group for malware detection. Comparison with the historical profiles can help us find self-updating malicious apps [7], and comparison with the profiles of its peer group can detect repackaged malicious apps, as observed in Section 2.

For AppFA, an app's peer group is defined as the set of its behavior-similar apps. Apps are considered behavior-similar if they satisfy the following two conditions: (A) their main functionality is similar, for example, all are mail apps; (B) their network behavior profiles are similar. If either of these two conditions is not met, the peer group will be empty, and we will only compare the app's profile with its historical data.

Figure 5: Illustration of the procedure of malicious app detection. The numbers in angle brackets are examples of features listed in Table 4. [Diagram: the profile of App1, <12, 100, 5000>, is compared with its historical data (<11, 110, 5020>, <10, 130, 4600>, <12, 100, 5000>, <12, 101, 5010>) and with its peer group (App2 <20, 400, 6000>, App3 <30, 500, 7000>, App4 <25, 450, 6500>, App5 <19, 370, 5580>). If the profile is similar to the history, App1 has not been injected with malicious code. If the profile is significantly different from its peer group (behavior-similar apps), App1 may be a fake app. The peer group of App1 is then updated.]

Figure 6: Example of similar apps recommended by Google Play.

In practice, one can resort to app stores such as Google Play to find candidates that meet condition A. Figure 6 shows that, when viewing the details of an app, Google Play recommends similar ones. These similar apps are determined by several features, such as the category of the apps, keywords in the title and description, and the size of the apk (https://www.quora.com/How-does-the-Google-Play-Store-determine-similar-apps). Generally, these recommended apps belong to the same category and have similar functionality. Therefore, in order to determine the peer group of an app, we first use the similar apps recommended by app stores such as Google Play as candidates and then filter them by condition B.

For condition B, the Euclidean distance is used to measure the similarity of network behavior profiles. Suppose $A$ is the app to be analyzed and $B$ is an app satisfying condition A for $A$. The similarity of network behavior profiles between $A$ and $B$ is calculated as

$$d(A, B) = \sqrt{\left(nBP(A) - nBP(B)\right)\left(nBP(A) - nBP(B)\right)^{T}} \tag{8}$$

where $nBP$ is the abbreviation of $\mathit{netBehaviorProfile}$ defined in (1).

Once the distances between $A$ and the recommended apps are calculated by (8), the results are sorted in order of increasing distance (i.e., decreasing similarity). The first $g$ apps are selected as the peer group members of $A$, as shown in (9). The optimal value of $g$ can be determined experimentally, as discussed in the next section. Note that the peer group members can be updated. For example, if one of the peer group members is flagged as malicious, it is removed from the peer group and a new one is added. AppFA can also reselect the peer groups every $T$ time interval.

$$\mathit{peerGroup}(A) := \{app_1, \ldots, app_g\}, \quad \text{where } d(A, app_i) \le d(A, app_j) \text{ if } i < j \tag{9}$$
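The peer group selection can be illustrated with a short sketch. It assumes that profiles are plain numeric sequences that may differ in length and are zero-padded before comparison; the names `pad` and `select_peer_group` are ours, and the sample values are the three-feature example profiles shown in Figure 5.

```python
import math


def pad(profile, length):
    """Zero-pad a profile so that all profiles have the same length."""
    return list(profile) + [0.0] * (length - len(profile))


def select_peer_group(target, candidates, g):
    """Return the names of the g candidates closest to target (Euclidean distance (8))."""
    length = max([len(target)] + [len(p) for p in candidates.values()])
    t = pad(target, length)
    ranked = sorted(candidates, key=lambda name: math.dist(t, pad(candidates[name], length)))
    return ranked[:g]


app1 = [12, 100, 5000]
candidates = {
    "App2": [20, 400, 6000],
    "App3": [30, 500, 7000],
    "App4": [25, 450, 6500],
    "App5": [19, 370, 5580],
}
print(select_peer_group(app1, candidates, g=2))
```

Because the raw features have very different scales (here the byte volume dominates), one may want to normalize the profiles before ranking; the sketch keeps the raw distance of (8) for simplicity.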

As illustrated in Figures 2 and 5, for an identified app, AppFA first constructs its network behavior profile and then compares the profile with both its historical profiles and the profiles of its peer group for malicious app detection. Denote by $o$ the profile of the identified app $A$ and by $G$ the matrix of the compared profiles (historical profiles or peer group profiles). Each column of $G$ is one profile, and the column length is $l$ ($l = 1 + 2(f_o + f_i) + 2(k_1 + k_2)$). AppFA makes the feature vectors the same length by padding zeros.

For peer group analysis, $o$ and $G$ are first normalized by the min-max normalization procedure defined in

$$o_i = \frac{o_i}{\max\left(o_i, G_{i1}, G_{i2}, \ldots, G_{ig}\right)}, \qquad G_{ij} = \frac{G_{ij}}{\max\left(o_i, G_{i1}, G_{i2}, \ldots, G_{ig}\right)}, \qquad i = 1, 2, \ldots, l, \; j = 1, 2, \ldots, g \tag{10}$$

Then the distance between $o$ and $G$ is calculated by the Mahalanobis distance

$$\tau = \sqrt{(o - m)^{T} S^{-1} (o - m)} \tag{11}$$

where $m$ is the weighted mean vector of profiles,

$$m_{i1} = \sum_{j=1}^{g} w_{Aj}\, G_{ij}, \quad 1 \le i \le l \tag{12}$$

and $S$ is the covariance matrix, calculated in

$$S = E\left[(G - E[G])(G - E[G])^{T}\right] \tag{13}$$

In (12), $w_{Aj}$ is the weight of the $j$th closest peer group member of the analyzed app $A$. The weights of peer group members are obtained from their proximity to $A$. In detail, we define the proximity of the $j$th closest peer group member of the target app $A$ in

$$\mathrm{prox}_{Aj} = \exp\left(-d(A, j)\right) \tag{14}$$

where $d(A, j)$ is the Euclidean distance between app $A$ and its $j$th closest peer group member, defined in (8). Based on this proximity measure, the weight of the $j$th closest peer group member of app $A$ is defined in

$$w_{Aj} = \frac{\mathrm{prox}_{Aj}}{\sum_{j=1}^{g} \mathrm{prox}_{Aj}} \tag{15}$$

Finally, the state of an app is judged as follows:

$$\tau > t_s: \text{malicious app}; \qquad \tau \le t_s: \text{normal app} \tag{16}$$

where $t_s$ is the threshold, which can be set to different values by network administrators based on the actual network conditions. We evaluated different values of $t_s$ in our experiments.
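The detection step of (10) to (16) can be sketched with NumPy as follows. Several choices here are our own assumptions rather than details given in the paper: the distances in (14) are computed on the normalized profiles (raw byte counts would make the exponentials underflow), a small ridge term is added to $S$ so that it remains invertible when the number of compared profiles is smaller than the profile length, and the names `peer_group_distance` and `is_malicious` are hypothetical. The sample values are the example profiles of Figure 5.

```python
import numpy as np


def peer_group_distance(o, G, ridge=1e-6):
    """Weighted Mahalanobis distance between profile o and the columns of G."""
    o = np.asarray(o, dtype=float)
    G = np.asarray(G, dtype=float)            # shape (l, g), g >= 2
    # (10): divide each feature by its largest value across o and G.
    denom = np.maximum(o, G.max(axis=1))
    denom[denom == 0] = 1.0
    o_n, G_n = o / denom, G / denom[:, None]
    # (14)-(15): proximity-based weights (assumption: normalized distances).
    d = np.linalg.norm(G_n - o_n[:, None], axis=0)
    prox = np.exp(-d)
    w = prox / prox.sum()
    # (12): weighted mean profile.
    m = G_n @ w
    # (13): covariance of the compared profiles (assumption: ridge for stability).
    S = np.cov(G_n) + ridge * np.eye(len(o))
    # (11): Mahalanobis distance.
    diff = o_n - m
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))


def is_malicious(o, G, t_s):
    """(16): flag the app when the distance exceeds the threshold t_s."""
    return peer_group_distance(o, G) > t_s


history = np.array([[11, 110, 5020], [10, 130, 4600],
                    [12, 100, 5000], [12, 101, 5010]]).T   # one profile per column
print(is_malicious([12, 100, 5000], history, t_s=2.0))
print(is_malicious([40, 900, 20000], history, t_s=2.0))
```

A profile that matches the history stays below the threshold, whereas a profile with several times more flows, packets, and bytes is flagged. The threshold value 2.0 is only an example.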

Figure 7: Experimental setup for the detection of mobile malicious apps. [Diagram: samples (apps) run on the testing mobile devices (HUAWEI and MI phones), which reach the Internet through a server running Ubuntu 16.04. The server functions as (i) access point, (ii) traffic collection, and (iii) data analysis.]

6. Evaluation

6.1. Data Collection

Experimental Setup. The experimental setup used for AppFA is shown in Figure 7. We have implemented a prototype of AppFA with the help of ourmon [39] and CCCG [40]. Ourmon is an open-source network monitoring and anomaly detection system, and CCCG is a general framework for constrained clustering. An Ubuntu 16.04 computer has been configured as the access point. For packet capturing, the smartphone is connected to the Internet by Wi-Fi, and network traffic is collected at the access point by tcpdump (http://www.tcpdump.org). Each pcap file is limited to 100 MB. When packet capture is complete, the TCP and UDP flows are split from the pcap files by SplitCap (https://www.netresec.com/?page=SplitCap). For apps' traffic clustering and network behavior profile construction, we use tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to extract IP addresses, packet sizes, and packet interval times from network flows. After that, a Python program calculates the statistical features listed in Tables 3 and 4. The KPCA transformation is accomplished with the help of scikit-learn (http://scikit-learn.org). The data analysis is also completed on the Ubuntu 16.04 computer (with 4 GB memory and a Pentium Dual-Core T4500 CPU).

Traffic Generation. We use the public MalGenome [19] dataset in our experiments. Since we mainly focus on repackaging and updating malware, 93 typical information-collection malware samples, including repackaging and updating attack types, are selected from MalGenome to test the detection rate of AppFA (the malicious apps dataset was downloaded from http://www.malgenomeproject.org in 2015; the selected samples are the ones that run without any errors). We installed all these malicious apps on a HUAWEI Honor 8 and a MI Note 4 phone and ran these apps one by one 50 times to collect network traffic. In detail, these malware samples are analyzed and run by GroddDroid [41] to make sure that the malicious code is triggered. We also use GroddDroid to run other malware samples besides the selected ones; their traffic is mainly used in the local detection described in Section 6.3. In particular, background traffic, including weather forecasting, email checking, QQ, and Webchat (QQ and Webchat are the most famous apps in China), is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

To test the false positive rate of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware, the benign apps are run by droidbot [42] to make sure that enough traffic is generated. With the captured traffic, we use all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the Packet Filter module is set to $-1$.

The data used in the experiments is summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware; thus it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is used to get the ground truth of the originator of network traffic. Recall that, in Algorithm 1, app traffic clustering must be started with signature seeds. In order to obtain apps' signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of cluster number $p$ in Algorithm 2 is set to 97 (97 = 93 + 4), since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and Webchat. The experimental results are shown in Table 6.

As shown in Table 6, for each app, when 5 HTTP flows are chosen to generate signature seeds, only 601 flows (out of 223220 flows in total) are misclustered. This demonstrates the effectiveness of our proposed method for app traffic clustering.

Experimental Results of Malicious App Detection. Similar to previous work, in order to measure the effectiveness of malicious app detection, the accuracy (detection rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively.

$$\text{Accuracy} = \frac{TP}{TP + FN} \tag{17}$$

$$\text{False positive rate} = \frac{FP}{FP + TN} \tag{18}$$
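As a worked instance of (17) and (18), the two metrics can be computed as below. The counts in the usage example are hypothetical and chosen only to show the arithmetic; they are not results from our experiments.

```python
def detection_rate(tp, fn):
    """Accuracy (detection rate) as defined in (17): TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """False positive rate as defined in (18): FP / (FP + TN)."""
    return fp / (fp + tn)


# Hypothetical counts: 90 of 93 malicious apps flagged, 1 of 100 benign apps flagged.
print(round(detection_rate(tp=90, fn=3), 4))
print(round(false_positive_rate(fp=1, tn=99), 4))
```

Note that (17) is computed over malicious samples only, so it corresponds to what is often called the true positive rate or recall.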

We first examine the detection rates and false positiverates of AppFA with different values of 119905119904 and 119896119894 (119894 = 1 2)For the selected malicious apps the apps with the samefunctionality (satisfying condition A) are determined byGoogle Play We first analyze the malicious appsrsquo function-alities manually and choose a keyword for each app Then

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). (Plot omitted: detection rate, from 0.5 to 1, against $t_s$, from 0.4 to 2, with one curve each for $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.)

we search the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of the member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8 and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are sufficient for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, the detection rates and the false positive rates change only slightly as $g$ increases. This is because the Mahalanobis distance $\tau$ defined in (11) is calculated from the weighted mean vector of the profiles: the most similar app has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.
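The effect of decreasing peer weights can be illustrated with a small numerical sketch. The weighting scheme below (weights halving with each similarity rank) and the two-feature profiles are assumptions made purely for illustration; the actual weights used by AppFA are those of (11), which is defined elsewhere in the paper.

```python
def weighted_mean(profiles, weights):
    """Weighted mean of equally long feature vectors."""
    total = sum(weights)
    dim = len(profiles[0])
    return [
        sum(w * p[d] for p, w in zip(profiles, weights)) / total
        for d in range(dim)
    ]


# Hypothetical peers sorted by decreasing similarity to the tested app.
# Each profile is a 2-feature vector; weights halve with every rank.
peers = [[10.0 + i, 200.0 - 5 * i] for i in range(10)]
weights = [0.5 ** i for i in range(10)]

mean_5 = weighted_mean(peers[:5], weights[:5])
mean_10 = weighted_mean(peers[:10], weights[:10])
shift = max(abs(a - b) for a, b in zip(mean_5, mean_10))
print(mean_5, mean_10, shift)
```

Under these assumed weights, doubling the peer group from 5 to 10 members moves the weighted mean by less than one unit in either feature, which matches the qualitative behavior described above.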

Further, AppFA is also tested with different values of the cluster number $p$. Note that the actual cluster number is 97 for TrafficSet 1. To evaluate how app traffic clustering impacts the detection of repackaged malware, we slightly change the value of $p$; the experimental results are shown in Figure 12. In the experiments, we set $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. As shown in Figure 12, the cluster number indeed impacts the detection rate of AppFA. In particular, when $p = 99$, the detection rate is only 88%; namely, only 82


Table 5: Data summary.

| Traffic set | Apps | Quantity | Number of collected flows (TCP and UDP) | Amount of data (bytes) |
|---|---|---|---|---|
| TrafficSet 1 | Selected malicious apps (for testing accuracy of traffic clustering and detection rates) | 93 | 223220 | 538G |
| TrafficSet 2 | Benign apps (for testing false positive rates of AppFA) | 100 | 305103 | 640G |
| TrafficSet 3 | All malicious apps (for malicious apps detection in local networks) | 1260 | ≈31 million | 1872G |

Table 6: Experimental results of traffic clustering.

| Number of HTTP flows for generating signature seeds | Number of flows correctly clustered | Number of misclustered flows |
|---|---|---|
| 2 | 201357 | 21863 |
| 3 | 218946 | 4274 |
| 4 | 221384 | 1836 |
| 5 | 222619 | 601 |

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). (Plot omitted: false positive rate, from 0.0005 to 0.004, against $t_s$, from 0.4 to 2, with one curve each for $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.)

malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and 96 in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA remains effective when a few apps are unknown during app traffic clustering.

Finally, we evaluate AppFA with the remaining malicious apps besides the 93 selected samples. The remaining malicious apps include repackaged and other types of malware. These samples were found to produce some errors when run on our phones; however, they still generated considerable network traffic. We test AppFA with these apps to examine its general capability. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results

Figure 10: Detection rates of AppFA with different values of $g$. (Plot omitted: detection rate, from 0 to 1, for $g$ = 1, 3, 5, 7, and 10.)

Figure 11: False positive rates of AppFA with different values of $g$. (Plot omitted: false positive rate, from 0 to $7 \times 10^{-3}$, for $g$ = 1, 3, 5, 7, and 10.)

displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons are discussed in Section 6.4.

Note that AppFA is designed to perform nearly real-time malicious app detection; thus, efficiency is also a major concern. Table 8 presents the computational performance for major


Figure 12: Detection rates of AppFA with different values of $p$. (Plot omitted: detection rate, from 0 to 1, for $p$ = 95, 96, 98, and 99.)

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

| Detection rate | False positive rate |
|---|---|
| 73.4% | 3.5% |

Table 8: The computational performance.

| Procedures | Time |
|---|---|
| Constrained flow clustering (Algorithm 1) | 3 min |
| App identification (Algorithm 2) | 5.2 min |
| Network behavior profile construction | 1.7 min |
| Peer group analysis | 45 sec |

steps in terms of average running time on the experimental data. In general, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, such as cloud resources, can further be used to speed up malicious app detection and scale to more Internet traffic data.

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses statistical traffic features to detect malicious Android apps. Thus, AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we created a new Android repackaged malware that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot's source code so that its network connections are encrypted with the TLS protocol. Then we graft it into a popular game app, aircomaceviralmotox3. Finally, we collect the network traffic of the repackaged aircomaceviralmotox3 and its peer group. In the experiments

Table 9: Experimental results when searching similar apps locally.

| Detection rate | False positive rate |
|---|---|
| 69.2% | 4.7% |

the selected peer group of the repackaged aircomaceviralmotox3 is comwordmobilesbikeRacing, comtopfreegamesbikeracefreeworld, comskgamestrafficrider, comtomicowheeliechallenge, and madmotoracingbike. We repeat the detection of the repackaged aircomaceviralmotox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps' peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed and shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps tend to have similar functionalities, and thus their context (keywords) may be similar. Therefore, we use web-page searching technologies to match similar apps. The $Matching$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) from the results returned by Algorithm 3.
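One possible realization of the keyword matching step is sketched below: each app's keyword list is turned into a TF-IDF vector, and apps are ranked by cosine similarity to the tested app. The smoothed idf formula, the cosine score, the function names, and the example keywords are all assumptions for illustration; the paper only states that TF-IDF can be used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each keyword list to a {term: tf-idf weight} dictionary."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed idf keeps terms shared by all documents above zero.
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_apps(tested, identified, m):
    """Return the names of the m identified apps most similar to `tested`."""
    names = list(identified)
    vectors = tfidf_vectors([tested] + [identified[name] for name in names])
    scores = [(cosine(vectors[0], vec), name)
              for vec, name in zip(vectors[1:], names)]
    scores.sort(key=lambda s: s[0], reverse=True)
    return [name for _, name in scores[:m]]


# Hypothetical keyword sets W(A) and W(i); the app names are made up.
w_a = ["bike", "race", "score", "level"]
w_i = {
    "racing_game": ["bike", "race", "track", "score"],
    "weather_app": ["city", "forecast", "temperature"],
    "puzzle_game": ["level", "score", "hint"],
}
print(similar_apps(w_a, w_i, m=2))
```

The returned list plays the role of the first $m$ entries of the sorted score array in Algorithm 3.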

We have implemented the local similar-app search and evaluated AppFA's performance in local networks. All apps, including all malicious and benign apps, are considered identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that the choice of proper peer groups does affect the final detection significantly.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be because other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to packet sizes and the number of flows. So the detection rate may decline if the apps' network behaviors are only slightly changed.

Compared to the latest work in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9% as reported in [6]). However, Garg et al.'s method [6] considered only 18 malware apps and 14 genuine apps and assumed


(1) extract all plaintext keywords from HTTP headers of the tested app $A$. Denote the set of keywords as $W(A)$
(2) extract all plaintext keywords from HTTP headers of other identified apps. Denote the set of keywords of identified app $i$ as $W(i)$, $i = 0, 1, \ldots, n - 1$
(3) while $i < n$ do (starting from $i = 0$)
(4)   $Scores[i].value = Matching(W(A), W(i))$
(5)   $Scores[i].app = i$
(6)   $i{+}{+}$
(7) end while
(8) sort $Scores[]$ by $Scores[i].value$ in decreasing order
(9) return the first $m$ entries $Scores[i].app$, $i = 0, 1, \ldots, m - 1$

Algorithm 3: Searching similar apps locally.

Table 10: Experimental results when it is exactly known which app generates which traffic. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

| | Selected malicious apps | Remaining malicious apps |
|---|---|---|
| Detection rate | 98.3% | 85% |
| False positive rate | 0.03% | 2.7% |

that the traffic generated by each app was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account, and constrained clustering is exploited to determine to which app the flows belong. The accuracy of clustering is also a factor affecting the detection rate. To confirm this, we further adopt the same assumption as Garg et al.'s method [6]; namely, we assume that the accuracy of Algorithm 1 is 100% and that we know exactly which traffic is generated by which app (in our experiments, we use the app Packet Capture to obtain perfect ground truth of which flows came from which app). Under this assumption, the experimental results are given in Table 10. The detection rates are clearly improved.

In practice, one can deploy our method together with other methods such as [6, 15]. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps, whereas [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. So our work can complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features of network traffic (refer to Table 4) to detect malicious repackaged Android applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. However, this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus, to evade detection, attackers must remove some normal network connections to keep the network behavior unchanged. However, removing normal network connections will impact the functionalities of the apps and may cause errors, which gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may itself be abnormal. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then Kernel Principal Component Analysis is employed to construct the app network behavior profile and to distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps, which avoids time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus, it is convenient to use and its cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The apps' network behaviors may be significantly changed by version updates and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and to detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61702282 and 61502250, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016CX2016026).


References

[1] Android statistics, "Number of android applications," September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, "Assessing Privacy Risks in Android: A User-Centric Approach," in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] "Does your mobile anti-virus app protect or infect you? The truth behind DU Antivirus Security," September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, "A survey of mobile malware in the wild," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, "The evolution of android malware and android analysis techniques," ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, "Network-based detection of Android malicious apps," International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, "Mobile malware detection through analysis of deviations in application network behavior," Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, "PBMDS: a behavior-based malware detection system for cellphone devices," in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec '10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: behavior-based malware detection system for android," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, "MADAM: a multi-level anomaly detector for android malware," in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., "Vetting undesirable behaviors in android apps with permission use analysis," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, "Android Malware Clustering Analysis on Network-Level Behavior," in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, "Malware detection on mobile devices using distributed machine learning," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, "Bring your own device security issues and challenges," in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, "Simple and effective method for detecting abnormal internet behaviors of mobile devices," Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] "Global mobile OS market share in sales to end users from 1st quarter 2009 to 1st quarter 2017," 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] "IT threat evolution Q3 2017 statistics," November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., "Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting," IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, "Dissecting android malware: characterization and evolution," in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, "AndroSimilar: Robust signature for detecting variants of Android malware," Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the AnserverBot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., "Android security: a survey of issues, malware penetration, and defenses," IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, "A survey on dynamic mobile malware detection," Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, "Behavior-based malware detection on mobile phone," in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, "Exposing mobile malware from the inside (or what is your mobile app really doing?)," Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, "The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones," in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, "Identifying diverse usage behaviors of smartphone apps," in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC '11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, "NetworkProfiler: Towards automatic fingerprinting of Android apps," in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., "Flowr: A self-learning system for classifying mobile application traffic," in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., "Automatic generation of mobile app signatures from traffic observations," in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., "Automatically identifying apps in mobile traffic," Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, "Identifying Mobile Applications for Encrypted Network Traffic," in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, "A novel active website fingerprinting attack against Tor anonymous system," in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, "A novel application classification attack against Tor," Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, "I know why you went to the clinic: Risks and realization of HTTPS traffic analysis," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, "Internet traffic classification using constrained clustering," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, "A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine," Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, "Stock fraud detection using peer group analysis," Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] "Ourmon network monitoring and anomaly detection system," April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, "Constrained Clustering Using Column Generation," in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, "GroddDroid: A gorilla for triggering malicious behaviors," in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, "DroidBot: A lightweight UI-guided test input generator for android," in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, "An information-theoretic perspective of tf-idf measures," Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.



Figure 3: Example of key-value pairs.

signs, and the packet interarrival times are labelled by an asterisk. The packet order features are also reflected inherently by the number sequences. For example, for the flow +50, *30, +100, *500, -1300, *20, -400, the first packet is an outgoing packet whose size is 50 bytes; the second packet is also an outgoing packet, and the packet time interval is 30 ms. The third packet is an incoming packet whose size is 1300 bytes. The time interval between the second and the third packet is 500 ms, and so forth.
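The flow encoding in this example can be decoded mechanically. The sketch below assumes a whitespace-separated form of the sequence, with "+" for outgoing sizes, "-" for incoming sizes, and "*" for interarrival times in milliseconds; the `Packet` type and `parse_flow` function are illustrative names, not part of AppFA.

```python
from typing import List, NamedTuple


class Packet(NamedTuple):
    direction: str   # "out" for "+", "in" for "-"
    size: int        # packet size in bytes
    gap_ms: int      # time since the previous packet (0 for the first one)


def parse_flow(encoded: str) -> List[Packet]:
    """Decode a sequence such as '+50 *30 +100' into packets."""
    packets, gap = [], 0
    for token in encoded.split():
        tag, value = token[0], int(token[1:])
        if tag == "*":            # interarrival time before the next packet
            gap = value
        elif tag in "+-":
            packets.append(Packet("out" if tag == "+" else "in", value, gap))
            gap = 0
        else:
            raise ValueError(f"unexpected token: {token!r}")
    return packets


flow = parse_flow("+50 *30 +100 *500 -1300 *20 -400")
for packet in flow:
    print(packet)
```

Because the tokens are read in order, the packet ordering is preserved without any extra bookkeeping.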

Note that the packet contents are also saved by the Packet Content Extractor module for app identification. Similar to [30], we focus on key-value pairs in HTTP headers. In detail, justniffer (http://justniffer.sourceforge.net) is first used to transform the raw packet traffic into HTTP messages. Then HTTP messages are tokenized by several tokenizers, such as space, "\r\n", and "&", and each HTTP request is broken into various parts, including method, page, and query. Finally, queries are divided into key-value pairs. Figure 3 is an example of key-value pairs used in our experiments.
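The splitting of a request into method, page, and query key-value pairs can be sketched with Python's standard `urllib.parse` module. This is only an illustration of the tokenization idea, not the justniffer-based pipeline used in the experiments; the request line and its keys are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def request_signature_parts(request_line: str):
    """Split an HTTP request line into method, page, and query key-value pairs."""
    method, target, _version = request_line.split(" ", 2)
    url = urlsplit(target)
    pairs = parse_qsl(url.query, keep_blank_values=True)
    return method, url.path, pairs


# Hypothetical request line; the path and keys are made up.
line = "GET /ads/fetch?app_id=com.example.game&sdk=2.1&lang=en HTTP/1.1"
method, page, pairs = request_signature_parts(line)
print(method, page, pairs)
```

Each `(key, value)` tuple in `pairs` is a candidate signature of the kind shown in Figure 3.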

After obtaining the basic features from flows, we generate identification features in the Identification Feature Generator module. After that, we identify the app for each flow through the App Traffic Clustering module, which returns the clustering results to the Session Builder module for forming app sessions. With the app session information, detection features are created in the Profile Feature Generator module. Finally, app network behavior profiles are constructed in the App Behavior Profile Constructor module, and malicious apps are detected in the Malicious App Detection module. The independent treatment of different functional modules makes the AppFA architecture clear and scalable.

5. Methodology

As depicted in Figure 2, app traffic clustering and malicious app detection (including identification/detection feature generation) are the core components of AppFA; the technical details are clarified in the following subsections.

5.1. App Traffic Clustering. The basic idea of our app traffic clustering method is illustrated in Figure 4. In the figure, there are two apps, and their flow sessions are represented by solid and dotted lines, respectively. For each flow, signature matching is first used to identify HTTP flows that have recognized signatures. Note that, in this paper, we use the term signature for plaintext matching and feature for traffic analysis. With this step, two flows are identified (labelled as red and green), as shown in the left dashed box of Figure 4. After signature matching, a constrained clustering algorithm (the second dashed box in Figure 4) is exploited to cluster all flows. Compared to ordinary clustering algorithms, constrained clustering adopts background information (i.e., identified flows) to improve clustering accuracy. By constrained clustering, flows that cannot be identified by signatures, such as encrypted traffic, are classified into appropriate clusters, that is, apps. Finally, we obtain app sessions that are utilized for creating the apps' network behavior profiles, as depicted in the last step in Figure 4.

Formally, the entire app identification procedure is described in Algorithm 1. In the while loop, the clustering signal can be a timeout for cyclic identification (e.g., every 1 hour) or the completion of an app session for real-time identification. For the details of selecting the initial signature seeds $S_d$, one can refer to [30].

Algorithm 1 uses a method similar to the FLOWR system [30] to carry out signature matching (line (6)) and takes advantage of the constrained $K$-means clustering algorithm [36] for flow clustering (line (7)). For signature matching, the key-value pairs in HTTP headers are considered the app's signatures, and an initial set of seeding app signatures is set up to bootstrap the learning of new ones. Compared to FLOWR, we refine the process of counting co-occurrences of app signatures with the constrained clustering results. In FLOWR, if the start time of $flow_2$ is less than $T$ seconds after the start time of $flow_1$, their signatures are considered a co-occurrence instance. However, as noted in [30], if $T$ is overestimated, FLOWR is more likely to mix flows from different apps, thus inducing noise and overutilizing system resources. To overcome this problem, AppFA considers the clustering results in addition to temporal information when counting co-occurrences of app signatures. That is, only flows that occur within $T$ seconds of each other in the same cluster are counted as a co-occurrence. This reduces noise, since the flows are filtered by constrained clustering.
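The refined co-occurrence rule can be sketched as follows: two signatures are counted together only if their flows start within $T$ seconds of each other and fall in the same cluster. The flow representation (dictionaries with `start`, `cluster`, and `signature` keys) and the example values are assumptions for illustration, not the data structures used in AppFA or FLOWR.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(flows, window_s):
    """Count signature pairs seen within `window_s` seconds in one cluster.

    Each flow is a dict with hypothetical keys:
    'start' (seconds), 'cluster' (cluster id), 'signature' (key-value string).
    """
    counts = Counter()
    ordered = sorted(flows, key=lambda f: f["start"])
    for first, second in combinations(ordered, 2):
        close_in_time = second["start"] - first["start"] < window_s
        same_cluster = first["cluster"] == second["cluster"]
        distinct = first["signature"] != second["signature"]
        if close_in_time and same_cluster and distinct:
            counts[frozenset((first["signature"], second["signature"]))] += 1
    return counts


flows = [
    {"start": 0.0, "cluster": 1, "signature": "app_id=game"},
    {"start": 2.0, "cluster": 1, "signature": "sdk=ads"},
    {"start": 3.0, "cluster": 2, "signature": "city=nanjing"},
    {"start": 60.0, "cluster": 1, "signature": "lang=en"},
]
print(count_cooccurrences(flows, window_s=10))
```

In this example, the flow in cluster 2 starts within the time window but is not paired with the cluster 1 signatures, which is exactly the filtering effect the cluster check is meant to provide.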

After signature matching, constrained $K$-means clustering is carried out to identify the remaining unknown flows. Algorithm 2 shows the constrained clustering algorithm exploited in AppFA. In constrained flow clustering, the flows identified by signature matching that belong to the same app must be clustered into one cluster (must-link constraints, lines (4)–(8) in Algorithm 2), and those generated by different apps must be clustered into different clusters (cannot-link constraints, lines (9)–(13) in Algorithm 2). The clustering features are listed in Table 3 (11 features in total). We select the time of the first packet sent as one of the features because flows observed within short time intervals are likely to come from the same app [32]. The other features are chosen because they have proved effective in clustering network traffic [36].
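The role of the constraints can be illustrated with a deliberately simplified, seeded variant of $K$-means. It is not Algorithm 2 itself: here, flows already labelled by signature matching are pinned to their app's cluster (so same-app flows stay together and different-app flows stay apart), and only unlabelled flows are reassigned. The function name, the two-dimensional features, and the example data are assumptions.

```python
import math


def constrained_kmeans(points, labels, iterations=10):
    """Cluster points while keeping signature-labelled points in their app's cluster.

    points: list of feature vectors; labels: app index for flows identified
    by signature matching, or None for unidentified flows. Every app index
    from 0 to p - 1 is assumed to have at least one labelled flow.
    """
    p = max(label for label in labels if label is not None) + 1
    dim = len(points[0])

    def mean(vectors):
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    # Initialise each centroid from the labelled flows of one app.
    centroids = [mean([pt for pt, label in zip(points, labels) if label == k])
                 for k in range(p)]
    assignment = list(labels)
    for _ in range(iterations):
        for i, pt in enumerate(points):
            if labels[i] is not None:
                assignment[i] = labels[i]   # constraint: label is never overridden
            else:
                assignment[i] = min(range(p),
                                    key=lambda k: math.dist(pt, centroids[k]))
        centroids = [mean([pt for pt, a in zip(points, assignment) if a == k])
                     for k in range(p)]
    return assignment


# Hypothetical 2-feature flows; two of them were identified by signatures.
points = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [7.5, 8.2], [1.1, 1.3], [7.9, 7.7]]
labels = [0, None, 1, None, None, None]
print(constrained_kmeans(points, labels))
```

The cluster count equals the number of apps found by signature matching, mirroring the choice of $p$ described for Algorithm 2.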

In Algorithm 2, we set the cluster number $p$ equal to the number of apps identified by signature matching. This is because the accuracy of mobile app identification is already higher than 95% [29, 32], and it is reasonable to assume that the popular apps (malware writers usually graft malicious code onto popular apps to ensure a wide diffusion


Figure 4: Illustration of the basic idea of app identification. Solid and dotted lines represent different apps, and black color indicates that the flows have not been identified. (Diagram omitted: flow sessions, packet features, and contents pass through Signature Matching and Constrained Clustering to yield the app sessions of App 1 and App 2.)

(1) let $S$ denote the signature set and $S_d$ the signature seeds
(2) while the clustering signal is received do
(3)   if $S$ is empty then
(4)     $S = S_d$
(5)   end if
(6)   carry out signature matching
(7)   carry out constrained flow clustering
(8)   update $S$ with the clustering results
(9) end while

Algorithm 1: App traffic clustering procedure.

Table 3: Identification feature set.

Time of the first packet sent
Number of packets
Volume of bytes
Min, max, mean, and std. dev. of packet size
Min, max, mean, and std. dev. of interpacket time
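As an illustration of how the 11 features of Table 3 could be computed for a single flow, a short Python sketch follows. The input format (a list of `(timestamp, size)` pairs) and the use of the population standard deviation are assumptions; the feature extraction script used in our prototype is not reproduced here.

```python
import statistics

def flow_features(packets):
    """Return the 11 clustering features for one flow.

    packets: list of (timestamp_seconds, size_bytes) tuples in capture
             order (hypothetical input format).
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Interpacket times; a single-packet flow gets one zero gap.
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]

    def summary(values):
        # Min, max, mean, and (population) standard deviation.
        return [min(values), max(values),
                statistics.fmean(values), statistics.pstdev(values)]

    return [times[0], len(packets), sum(sizes)] + summary(sizes) + summary(gaps)

features = flow_features([(10.0, 100), (10.5, 300), (11.5, 200)])
```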

Table 4: Network behavior profile feature set.

Number of flows
Number of packets of outgoing and incoming flows
Volume of bytes of outgoing and incoming flows
First $k_1$ results of KPCA transformations of outgoing packet sizes
First $k_2$ results of KPCA transformations of incoming packet sizes
First $k_1$ results of KPCA transformations of outgoing packet intervals
First $k_2$ results of KPCA transformations of incoming packet intervals

of their malicious code) can all be recognized. Therefore, each cluster will correspond to one app when the constrained $K$-means clustering is finished. This is appropriate for the subsequent network behavior profile construction and malicious behavior detection. In practice, if the actual number of apps is greater than $p$, namely, some apps cannot be identified by signature matching, the corresponding flows will be misclustered. This may change the observed network behaviors of apps, since unrelated flows will be included and traffic features such as the number of packets and the volume of bytes will be inflated. In this work, we use Kernel Principal Component Analysis to mitigate this problem, as described in the following subsection.

5.2. Network Behavior Profile Construction. After app identification, the network behavior profiles for apps are constructed from app sessions. In AppFA, we define an app's network behavior profile as a set of chosen network traffic features, as listed in Table 4. Formally, the network behavior profile can be defined as follows:

$$\mathit{netBehaviorProfile} := \{\mathit{feature}_1, \ldots, \mathit{feature}_n\}, \quad \mathit{feature}_i \in \{\text{connection features}, \text{data features}\}. \tag{1}$$

Since malicious apps need to establish network connections to transmit confidential data or carry out defined attack steps [18], in this work we mainly choose connection features and data features for constructing apps' network behavior profiles, as shown in (1). The connection features describe how many network connections have been established, and the data features represent characteristics of packets. The full set of features that make up apps' network behavior profiles is listed in Table 4. In the table, the first two lines are connection features and the rest are data features.

Furthermore, to overcome the misclustering problem (described in Section 5.1) and to distinguish minor traffic variations from significant differences, we do not use these features directly. Instead, KPCA (Kernel Principal Component Analysis) is applied to transform basic features such as packet size and packet time interval. Basically, KPCA


Input: data set $F$, cluster number $p$, must-link constraints $L_m \subseteq F \times F$, cannot-link constraints $L_c \subseteq F \times F$
Output: flow clusters
(1) let $c_1, \ldots, c_p$ be the initial cluster centers
(2) for each flow $f_i$ in $F$ do
(3)   select the closest cluster $c$
(4)   for each $(f_i, f) \in L_m$ do
(5)     if $f \notin c$ then
(6)       goto step (2)
(7)     end if
(8)   end for
(9)   for each $(f_i, f) \in L_c$ do
(10)    if $f \in c$ then
(11)      goto step (2)
(12)    end if
(13)  end for
(14)  assign $f_i$ to the cluster $c$
(15) end for
(16) for each cluster $c_i$ do
(17)   update its center by averaging all of the flows $f_j$ that have been assigned to it
(18) end for
(19) iterate between step (2) and step (18) until convergence
(20) return $c_1, \ldots, c_p$

Algorithm 2: Constrained flow clustering algorithm.

is one approach to generalizing linear PCA to the nonlinear case using the kernel method. KPCA has been shown to perform well in feature extraction and to be robust to noise [37]. The details of KPCA as used in AppFA are described as follows.

First, the Gaussian function defined in (2) is selected as the kernel function:

$$g(x, y) = \exp\left(-\gamma \|x - y\|^2\right), \tag{2}$$

where the value of $\gamma$ is set to 0.001, as indicated in [37]. Then we compute the Gram/kernel matrix $K$ with

$$K_{ij} = g\left(x^{(i)}, x^{(j)}\right). \tag{3}$$

Next, the kernel matrix $K$ is centered via the following function:

$$K_{\text{centered}} = K - 1_N K - K 1_N + 1_N K 1_N = (I - 1_N)\, K\, (I - 1_N), \tag{4}$$

where $1_N$ is an $N \times N$ matrix with all elements equal to $1/N$ and $N$ is the number of data points.

After that, the nonzero eigenvalues $\lambda_i$ and the eigenvectors $\alpha_i$ of the centered kernel matrix $K_{\text{centered}}$ are calculated as follows:

$$K_{\text{centered}}\, \alpha_i = \lambda_i \alpha_i. \tag{5}$$

Also, the eigenvectors $\alpha_i$ are normalized as

$$\alpha_i = \frac{1}{\sqrt{\lambda_i N}}\, \alpha_i. \tag{6}$$

Finally, we sort the eigenvectors in descending order of the corresponding eigenvalues and perform projections onto the given subset of eigenvectors. This step can be represented as follows:

$$d_j = \sum_{i=1}^{N} \alpha_{ji}\, g(x, x_i), \quad j = 1, 2, \ldots, k, \tag{7}$$

where $k$ is the dimension of the new data.

For each app, the length of its network behavior profile is $1 + 2(f_o + f_i) + 2(k_1 + k_2)$, where $f_o$ is the number of outgoing flows and $f_i$ is the number of incoming flows.
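The KPCA steps (2)–(7) can be traced with a short NumPy sketch. It is meant only to make the computation concrete: the function name `kpca_project` is hypothetical, the eigenvectors are scaled by $1/\sqrt{\lambda_i N}$ as read from (6), and the projection of the training points uses the centered kernel matrix, which is an assumption of this sketch.

```python
import numpy as np

def kpca_project(X, k, gamma=0.001):
    """Project the rows of X onto the first k kernel principal components."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # (2)-(3): Gaussian kernel matrix from pairwise squared distances.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(
        sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * sq_dists)
    # (4): centering with the N x N matrix whose entries are all 1/N.
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # (5): eigh returns eigenvalues in ascending order; take the k largest.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    top = np.argsort(eigvals)[::-1][:k]
    lam, alpha = eigvals[top], eigvecs[:, top]
    # (6): scale each unit eigenvector by 1 / sqrt(lambda_i * N).
    alpha = alpha / np.sqrt(lam * n)
    # (7): projections of the training points (centered kernel assumed).
    return K_centered @ alpha, lam

# Hypothetical packet sizes (bytes) of one app's outgoing packets.
packet_sizes = np.array([[60.0], [64.0], [1500.0], [1400.0], [700.0], [52.0]])
projected, eigenvalues = kpca_project(packet_sizes, k=2)
```

In practice the same transformation can be obtained, up to scaling conventions, from scikit-learn's `KernelPCA` with `kernel="rbf"` and the same `gamma`.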

5.3. Malicious App Detection. With apps' network behavior profiles, we propose to use peer group analysis [38] to detect malicious apps. The main idea of the detection method is illustrated in Figure 5. An app's network behavior profile is compared with both its historical data and the profiles of its peer group for malware detection. Comparison with the historical profiles can help find self-updating malicious apps [7], and comparison with the profiles of the peer group can detect repackaged malicious apps, as observed in Section 2.

For AppFA, an app's peer group is defined as the set of its behavior-similar apps. Apps are considered behavior-similar if they satisfy the following two conditions: (A) their main functionality is similar, for example, all are mail apps; (B) their network behavior profiles are similar. If either of the two conditions is not met, the peer group will be empty and we will only compare the app's profile with its historical data.


Figure 5: Illustration of the procedure of malicious app detection. The profile of App1 is compared with its historical data (a profile similar to the history indicates that App1 has not been injected with malicious code) and with the profiles of its peer group (a profile significantly different from those of its behavior-similar apps indicates that App1 may be a fake app). The numbers in angle brackets are examples of features listed in Table 4, for example <12, 100, 5000> for App1.

Figure 6: Example of similar apps recommended by Google Play.

In practice, one can resort to app stores such as Google Play to find candidates that meet condition A. Figure 6 shows that, when viewing the details of an app, Google Play recommends similar ones. These similar apps are determined by several features, such as the category of the apps, keywords in the title and description, and the size of the apk (https://www.quora.com/How-does-the-Google-Play-Store-determine-similar-apps). Generally, these recommended apps belong to the same category and have similar functionality. Therefore, in order to determine the peer group of an app, we first use the similar apps recommended by app stores such as Google Play as candidates and then filter them by condition B.

For condition B, the Euclidean distance is used to measure the similarity of network behavior profiles. Suppose $A$ is the app to be analyzed and $B$ is an app satisfying condition A for $A$. The similarity of network behavior profiles between $A$ and $B$ is calculated as

$$d(A, B) = \sqrt{\left(\mathit{nBP}(A) - \mathit{nBP}(B)\right) \left(\mathit{nBP}(A) - \mathit{nBP}(B)\right)^{T}}, \tag{8}$$

where $\mathit{nBP}$ is the abbreviation of $\mathit{netBehaviorProfile}$ defined in (1).

Once the similarities between $A$ and the recommended apps are calculated by (8), the results are sorted in order of increasing distance (i.e., decreasing similarity). The first $g$ apps are selected as the peer group members of $A$, as shown in (9). The optimal value of $g$ can be determined experimentally, as discussed in the next section. Note that the peer group members can be updated. For example, if one of the peer group members has been flagged as malicious, it is removed from the peer group and a new one is added. AppFA can also reselect the peer groups at every time interval $T$.

$$\mathit{peerGroup}(A) := \{\mathit{app}_1, \ldots, \mathit{app}_g\}, \quad \text{where } d(A, \mathit{app}_i) \le d(A, \mathit{app}_j) \text{ if } i < j. \tag{9}$$
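The selection in (8) and (9) amounts to ranking the candidates by Euclidean distance and keeping the $g$ closest. A minimal Python sketch follows; the function name `select_peer_group`, the dictionary of candidates, and the three-feature profiles are illustrative assumptions.

```python
import math

def select_peer_group(profile_a, candidates, g):
    """Return the names of the g candidates closest to profile_a.

    candidates: dict mapping app name -> profile of the same length
                as profile_a (hypothetical representation).
    """
    ranked = sorted(candidates,
                    key=lambda name: math.dist(profile_a, candidates[name]))
    return ranked[:g]

# Illustrative three-feature profiles (not measured data).
target = (12, 100, 5000)
candidate_profiles = {
    "App2": (20, 400, 6000),
    "App3": (30, 500, 7000),
    "App4": (25, 450, 6500),
    "App5": (19, 370, 5580),
}
peers = select_peer_group(target, candidate_profiles, g=2)
```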

As illustrated in Figures 2 and 5, for an identified app, AppFA first constructs its network behavior profile and then compares the profile with both its historical profiles and the profiles of its peer group for malicious app detection. Denote $o$ as the profile of the identified app $A$ and $G$ as the matrix of the compared profiles (historical profiles or peer group profiles). Each column of $G$ is one profile, and the column length is $l$ ($l = 1 + 2(f_o + f_i) + 2(k_1 + k_2)$). AppFA makes the feature vectors the same length by padding zeros.

For peer group analysis, $o$ and $G$ are first normalized by the min-max normalization procedure defined in (10):

$$o_i = \frac{o_i}{\max\left(o_i, G_{i1}, G_{i2}, \ldots, G_{ig}\right)}, \qquad G_{ij} = \frac{G_{ij}}{\max\left(o_i, G_{i1}, G_{i2}, \ldots, G_{ig}\right)}, \qquad i = 1, 2, \ldots, l, \; j = 1, 2, \ldots, g. \tag{10}$$

Then the distance between $o$ and $G$ is calculated by the Mahalanobis distance

$$\tau = \sqrt{(o - m)^{T} S^{-1} (o - m)}, \tag{11}$$

where $m$ is the weighted mean vector of profiles,

$$m_{i1} = \sum_{j=1}^{g} w_{Aj}\, G_{ij}, \quad 1 \le i \le l, \tag{12}$$

and $S$ is the covariance matrix, calculated as

$$S = E\left[(G - E[G])(G - E[G])^{T}\right]. \tag{13}$$

In (12), $w_{Aj}$ is the weight of the $j$th closest peer group member of the analyzed app $A$. The weights of peer group members are obtained from their proximity to $A$. In detail, we define the proximity of the $j$th closest peer group member of the target app $A$ as

$$\mathit{prox}_{Aj} = \exp\left(-d(A, j)\right), \tag{14}$$

where $d(A, j)$ is the Euclidean distance between app $A$ and its $j$th closest peer group member, defined in (8). Based on this proximity measure, the weight of the $j$th closest peer group member of app $A$ is defined as

$$w_{Aj} = \frac{\mathit{prox}_{Aj}}{\sum_{j=1}^{g} \mathit{prox}_{Aj}}. \tag{15}$$

Finally, the state of an app is judged as follows:

$$\begin{cases} \tau > t_s, & \text{malicious app}, \\ \tau \le t_s, & \text{normal app}, \end{cases} \tag{16}$$

where $t_s$ is the threshold, which can be set to different values by network administrators based on the actual network conditions. We evaluated different values of $t_s$ in our experiments.
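The peer group analysis of (10)–(16) can be summarized in a short NumPy sketch. The function names, the guard against all-zero (padded) rows, and the use of a pseudo-inverse when $S$ is singular are assumptions added for robustness; the sketch also assumes that the distances $d(A, j)$ are on a scale where $\exp(-d)$ does not underflow.

```python
import numpy as np

def peer_group_tau(o, G, dists):
    """Weighted Mahalanobis distance between profile o and peer profiles G.

    o:     profile of the analyzed app, length l
    G:     l x g matrix, column j is the profile of the j-th peer
    dists: Euclidean distances d(A, j) from (8), length g
    """
    o = np.asarray(o, dtype=float)
    G = np.asarray(G, dtype=float)
    # (10): divide each feature row by its maximum over o and all peers.
    row_max = np.maximum(o, G.max(axis=1))
    row_max[row_max == 0] = 1.0  # assumption: leave all-zero rows unchanged
    o_n, G_n = o / row_max, G / row_max[:, None]
    # (14)-(15): proximity-based weights of the peers.
    prox = np.exp(-np.asarray(dists, dtype=float))
    w = prox / prox.sum()
    # (12): weighted mean profile.
    m = G_n @ w
    # (13): covariance of the features across peers (rows are features).
    S = np.atleast_2d(np.cov(G_n, bias=True))
    # (11): pseudo-inverse used because S may be singular (assumption).
    diff = o_n - m
    tau_sq = diff @ np.linalg.pinv(S) @ diff
    return float(np.sqrt(max(tau_sq, 0.0)))

def is_malicious(tau, t_s):
    # (16): flag the app when tau exceeds the threshold.
    return tau > t_s

# Toy peer profiles with two features and four peers (illustrative only).
toy_G = [[1, 2, 3, 2], [10, 10, 12, 12]]
toy_dists = [0.1, 0.1, 0.1, 0.1]
tau_typical = peer_group_tau([2, 11], toy_G, toy_dists)
tau_outlier = peer_group_tau([30, 11], toy_G, toy_dists)
```

A profile equal to the weighted mean of its peers yields a value of $\tau$ close to zero, whereas the outlying profile exceeds a threshold such as $t_s = 2$.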

Figure 7: Experimental setup for the detection of mobile malicious apps. Testing mobile devices (HUAWEI and MI phones) running the app samples connect to the Internet through an Ubuntu 16.04 server, which functions as (i) access point, (ii) traffic collection, and (iii) data analysis.

6. Evaluation

6.1. Data Collection

Experimental Setup. The experimental setup used for AppFA is shown in Figure 7. We have implemented a prototype of AppFA with the help of ourmon [39] and CCCG [40]. Ourmon is an open-source network monitoring and anomaly detection system, and CCCG is a general framework for constrained clustering. An Ubuntu 16.04 computer has been configured as the access point. For packet capturing, the smart phone is connected to the Internet by Wi-Fi, and network traffic is collected at the access point by tcpdump (http://www.tcpdump.org). Each pcap file is limited to 100 MB. When packet capture is complete, the TCP and UDP flows are split from the pcap files by SplitCap (https://www.netresec.com/?page=SplitCap). For apps' traffic clustering and network behavior profile construction, we use tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to extract IP addresses, packet sizes, and packet interval times from network flows. After that, a Python program calculates the statistical features listed in Tables 3 and 4. The KPCA transformation is accomplished with the help of scikit-learn (http://scikit-learn.org/). The data analysis is also performed on the Ubuntu 16.04 computer (with 4 GB memory and a Pentium Dual-Core CPU T4500).

Traffic Generation. We use the public MalGenome [19] dataset in our experiments. Since we mainly focus on repackaging and updating malware, 93 typical information-collection malware samples, including repackaging and updating attack types, are selected from MalGenome to test the detection rate of AppFA (the malicious apps dataset was downloaded from http://www.malgenomeproject.org/ in 2015; the selected samples are the ones that run without any errors). We have installed all these malicious apps on a HUAWEI Honor 8 and a MI Note 4 phone and run them one by one 50 times to collect network traffic. In detail, these samples are analyzed and run by GroddDroid [41] to make sure that the malicious code is triggered. We also use GroddDroid to run other malware samples besides the selected ones, and their traffic is mainly used in the local detection described in Section 6.3. In particular, background traffic, including weather forecasting, email checking, QQ, and WeChat (QQ and WeChat are the most famous apps in China), is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

For testing the false positive rates of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware samples, the benign apps are run by droidbot [42] to make sure that enough traffic is generated. From the captured traffic, we take all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the Packet filter model is set to $-1$.

The data used in the experiments is summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware samples; thus it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is used to obtain the ground truth of the originator of network traffic. Recall that, in Algorithm 1, app traffic clustering must be started with signature seeds. In order to obtain apps' signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of the cluster number $p$ in Algorithm 2 is set to 97 (97 = 93 + 4), since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and WeChat. The experimental results are shown in Table 6.

As shown in Table 6, when 5 HTTP flows are chosen for each app to generate signature seeds, only 601 flows (out of 223220 flows in total) are misclustered. This demonstrates the effectiveness of our proposed method for app traffic clustering.

Experimental Results of Malicious App Detection. Similar to previous work, in order to measure the effectiveness of malicious app detection, the accuracy (detection rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively:

$$\text{Accuracy} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{17}$$

$$\text{False positive rate} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}. \tag{18}$$
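The two metrics are simple ratios; a small Python sketch is included for concreteness. The function name is hypothetical, and the false positive counts in the example are invented for illustration, whereas 82 detected samples out of 93 corresponds to the $p = 99$ case reported in this section.

```python
def detection_metrics(tp, fn, fp, tn):
    """Return (accuracy / detection rate, false positive rate)."""
    accuracy = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return accuracy, false_positive_rate

# 82 of 93 malicious apps detected; fp and tn are illustrative values only.
accuracy, fpr = detection_metrics(tp=82, fn=11, fp=2, tn=498)
```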

We first examine the detection rates and false positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). For the selected malicious apps, the apps with the same functionality (satisfying condition A) are determined by Google Play. We first analyze the malicious apps' functionalities manually and choose a keyword for each app. Then we search for the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of the member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8 and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are sufficient for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). Horizontal axis: $t_s$ (0.4 to 2); vertical axis: detection rate (0.5 to 1); curves: $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, when $g$ increases, the detection rates and the false positive rates change only slightly. This can be explained by the fact that the Mahalanobis distance $\tau$ defined in (11) is calculated using the weighted mean vector of profiles. The most similar app therefore has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.

Further, AppFA is also tested with different values of the cluster number $p$. Note that the actual cluster number is 97 for TrafficSet 1. In order to evaluate how app traffic clustering impacts the detection of repackaged malware, we slightly change the value of $p$; the experimental results are shown in Figure 12. In the experiments, we set $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. As shown in Figure 12, the cluster number indeed impacts the detection rate of AppFA. In particular, when $p = 99$, the detection rate is only 88%; namely, only 82


Table 5: Data summary.

Traffic set | Apps | Quantity | Number of collected flows (TCP and UDP) | Amount of data (bytes)
TrafficSet 1 | Selected malicious apps (for testing accuracy of traffic clustering and detection rates) | 93 | 223220 | 5.38 G
TrafficSet 2 | Benign apps (for testing false positive rates of AppFA) | 100 | 305103 | 6.40 G
TrafficSet 3 | All malicious apps (for malicious apps detection in local networks) | 1260 | ≈3.1 million | 18.72 G

Table 6: Experimental results of traffic clustering.

Number of HTTP flows for generating signature seeds | Number of flows correctly clustered | Number of misclustered flows
2 | 201357 | 21863
3 | 218946 | 4274
4 | 221384 | 1836
5 | 222619 | 601

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). Horizontal axis: $t_s$ (0.4 to 2); vertical axis: false positive rate (0.0005 to 0.004); curves: $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.

malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and $96$ in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA remains effective when a few apps are unknown during app traffic clustering.

Finally, we evaluate AppFA with the remaining malicious apps besides the selected 93 samples. The rest of the malicious apps contain repackaged and other types of malware. These samples were found to have some errors when run on our phones; however, they also produced considerable network traffic. We test AppFA with these apps to examine its general capability. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons are discussed in Section 6.4.

Figure 10: Detection rates of AppFA with different values of $g$. Horizontal axis: $g$ (1, 3, 5, 7, 10); vertical axis: detection rate (0 to 1).

Figure 11: False positive rates of AppFA with different values of $g$. Horizontal axis: $g$ (1, 3, 5, 7, 10); vertical axis: false positive rate ($\times 10^{-3}$, 0 to 7).

Note that AppFA is designed to perform nearly real-time malicious app detection; thus efficiency is also a major concern. Table 8 presents the computational performance for major


Figure 12: Detection rates of AppFA with different values of $p$. Horizontal axis: value of $p$ (95, 96, 98, 99); vertical axis: detection rate (0 to 1).

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

Detection rate | False positive rate
73.4% | 3.5%

Table 8: The computational performance.

Procedures | Time
Constrained flow clustering (Algorithm 1) | 3 min
App identification (Algorithm 2) | 5.2 min
Network behavior profile construction | 1.7 min
Peer group analysis | 45 sec

steps in terms of average running time using the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, such as cloud resources, can further be used to speed up malicious app detection and scale to more Internet traffic data.

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses statistical traffic features to detect malicious Android apps. Thus AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we have created a new Android repackaged malware sample that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot's source code so that its network connections are encrypted by the TLS protocol. Then we graft it into a popular game app, air.com.aceviral.motox3. Finally, we collect the network traffic of the repackaged air.com.aceviral.motox3 and its peer group. In the experiments, the selected peer group of the repackaged air.com.aceviral.motox3 is com.wordmobiles.bikeRacing, com.topfreegames.bikeracefreeworld, com.skgames.trafficrider, com.tomico.wheeliechallenge, and madmotoracingbike. We repeat the detection of the repackaged air.com.aceviral.motox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

Table 9: Experimental results when searching similar apps locally.

Detection rate | False positive rate
69.2% | 4.7%

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps' peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed and is shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities, and thus their context (keywords) may be similar. Therefore, we use web-page searching technologies to match similar apps. The $Matching$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) from the results returned by Algorithm 3.
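One possible realization of the $Matching$ function is the cosine similarity between TF-IDF vectors of the keyword sets. The Python sketch below is written under that assumption; the smoothed IDF formula, the function name `rank_similar_apps`, and the toy keyword lists are illustrative and do not reproduce our implementation.

```python
import math
from collections import Counter

def rank_similar_apps(target_keywords, identified, m):
    """Return the m identified apps most similar to the target app.

    target_keywords: keywords W(A) of the tested app
    identified:      dict mapping app name -> keyword list W(i)
    """
    all_docs = list(identified.values()) + [target_keywords]
    n_docs = len(all_docs)
    # Document frequency of each keyword across all keyword sets.
    df = Counter(term for words in all_docs for term in set(words))

    def tfidf(words):
        tf = Counter(words)
        total = len(words) or 1
        # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1.
        return {t: (c / total) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    target_vec = tfidf(target_keywords)
    scores = {app: cosine(target_vec, tfidf(words))
              for app, words in identified.items()}
    # Sort by decreasing score and keep the first m apps.
    return sorted(scores, key=scores.get, reverse=True)[:m]

# Toy keyword sets (illustrative only).
identified_apps = {
    "mail_a": ["smtp", "inbox", "mail"],
    "game_b": ["score", "level", "ads"],
    "mail_c": ["mail", "imap"],
}
similar = rank_similar_apps(["mail", "inbox", "imap"], identified_apps, m=2)
```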

We have implemented the local similar-app search and evaluated AppFA's performance in local networks. All apps, including all malicious and benign apps, are considered as identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that the choice of peer groups significantly affects the final detection.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be due to the fact that other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to the packet sizes and the number of flows. The detection rate may therefore decline if the apps' network behaviors change only slightly.

Compared to the latest work done in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9% as reported in [6]). However, Garg et al.'s method [6] considered only 18 malware apps and 14 genuine apps and assumed


(1) extract all plaintext keywords from HTTP headers of the tested app $A$; denote the set of keywords as $W(A)$
(2) extract all plaintext keywords from HTTP headers of the other identified apps; denote the set of keywords of identified app $i$ as $W(i)$, $i = 0, 1, \ldots, n - 1$
(3) set $i = 0$; while $i < n$ do
(4)   $Scores[i].value = Matching(W(A), W(i))$
(5)   $Scores[i].app = i$
(6)   $i{+}{+}$
(7) end while
(8) sort $Scores[]$ by $Scores[i].value$ in decreasing order
(9) return the first $m$ entries $Scores[i].app$, $i = 0, 1, \ldots, m - 1$

Algorithm 3: Searching similar apps locally.

Table 10: Experimental results when it is known exactly which app generates which traffic. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

 | Selected malicious apps | Remaining malicious apps
Detection rate | 98.3% | 85%
False positive rate | 0.03% | 2.7%

that the traffic generated by apps was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account, and constrained clustering is used to determine to which app the flows belong. The accuracy of clustering is also a factor affecting the detection rate. To confirm this, we further adopt the same assumption as Garg et al.'s method [6]; namely, we assume the accuracy of Algorithm 1 is 100% and know exactly which app generates which traffic (in our experiments, we use the Packet Capture app to obtain the ground truth of which flows come from which app). The experimental results under this assumption are given in Table 10. The detection rates are clearly improved.

In practice, one can deploy our method together with other methods such as [6, 15]. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps, whereas [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. Our work can therefore complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features (refer to Table 4) of network traffic to detect malicious repackaged Android applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. This is, however, quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus the attackers must remove some normal network connections to keep the network behaviors unchanged and evade detection. However, the removal of normal network connections will impact the functionalities of apps and may cause errors. This gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may itself be abnormal. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then, Kernel Principal Component Analysis is employed to construct app network behavior profiles and distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps and avoid time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus it is convenient to use and its cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

Apps' network behaviors may be significantly changed by version updates and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61702282 and 61502250, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016, CX2016026).


References

[1] Android statistics, Number of android applications, September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, “Assessing Privacy Risks in Android: A User-Centric Approach,” in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] Does your mobile anti-virus app protect or infect you? The truth behind DU antivirus security, September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, “A survey of mobile malware in the wild,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, “The evolution of android malware and android analysis techniques,” ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, “Network-based detection of Android malicious apps,” International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, “Mobile malware detection through analysis of deviations in application network behavior,” Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, “PBMDS: a behavior-based malware detection system for cellphone devices,” in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec '10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, “Crowdroid: behavior-based malware detection system for android,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, “MADAM: a multi-level anomaly detector for android malware,” in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., “Vetting undesirable behaviors in android apps with permission use analysis,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, “Android Malware Clustering Analysis on Network-Level Behavior,” in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, “Malware detection on mobile devices using distributed machine learning,” in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, “Bring your own device security issues and challenges,” in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, “Simple and effective method for detecting abnormal internet behaviors of mobile devices,” Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] Global mobile OS market share in sales to end users from 1st quarter 2009 to 1st quarter 2017, 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] “IT threat evolution Q3 2017 statistics,” November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., “Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, “Dissecting android malware: characterization and evolution,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, “AndroSimilar: Robust signature for detecting variants of Android malware,” Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the anserverbot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., “Android security: a survey of issues, malware penetration, and defenses,” IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, “A survey on dynamic mobile malware detection,” Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, “Behavior-based malware detection on mobile phone,” in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, “Exposing mobile malware from the inside (or what is your mobile app really doing?),” Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, “The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones,” in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, “Identifying diverse usage behaviors of smartphone apps,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC '11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, “NetworkProfiler: Towards automatic fingerprinting of Android apps,” in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., “Flowr: A self-learning system for classifying mobile application traffic,” in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., “Automatic generation of mobile app signatures from traffic observations,” in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., “Automatically identifying apps in mobile traffic,” Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, “Identifying Mobile Applications for Encrypted Network Traffic,” in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, “A novel active website fingerprinting attack against Tor anonymous system,” in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, “A novel application classification attack against Tor,” Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, “I know why you went to the clinic: Risks and realization of HTTPS traffic analysis,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, “Internet traffic classification using constrained clustering,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, “A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine,” Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, “Stock fraud detection using peer group analysis,” Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] “Ourmon: network monitoring and anomaly detection system,” April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, “Constrained Clustering Using Column Generation,” in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, “GroddDroid: A gorilla for triggering malicious behaviors,” in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI-guided test input generator for android,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.



[Figure 4 diagram: flow sessions (packet features and contents) are processed by signature matching and then constrained clustering, yielding app sessions for App 1 and App 2.]

Figure 4: Illustration of the basic idea of app identification. Solid and dotted lines represent different apps, and black color indicates that the flows have not been identified.

(1) let $S$ denote the signature set and $S_d$ stand for the signature seeds
(2) while the clustering signal is received do
(3)   if $S$ is empty then
(4)     $S = S_d$
(5)   end if
(6)   carry out signature matching
(7)   carry out constrained flow clustering
(8)   update $S$ with the clustering results
(9) end while

Algorithm 1: App traffic clustering procedure.

Table 3: Identification feature set.

Time of the first packet sent
Number of packets
Volume of bytes
Min, max, mean, and std. dev. of packet size
Min, max, mean, and std. dev. of interpacket time
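The identification features of Table 3 can be computed per flow with a few lines of Python. The sketch below is illustrative only: the input format (a time-sorted list of `(timestamp, size)` tuples) and the function name `flow_features` are assumptions, not part of AppFA's published implementation.

```python
import statistics

def flow_features(packets):
    """Compute the Table 3 features for one flow.

    `packets` is a list of (timestamp_seconds, size_bytes) tuples sorted by
    time; this input format is an assumption made for illustration.
    """
    if not packets:
        raise ValueError("a flow must contain at least one packet")
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-packet times; a single-packet flow has none, so fall back to [0.0].
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]

    def summary(values):
        # Population standard deviation, so a single value yields 0.0.
        return (min(values), max(values),
                statistics.fmean(values), statistics.pstdev(values))

    return {
        "first_packet_time": times[0],
        "num_packets": len(packets),
        "num_bytes": sum(sizes),
        "size_min_max_mean_std": summary(sizes),
        "gap_min_max_mean_std": summary(gaps),
    }
```

For example, `flow_features([(0.0, 100), (0.5, 300), (1.5, 200)])` reports 3 packets and 600 bytes, with interpacket times of 0.5 s and 1.0 s.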

Table 4: Network behavior profile feature set.

Number of flows
Number of packets of outgoing and incoming flows
Volume of bytes of outgoing and incoming flows
First $k_1$ results of KPCA transformations of outgoing packet sizes
First $k_2$ results of KPCA transformations of incoming packet sizes
First $k_1$ results of KPCA transformations of outgoing packet intervals
First $k_2$ results of KPCA transformations of incoming packet intervals

of their malicious code) can all be recognized. Therefore, each cluster will correspond to one app when the constrained $K$-means clustering is finished. This is appropriate for the following network behavior profile construction and malicious behavior detection. In practice, if the actual number of apps is greater than $p$, namely, some apps cannot be identified by signature matching, the corresponding flows will be misclustered. This may change the network behaviors of apps, since unrelated flows will be included and traffic features such as the number of packets and the volume of bytes will be enlarged. In this work, we use Kernel Principal Component Analysis to mitigate this problem, as described in the following subsection.
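The constrained $K$-means clustering of Algorithm 2 can be sketched in Python with NumPy as follows. This is a minimal illustration, not the CCCG-based implementation used in the experiments. It makes one interpretive assumption: when the closest cluster violates a must-link or cannot-link constraint, the sketch tries the next-closest cluster (in the style of COP-KMeans) and leaves the flow unassigned (label -1) if no cluster is feasible. The function name and argument layout are hypothetical.

```python
import numpy as np

def constrained_kmeans(X, centers, must_link, cannot_link, n_iter=20):
    """Constrained K-means sketch in the spirit of Algorithm 2.

    X: (n, d) array of flow feature vectors.
    centers: (p, d) array of initial cluster centers.
    must_link / cannot_link: lists of index pairs (i, j).
    Returns an array of labels; -1 marks a flow that could not be placed
    without violating a constraint (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    centers = np.array(centers, dtype=float)
    n = len(X)
    ml = {i: set() for i in range(n)}
    cl = {i: set() for i in range(n)}
    for i, j in must_link:
        ml[i].add(j)
        ml[j].add(i)
    for i, j in cannot_link:
        cl[i].add(j)
        cl[j].add(i)

    labels = np.full(n, -1)
    for _ in range(n_iter):
        new_labels = np.full(n, -1)
        for i in range(n):
            # Try clusters from nearest to farthest.
            order = np.argsort(np.linalg.norm(centers - X[i], axis=1))
            for c in order:
                # A must-link partner already placed elsewhere forbids c.
                bad_ml = any(new_labels[j] not in (-1, c) for j in ml[i])
                # A cannot-link partner already placed in c forbids c.
                bad_cl = any(new_labels[j] == c for j in cl[i])
                if not bad_ml and not bad_cl:
                    new_labels[i] = c
                    break
        # Update each center as the mean of the flows assigned to it.
        for c in range(len(centers)):
            members = X[new_labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the procedure has converged
        labels = new_labels
    return labels
```

In AppFA's setting, the must-link and cannot-link pairs would come from flows already labeled by signature matching. Because flows are processed in order, the result of this sketch can depend on the order of the rows of `X`.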

5.2. Network Behavior Profile Construction. After the app identification, the network behavior profiles for apps are constructed from app sessions. In AppFA, we define the app's network behavior profile as a set of chosen network traffic features, as listed in Table 4. Formally, the network behavior profile can be defined as follows:

$$\mathit{netBehaviorProfile} := \{\mathit{feature}_1, \ldots, \mathit{feature}_n \mid \mathit{feature}_i \in \{\text{connection features}, \text{data features}\}\}. \tag{1}$$

Since malicious apps need to establish network connections to transmit confidential data or carry out defined attack steps [18], in this work we mainly choose connection features and data features for constructing apps' network behavior profiles, as shown in (1). The connection features describe how many network connections have been established, and the data features represent characteristics of packets. The full set of selected features that make up apps' network behavior profiles is listed in Table 4. In the table, the first two lines are connection features and the rest are data features.

Input: Data set $F$, cluster number $p$, must-link constraints $L_m \subseteq F \times F$, cannot-link constraints $L_c \subseteq F \times F$
Output: Flow clusters
(1) Let $c_1 \cdots c_p$ be the initial cluster centers
(2) for each flow $f_i$ in $F$ do
(3)   select the closest cluster $c$
(4)   for each $(f_i, f) \in L_m$ do
(5)     if $f \notin c$ then
(6)       goto step (2)
(7)     end if
(8)   end for
(9)   for each $(f_i, f) \in L_c$ do
(10)    if $f \in c$ then
(11)      goto step (2)
(12)    end if
(13)  end for
(14)  assign $f_i$ to the cluster $c$
(15) end for
(16) for each cluster $c_i$ do
(17)   update its center by averaging all of the flows $f_j$ that have been assigned to it
(18) end for
(19) iterate between step (2) and step (18) until convergence
(20) return $c_1 \cdots c_p$

Algorithm 2: Constrained flow clustering algorithm.

Furthermore, to overcome the misclassification problem (as illustrated in Section 5.1) and to distinguish minor traffic variations from significant differences, we do not use these features directly. Instead, KPCA (Kernel Principal Component Analysis) is applied to transform basic features such as packet size and packet time interval. Basically, KPCA is one approach of generalizing linear PCA to the nonlinear case using the kernel method. It has been shown that KPCA has the best performance in feature extraction and is robust to noise [37]. The details of KPCA used in AppFA are described as follows.

First, the Gaussian function defined in (2) is selected as the kernel function:

$$g(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right), \tag{2}$$

where the value of $\gamma$ is set to 0.001, as indicated in [37]. Then we compute a Gram (kernel) matrix $K$ with

$$K_{ij} = g\left(x^{(i)}, x^{(j)}\right). \tag{3}$$

Next, the kernel matrix $K$ is centered via the following function:

$$K_{\text{centered}} = K - 1_N K - K 1_N + 1_N K 1_N = (I - 1_N) K (I - 1_N), \tag{4}$$

where $1_N$ is an $N \times N$ matrix with all elements equal to $1/N$ and $N$ is the number of data points.

After that, the nonzero eigenvalues $\lambda_i$ and the eigenvectors $\alpha_i$ of the centered kernel matrix $K_{\text{centered}}$ are calculated as follows:

$$K_{\text{centered}} \alpha_i = \lambda_i \alpha_i. \tag{5}$$

Also, the eigenvectors $\alpha_i$ are normalized as

$$\alpha_i = \frac{1}{\sqrt{\lambda_i N}} \alpha_i. \tag{6}$$

Finally, we sort the eigenvectors in descending order of the corresponding eigenvalues and perform projections onto the given subset of eigenvectors. This step can be represented as follows:

$$d_j = \sum_{i=1}^{N} \alpha_{ji}\, g(x, x_i), \quad j = 1, 2, \ldots, k, \tag{7}$$

where $k$ is the dimension of the new data.

For each app, the length of its network behavior profile is $(1 + 2 \ast (f_o + f_i) + 2 \ast (k_1 + k_2))$, where $f_o$ is the number of outgoing flows and $f_i$ is the number of incoming flows.
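The KPCA steps (2) to (7) can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the code used in the experiments (which relied on scikit-learn): the function name `kpca_fit_transform` is hypothetical, the eigenvector scaling follows (6) as read here, and the projection is computed for the training points themselves using the centered kernel matrix.

```python
import numpy as np

def kpca_fit_transform(X, k, gamma=0.001):
    """Project the rows of X onto the first k kernel principal components."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # (2)-(3): Gaussian kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    K = np.exp(-gamma * sq_dists)
    # (4): center the kernel matrix with the constant matrix 1_N.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # (5): eigen-decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    lam, alpha = eigvals[order], eigvecs[:, order]
    # Keep only clearly positive eigenvalues, so fewer than k columns may
    # be returned for degenerate inputs (an assumption of this sketch).
    keep = lam > 1e-12
    lam, alpha = lam[keep], alpha[:, keep]
    # (6): rescale the eigenvectors.
    alpha = alpha / np.sqrt(lam * n)
    # (7): projections of the training points onto the kept components.
    return Kc @ alpha
```

scikit-learn's `KernelPCA` with `kernel="rbf"` provides a library implementation of the same technique; its output scaling and signs may differ from this sketch.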

5.3. Malicious App Detection. With apps' network behavior profiles, we propose to use peer group analysis [38] to detect malicious apps. The main idea of the detection method is illustrated in Figure 5. An app's network behavior profile is compared with both its historical data and the profiles of its peer group for malware detection. Comparison with the historical profiles helps find self-updating malicious apps [7], and comparison with the profiles of its peer group can detect repackaged malicious apps, as observed in Section 2.

For AppFA, an app's peer group is defined as the set of its behavior-similar apps. Apps are considered behavior-similar if they satisfy the following two conditions: (A) their main functionality is similar, for example, all for mailing; (B) their network behavior profiles are similar. If either of the two conditions is not met, the peer group will be empty and we will only compare apps' profiles with their historical data.


[Figure 5 diagram: the profile of App1, for example <12, 100, 5000>, is compared with its historical data (<11, 110, 5020>, <10, 130, 4600>, <12, 100, 5000>, <12, 101, 5010>) and with the profiles of its peer group (App2 <20, 400, 6000>, App3 <30, 500, 7000>, App4 <25, 450, 6500>, App5 <19, 370, 5580>). The profile is similar to the history, so App1 has not been injected with malicious code. The profile is significantly different from its peer group (behavior-similar apps), so App1 may be a fake app. After detection, the peer group of App1 is updated.]

Figure 5: Illustration of the procedure of malicious app detection. The numbers in angle brackets are examples of features listed in Table 4.

Figure 6: Example of similar apps recommended by Google Play.

In practice, one can resort to app stores such as Google Play to find candidates that meet condition A. Figure 6 shows that, when viewing the details of an app, Google Play recommends similar ones. These similar apps are determined by several features, such as the category of the apps, keywords in the title and description, and the size of the apk (https://www.quora.com/How-does-the-Google-Play-Store-determine-similar-apps). Generally, these recommended apps belong to the same category and have similar functionality. Therefore, in order to determine the peer group of an app, we first use the similar apps recommended by app stores such as Google Play as candidates and later filter them by condition B.

For condition B, the Euclidean distance is used to measure the similarity of network behavior profiles. Suppose $A$ is the app to be analyzed and $B$ is an app satisfying condition A for $A$. The similarity of network behavior profiles between $A$ and $B$ is calculated as

$$d(A, B) = \sqrt{(\mathit{nBP}(A) - \mathit{nBP}(B)) (\mathit{nBP}(A) - \mathit{nBP}(B))^T}, \tag{8}$$

where $\mathit{nBP}$ is the abbreviation of $\mathit{netBehaviorProfile}$ defined in (1).

Once the similarities between $A$ and the recommended apps are calculated by (8), the results are sorted in order of increasing distance (i.e., decreasing similarity). The first $g$ apps are selected as the peer group members of $A$, as shown in (9). The optimal value of $g$ can be determined experimentally, as discussed in the next section. Note that the peer group members can be updated. For example, if one of the peer group members has been flagged as malicious, it will be removed from the peer group and a new one will be added. AppFA can also reselect the peer groups every $T$ time interval.

$$\mathit{peerGroup}(A) := \{\mathit{app}_1, \ldots, \mathit{app}_g\}, \quad \text{where } d(A, \mathit{app}_i) \le d(A, \mathit{app}_j) \text{ if } i < j. \tag{9}$$
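Peer group selection by (8) and (9) amounts to ranking the candidate apps by Euclidean distance and keeping the $g$ closest. The following sketch assumes that candidate profiles are already available as equal-length vectors; the function name `select_peer_group` and the dictionary input are hypothetical.

```python
import numpy as np

def select_peer_group(target, candidates, g=5):
    """Return the g candidate apps closest to `target`, plus all distances.

    target: 1-D profile vector of app A.
    candidates: dict mapping app name -> profile vector of the same length.
    """
    target = np.asarray(target, dtype=float)
    # (8): Euclidean distance between the profile of A and each candidate.
    dists = {name: float(np.linalg.norm(target - np.asarray(p, dtype=float)))
             for name, p in candidates.items()}
    # (9): sort by increasing distance (decreasing similarity), keep g apps.
    ranked = sorted(dists, key=dists.get)
    return ranked[:g], dists
```

With the example profiles of Figure 5 (target `[12, 100, 5000]` and candidates App2 to App5), `g=2` selects App5 and then App2.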

As illustrated in Figures 2 and 5, for an identified app, AppFA first constructs its network behavior profile and then compares the profile with both its historical profiles and the profiles of its peer group for malicious app detection. Denote $o$ as the profile of the identified app $A$ and $G$ as the matrix of the compared profiles (historical profiles or peer group profiles). Each column of $G$ is one profile and the column length is $l$ ($l = 1 + 2 \ast (f_o + f_i) + 2 \ast (k_1 + k_2)$). AppFA makes the feature vectors the same length by padding zeros.
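The zero-padding step can be illustrated as follows. The helper name `stack_profiles` is hypothetical, and padding at the end of each vector is an assumption, since the text does not say where the zeros are placed.

```python
import numpy as np

def stack_profiles(profiles):
    """Zero-pad 1-D profiles to a common length and stack them as columns."""
    length = max(len(p) for p in profiles)
    G = np.zeros((length, len(profiles)))
    for j, p in enumerate(profiles):
        # Copy the profile into column j; the remaining rows stay zero.
        G[:len(p), j] = p
    return G
```

The returned matrix has one profile per column, matching the layout of $G$ described above.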

For peer group analysis, $o$ and $G$ are first normalized by the min-max normalization procedure defined in

$$o_i = \frac{o_i}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \qquad G_{ij} = \frac{G_{ij}}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \qquad i = 1, 2, \ldots, l, \; j = 1, 2, \ldots, g. \tag{10}$$

Then the distance between $o$ and $G$ is calculated by the Mahalanobis distance

$$\tau = \sqrt{(o - m)^T S^{-1} (o - m)}, \tag{11}$$

where $m$ is the weighted mean vector of profiles,

$$m_{i1} = \sum_{j=1}^{g} w_{Aj} G_{ij}, \quad 1 \le i \le l, \tag{12}$$

and $S$ is the covariance matrix, calculated in

$$S = E\left[(G - E[G])(G - E[G])^T\right]. \tag{13}$$

In (12), $w_{Aj}$ is the weight of the $j$th closest peer group member of the analyzed app $A$. The weights of peer group members are obtained from their proximity to $A$. In detail, we define the proximity of the $j$th closest peer group member of the target app $A$ in

$$\mathit{prox}_{Aj} = \exp(-d(A, j)), \tag{14}$$

where $d(A, j)$ is the Euclidean distance between app $A$ and its $j$th closest peer group member, defined in (8). Based on the proximity measure defined above, the weight of the $j$th closest peer group member of app $A$ is defined in

$$w_{Aj} = \frac{\mathit{prox}_{Aj}}{\sum_{j=1}^{g} \mathit{prox}_{Aj}}. \tag{15}$$

Finally, the state of apps is judged as follows:

$$\tau > t_s: \text{malicious app}; \qquad \tau \le t_s: \text{normal app}, \tag{16}$$

where $t_s$ is the threshold, which can be set to different values by network administrators based on the actual network condition. We evaluated different values of $t_s$ in our experiment.
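The scoring steps (10) to (16) can be put together in a short NumPy sketch. It is illustrative only and makes several assumptions that the text does not specify: a small ridge term `eps` is added to $S$ so that it can be inverted when there are fewer profiles than features, the distances are shifted by their minimum before exponentiation (which leaves the weights of (15) unchanged but avoids numerical underflow), and features whose maximum is zero are left unscaled. The function names are hypothetical.

```python
import numpy as np

def peer_group_score(o, G, peer_dists, eps=1e-3):
    """Mahalanobis distance tau between profile o and compared profiles G.

    o: (l,) profile of the analyzed app.
    G: (l, g) matrix whose columns are the compared profiles.
    peer_dists: (g,) Euclidean distances d(A, j) computed as in (8).
    eps: ridge term added to S (an assumption, not part of the paper).
    """
    o = np.asarray(o, dtype=float)
    G = np.asarray(G, dtype=float)
    # (10): divide each feature by its maximum over o and the g profiles.
    scale = np.maximum(o, G.max(axis=1))
    scale[scale == 0] = 1.0  # leave all-zero (padded) features untouched
    o, G = o / scale, G / scale[:, None]
    # (14)-(15): proximity weights; subtracting the minimum distance does
    # not change the normalized weights but prevents exp() from underflowing.
    d = np.asarray(peer_dists, dtype=float)
    prox = np.exp(-(d - d.min()))
    w = prox / prox.sum()
    # (12): weighted mean profile.
    m = G @ w
    # (13): covariance of the features across the compared profiles.
    S = np.atleast_2d(np.cov(G, bias=True)) + eps * np.eye(len(o))
    # (11): Mahalanobis distance, computed with a linear solve.
    diff = o - m
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

def classify(tau, t_s=2.0):
    # (16): flag the app when the distance exceeds the threshold t_s.
    return "malicious" if tau > t_s else "normal"
```

The choice of `eps` affects the scale of $\tau$, so a threshold tuned for one value of `eps` should not be reused for another.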

[Figure 7 diagram: testing mobile devices (HUAWEI and MI phones) running the sample apps connect to the Internet through a server running Ubuntu 16.04, whose functions are (i) access point, (ii) traffic collection, and (iii) data analysis.]

Figure 7: Experimental setup for the detection of mobile malicious apps.

6. Evaluation

6.1. Data Collection

Experimental Setup. The experimental setup used for AppFA is shown in Figure 7. We have implemented a prototype of AppFA with the help of ourmon [39] and CCCG [40]. Ourmon is an open-source network monitoring and anomaly detection system, and CCCG is a general framework for constrained clustering. An Ubuntu 16.04 computer has been configured as the access point. For packet capturing, the smartphone is connected to the Internet by Wi-Fi, and network traffic is collected at the access point by tcpdump (http://www.tcpdump.org). Each pcap file is limited to 100 MB. When packet capture is complete, the TCP and UDP flows are split from the pcap files by SplitCap (https://www.netresec.com/?page=SplitCap). For apps' traffic clustering and network behavior profile construction, we use tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to extract IP addresses, packet sizes, and packet interval times from network flows. After that, a Python program is written to calculate the statistical features listed in Tables 3 and 4. The KPCA transformation is accomplished with the help of scikit-learn (http://scikit-learn.org). The data analysis is also completed on the Ubuntu 16.04 computer (with 4 GB memory and a Pentium Dual-Core CPU T4500).

Traffic Generation. We use the public MalGenome [19] dataset in our experiments. Since we mainly focus on repackaging and updating malware, 93 typical information-collection malware samples, covering the repackaging and updating attack types, are selected from MalGenome to test the detection rate of AppFA (the malicious apps dataset was downloaded from http://www.malgenomeproject.org in 2015; the selected samples are the ones that run without any errors). We installed all these malicious apps on a HUAWEI Honor 8 and a MI Note 4 phone and ran them one by one 50 times to collect network traffic. In detail, these samples are analyzed and run by GroddDroid [41] to make sure that the malicious code is triggered. We also use GroddDroid to run the malware other than the selected samples, and their traffic is mainly used in the local detection described in Section 6.3. In particular, background traffic, including weather forecasting, email checking, QQ, and Webchat (QQ and Webchat are the most famous apps in China), is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

For testing the false positive rates of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware, the benign apps are run by droidbot [42] to make sure that enough traffic is generated. From the captured traffic, we take all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the Packet filter model is set to $-1$.

The data used in the experiments is summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware samples; thus it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is used to get the ground truth of the originator of network traffic. Recall that, in Algorithm 1, app traffic clustering must be started with signature seeds. In order to obtain apps' signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of the cluster number $p$ in Algorithm 2 is set to 97 (97 = 93 + 4), since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and Webchat. The experimental results are shown in Table 6.

As shown in Table 6, when 5 HTTP flows are chosen for each app to generate signature seeds, only 601 flows (out of 223220 flows in total) are misclustered. This demonstrates the effectiveness of our proposed method for app traffic clustering.

Experimental Results of Malicious App Detection. Similar to previous work, in order to measure the effectiveness of malicious app detection, the accuracy (detection rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively.

$$\text{Accuracy} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{17}$$

$$\text{False positive rate} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}. \tag{18}$$
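As a quick illustration of (17) and (18), the two metrics can be computed from confusion counts as follows. The function name is hypothetical, and the false positive and true negative counts in the usage note are made-up values for illustration only.

```python
def detection_metrics(tp, fn, fp, tn):
    """Return (detection rate, false positive rate) from confusion counts."""
    detection_rate = tp / (tp + fn)        # (17)
    false_positive_rate = fp / (fp + tn)   # (18)
    return detection_rate, false_positive_rate
```

For instance, detecting 82 of the 93 selected malicious apps corresponds to a detection rate of 82 / 93, about 88%; with hypothetical counts of 1 false positive among 100 benign apps, the false positive rate would be 1%.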

We first examine the detection rates and false positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). For the selected malicious apps, the apps with the same functionality (satisfying condition A) are determined by Google Play. We first analyze the malicious apps' functionalities manually and choose a keyword for each app. Then we search for the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of the member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8 and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are good enough for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

[Figure 8 plot: detection rate (0.5 to 1) versus $t_s$ (0.4 to 2), with curves for $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.]

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$).

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, when $g$ increases, the detection rates and the false positive rates change only slightly. This can be explained by the fact that the Mahalanobis distance $\tau$ defined in (11) is calculated from the weighted mean vector of profiles: the most similar app has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.

Further AppFA with different value of cluster number 119901is also tested Note that the actual cluster number is 97 forTrafficSet 1 In order to evaluate how app traffic clusteringimpacts the detection of repackaged malware we slightlychange the value of 119901 and the experiments results are asshown in Figure 12 In the experiments we set 1198961 = 1198962 = 4119905119904 = 2 and 119892 = 5 As shown in Figure 12 the cluster numberindeed impacts the detection rate of AppFA Particularlywhen 119901 = 99 the detection rate is only 88 namely only 82

Security and Communication Networks 11

Table 5 Data summary

Traffic set Apps Quantity Number of collectedflows (TCP and UPD) Amount of data (bytes)

TrafficSet 1 Selected malicious apps(for testing accuracy of traffic clustering and detection rates) 93 223220 538G

TrafficSet 2 Benign apps(for testing false positive rates of AppFA) 100 305103 640G

TrafficSet 3 All malicious apps(for malicious apps detection in local networks) 1260 asymp31 million 1872G

Table 6 Experimental results of traffic clustering

Number of HTTP flows forgenerating signature seeds

Number of flowscorrectly clustered

Number ofmisclustered

flows2 201357 218633 218946 42744 221384 18365 222619 601

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). (Plot: false positive rate versus $t_s$ from 0.4 to 2, with one curve each for $k_1 = k_2 = 1$, 4, and 7.)

malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and 96 in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA can tolerate a few unknown apps when carrying out app traffic clustering.

Finally, we evaluate AppFA with the remaining malicious apps, that is, those other than the selected 93 samples. These apps contain repackaged and other types of malware. They were found to produce some errors when run on our phones; however, they still generated considerable network traffic. We test AppFA with these apps to examine its general capability. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results

Figure 10: Detection rates of AppFA with different values of $g$. (Plot: detection rate versus $g$ for $g = 1$, 3, 5, 7, and 10.)

Figure 11: False positive rates of AppFA with different values of $g$. (Plot: false positive rate, on a $\times 10^{-3}$ scale, versus $g$ for $g = 1$, 3, 5, 7, and 10.)

displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons are discussed in Section 6.4.

Note that AppFA is designed to perform nearly real-time detection of malicious apps; thus, efficiency is also a major concern. Table 8 presents the computational performance for major


Figure 12: Detection rates of AppFA with different values of $p$. (Plot: detection rate versus the value of $p$, with axis ticks at 95, 96, 98, and 99.)

Table 7: Experimental results for the malicious apps other than the selected 93 samples.

| Detection rate | False positive rate |
|---|---|
| 73.4% | 3.5% |

Table 8: The computational performance.

| Procedures | Time |
|---|---|
| Constrained flow clustering (Algorithm 1) | 3 min |
| App identification (Algorithm 2) | 52 min |
| Network behavior profile construction | 17 min |
| Peer group analysis | 45 sec |

steps in terms of average running time on the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, for example, on cloud resources, can further be used to speed up malicious app detection and scale to more Internet traffic data.

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses statistical traffic features to detect malicious Android apps. Thus, AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we have created a new Android repackaged malware that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot's source code to make sure that its network connections are encrypted with the TLS protocol. Then we graft it into a popular game app, aircomaceviralmotox3. Finally, we collect the network traffic of the repackaged aircomaceviralmotox3 and its peer group. In the experiments

Table 9: Experimental results when searching similar apps locally.

| Detection rate | False positive rate |
|---|---|
| 69.2% | 4.7% |

the selected peer group of the repackaged aircomaceviralmotox3 is comwordmobilesbikeRacing, comtopfreegamesbikeracefreeworld, comskgamestrafficrider, comtomicowheeliechallenge, and madmotoracingbike. We repeat the detection of the repackaged aircomaceviralmotox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps' peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, so the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed, as shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities, and hence their context (keywords) may be similar. Therefore, we use web-page searching technologies to match similar apps. The $Matching$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) from the results returned by Algorithm 3.
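To make the matching step concrete, the following is a minimal sketch of how the $Matching$ function of Algorithm 3 could be realized with TF-IDF weights. It is an illustration under stated assumptions, not AppFA's implementation: the function names (`tfidf_vectors`, `cosine`, `search_similar_apps`), the smoothed IDF formula, and the use of cosine similarity as the score are choices made for this example, and the keyword lists stand in for keywords extracted from HTTP headers.

```python
import math
from collections import Counter


def tfidf_vectors(keyword_lists):
    """Build one TF-IDF weight dict per app from its list of keywords."""
    n_docs = len(keyword_lists)
    # Document frequency: in how many apps each keyword appears.
    df = Counter()
    for words in keyword_lists:
        df.update(set(words))
    vectors = []
    for words in keyword_lists:
        tf = Counter(words)
        total = sum(tf.values()) or 1
        # Smoothed IDF (an assumption of this sketch) keeps all weights positive.
        vectors.append({
            w: (c / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search_similar_apps(tested_keywords, identified, m):
    """Return the names of the m identified apps most similar to the tested app.

    identified maps a (hypothetical) app name to its keyword list.
    """
    names = list(identified)
    vectors = tfidf_vectors([tested_keywords] + [identified[n] for n in names])
    query = vectors[0]
    scores = [(cosine(query, vec), name) for name, vec in zip(names, vectors[1:])]
    # Sort by decreasing score, as in step (8) of Algorithm 3.
    scores.sort(key=lambda s: s[0], reverse=True)
    return [name for _, name in scores[:m]]
```

The returned candidates would then be filtered by profile distance, as in (8), to form the peer group.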

We have implemented the local similar-app search and evaluated AppFA's performance in local networks. All apps, including all malicious and benign apps, are considered identified apps, and Algorithm 3 is carried out to find similar apps. With $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that the choice of peer groups significantly affects the final detection.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS messages without notification. This may be because other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to packet sizes and the number of flows, so the detection rate may decline if the apps' network behaviors are only slightly changed.

Compared to the latest work in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9%, as reported in [6]). However, Garg et al.'s method [6] considered only 18 malware apps and 14 genuine apps and assumed


(1) extract all plaintext keywords from the HTTP headers of the tested app $A$; denote the set of keywords as $W(A)$
(2) extract all plaintext keywords from the HTTP headers of the other identified apps; denote the set of keywords of identified app $i$ as $W(i)$, $i = 0, 1, \ldots, n - 1$
(3) $i = 0$; while $i < n$ do
(4)     $Scores[i].value = Matching(W(A), W(i))$
(5)     $Scores[i].app = i$
(6)     $i$++
(7) end while
(8) sort $Scores[]$ by $Scores[i].value$ in decreasing order
(9) return the first $m$ entries $Scores[i].app$, $i = 0, 1, \ldots, m - 1$

Algorithm 3: Searching similar apps locally.

Table 10: Experimental results when it is known exactly which app generated which traffic. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

| | Selected malicious apps | Remaining malicious apps |
|---|---|---|
| Detection rate | 98.3% | 85% |
| False positive rate | 0.03% | 2.7% |

that the traffic generated by each app was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account, and constrained clustering is used to determine which app each flow belongs to. The accuracy of clustering is therefore also a factor affecting the detection rate. To confirm this, we further adopt the same assumption as Garg et al.'s method [6]; namely, we assume that the accuracy of Algorithm 1 is 100% and that we know exactly which app generated which traffic (in our experiments, we use the app Packet Capture to obtain perfect ground truth of which flows came from which app). Under this assumption, the experimental results are given in Table 10. The detection rates are clearly much improved.

In practice, one can deploy our method together with other methods such as [6, 15]. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps, whereas [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. Our work can therefore complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features of network traffic (refer to Table 4) to detect malicious repackaged Android applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. However, this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus, to evade detection, the attackers must remove some normal network connections to keep the network behavior the same. However, the removal of normal network connections will impact the functionalities of the apps and may cause errors, which gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may itself be abnormal. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then, Kernel Principal Component Analysis is employed to construct app network behavior profiles and to distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps, which avoids time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus, it is convenient to use and its cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The apps' network behaviors may change significantly after a version update and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and to detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61702282 and 61502250, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016CX2016026).


References

[1] Android statistics, Number of android applications, September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, “Assessing Privacy Risks in Android: A User-Centric Approach,” in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] Does your mobile anti-virus app protect or infect you? The truth behind DU Antivirus Security, September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, “A survey of mobile malware in the wild,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, “The evolution of android malware and android analysis techniques,” ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, “Network-based detection of Android malicious apps,” International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, “Mobile malware detection through analysis of deviations in application network behavior,” Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, “PBMDS: a behavior-based malware detection system for cellphone devices,” in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec ’10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, “Crowdroid: behavior-based malware detection system for android,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, “MADAM: a multi-level anomaly detector for android malware,” in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., “Vetting undesirable behaviors in android apps with permission use analysis,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, “Android Malware Clustering Analysis on Network-Level Behavior,” in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, “Malware detection on mobile devices using distributed machine learning,” in Proceedings of the 20th International Conference on Pattern Recognition (ICPR ’10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, “Bring your own device security issues and challenges,” in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, “Simple and effective method for detecting abnormal internet behaviors of mobile devices,” Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] Global mobile os market share in sales to end users from 1st quarter 2009 to 1st quarter 2017, 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] “It threat evolution q3 2017 statistics,” November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., “Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, “Dissecting android malware: characterization and evolution,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, “AndroSimilar: Robust signature for detecting variants of Android malware,” Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the anserverbot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., “Android security: a survey of issues, malware penetration, and defenses,” IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, “A survey on dynamic mobile malware detection,” Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, “Behavior-based malware detection on mobile phone,” in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, “Exposing mobile malware from the inside (or what is your mobile app really doing?),” Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, “The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones,” in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, “Identifying diverse usage behaviors of smartphone apps,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC ’11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, “NetworkProfiler: Towards automatic fingerprinting of Android apps,” in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., “Flowr: A self-learning system for classifying mobile application traffic,” in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., “Automatic generation of mobile app signatures from traffic observations,” in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., “Automatically identifying apps in mobile traffic,” Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, “Identifying Mobile Applications for Encrypted Network Traffic,” in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, “A novel active website fingerprinting attack against Tor anonymous system,” in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, “A novel application classification attack against Tor,” Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, “I know why you went to the clinic: Risks and realization of HTTPS traffic analysis,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, “Internet traffic classification using constrained clustering,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, “A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine,” Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, “Stock fraud detection using peer group analysis,” Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] “Ourmon: network monitoring and anomaly detection system,” April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, “Constrained Clustering Using Column Generation,” in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, “GroddDroid: A gorilla for triggering malicious behaviors,” in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI-guided test input generator for android,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.


Input: Data set $F$, cluster number $p$, must-link constraints $L_m \subseteq F \times F$, cannot-link constraints $L_c \subseteq F \times F$
Output: Flow clusters
(1) Let $c_1, \ldots, c_p$ be the initial cluster centers
(2) for each flow $f_i$ in $F$ do
(3)     select the closest cluster $c$
(4)     for each $(f_i, f) \in L_m$ do
(5)         if $f \notin c$ then
(6)             goto step (2)
(7)         end if
(8)     end for
(9)     for each $(f_i, f) \in L_c$ do
(10)         if $f \in c$ then
(11)             goto step (2)
(12)         end if
(13)     end for
(14)     assign $f_i$ to the cluster $c$
(15) end for
(16) for each cluster $c_i$ do
(17)     update its center by averaging all of the flows $f_j$ that have been assigned to it
(18) end for
(19) iterate between step (2) and step (18) until convergence
(20) return $c_1, \ldots, c_p$

Algorithm 2: Constrained flow clustering algorithm.
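The constrained flow clustering above can be sketched in a few lines of Python. This is a hedged illustration in the style of COP-KMeans, not the CCCG-based prototype: the names `violates` and `constrained_kmeans` are hypothetical, flows are plain numeric tuples referred to by index, and the rule that a flow falls back to its next-closest cluster when a constraint is violated is an assumption of this sketch.

```python
import math


def violates(idx, cluster, assignment, must_link, cannot_link):
    """Return True if putting flow idx into cluster breaks a constraint."""
    for a, b in must_link:
        if idx in (a, b):
            other = b if a == idx else a
            # Must-link partner already sits in a different cluster.
            if assignment.get(other) not in (None, cluster):
                return True
    for a, b in cannot_link:
        if idx in (a, b):
            other = b if a == idx else a
            # Cannot-link partner already sits in this cluster.
            if assignment.get(other) == cluster:
                return True
    return False


def constrained_kmeans(flows, centers, must_link, cannot_link, max_iter=100):
    """Cluster flows (feature tuples) around the given initial centers."""
    centers = [list(c) for c in centers]
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {}
        for idx, flow in enumerate(flows):
            # Try clusters from closest to farthest (assumed fallback rule).
            order = sorted(range(len(centers)),
                           key=lambda c: math.dist(flow, centers[c]))
            for c in order:
                if not violates(idx, c, new_assignment, must_link, cannot_link):
                    new_assignment[idx] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for flow {idx}")
        # Update each center as the mean of the flows assigned to it.
        for c in range(len(centers)):
            members = [flows[i] for i, a in new_assignment.items() if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assignment == assignment:
            break  # assignments stopped changing: converged
        assignment = new_assignment
    return assignment, centers
```

In this reading, must-link pairs could come from flows known to share a signature seed, and cannot-link pairs from flows matched to different apps; how the constraints are derived is not specified by this sketch.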

is one approach to generalizing linear PCA to the nonlinear case using the kernel method. It has been reported that KPCA performs best in feature extraction and is robust to noise [37]. The details of KPCA as used in AppFA are described as follows.

First, the Gaussian function defined in (2) is selected as the kernel function:

$$g(x, y) = \exp\left(-\gamma \left\| x - y \right\|^{2}\right), \tag{2}$$

where the value of $\gamma$ is set to 0.001, as indicated in [37]. Then we compute a Gram (kernel) matrix $K$ with

$$K_{ij} = g\left(x^{(i)}, x^{(j)}\right). \tag{3}$$

Next, the kernel matrix $K$ is centered via the following function:

$$K_{\mathrm{centered}} = K - 1_N K - K 1_N + 1_N K 1_N = (I - 1_N)\, K\, (I - 1_N), \tag{4}$$

where $1_N$ is an $N \times N$ matrix with all elements equal to $1/N$ and $N$ is the number of data points.

After that, the nonzero eigenvalues $\lambda_i$ and the eigenvectors $\alpha_i$ of the centered kernel matrix $K_{\mathrm{centered}}$ are calculated as follows:

$$K_{\mathrm{centered}}\, \alpha_i = \lambda_i \alpha_i. \tag{5}$$

Also, the eigenvectors $\alpha_i$ are normalized as

$$\alpha_i = \frac{1}{\sqrt{\lambda_i N}}\, \alpha_i. \tag{6}$$

Finally, we sort the eigenvectors in descending order of the corresponding eigenvalues and perform projections onto the given subset of eigenvectors. This step can be represented as follows:

$$d_j = \sum_{i=1}^{N} \alpha_{ji}\, g(x, x_i), \quad j = 1, 2, \ldots, k, \tag{7}$$

where $k$ is the dimension of the new data.

For each app, the length of its network behavior profile is $1 + 2 * (f_o + f_i) + 2 * (k_1 + k_2)$, where $f_o$ is the number of outgoing flows and $f_i$ is the number of incoming flows.
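As a small worked illustration of the first KPCA steps, the sketch below computes the Gaussian Gram matrix of (2) and (3) and centers it, reading the centering in (4) as $(I - 1_N) K (I - 1_N)$. It uses only the Python standard library and stops before the eigendecomposition; the function names are hypothetical, and the paper's prototype relies on scikit-learn rather than on code like this.

```python
import math


def gaussian_kernel(x, y, gamma=0.001):
    """Gaussian kernel g(x, y) = exp(-gamma * ||x - y||^2), as in (2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram_matrix(points, gamma=0.001):
    """Kernel matrix K with K[i][j] = g(x_i, x_j), as in (3)."""
    return [[gaussian_kernel(p, q, gamma) for q in points] for p in points]


def center_kernel(K):
    """Center K as (I - 1_N) K (I - 1_N), where 1_N has all entries 1/N.

    Entry-wise this subtracts the row mean and the column mean and adds
    back the overall mean of K.
    """
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]
```

After centering, every row and column of the matrix sums to zero, which is a quick sanity check before computing the eigenvectors in (5).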

5.3. Malicious App Detection. With the apps' network behavior profiles, we propose to use peer group analysis [38] to detect malicious apps. The main idea of the detection method is illustrated in Figure 5. An app's network behavior profile is compared with both its historical data and the profiles of its peer group for malware detection. Comparison with the historical profiles can help us find self-updating malicious apps [7], and comparison with the profiles of its peer group can detect repackaged malicious apps, as observed in Section 2.

For AppFA, an app's peer group is defined as the set of its behavior-similar apps. Apps are considered behavior-similar if they satisfy the following two conditions: (A) their main functionality is similar, for example, all are for mailing; (B) their network behavior profiles are similar. If either of the two conditions is not met, the peer group will be empty and we will only compare the app's profile with its historical data.


Figure 5: Illustration of the procedure of malicious app detection. The numbers in angle brackets are examples of features listed in Table 4. (Diagram: the profile of App1, <12, 100, 5000>, is compared with its historical data, <11, 110, 5020>, <10, 130, 4600>, <12, 100, 5000>, and <12, 101, 5010>, and with its peer group, App2 <20, 400, 6000>, App3 <30, 500, 7000>, App4 <25, 450, 6500>, and App5 <19, 370, 5580>. The profile is similar to the history, so App1 has not been injected with malicious code; the profile is significantly different from its peer group (behavior-similar apps), so App1 may be a fake app. The diagram also includes a step that updates the peer group of App1.)

Figure 6: Example of similar apps recommended by Google Play.

In practice, one can resort to app stores such as Google Play to find candidates that meet condition A. Figure 6 shows that, when viewing the details of an app, Google Play recommends similar ones. These similar apps are determined by several features, such as the category of the apps, keywords in the title and description, and the size of the apk (https://www.quora.com/How-does-the-Google-Play-Store-determine-similar-apps). Generally, these recommended apps belong to the same category and have similar functionality. Therefore, in order to determine the peer group of an app, we first use the similar apps recommended by app stores such as Google Play as candidates and then filter them by condition B.

For condition B, the Euclidean distance is used to measure the similarity of network behavior profiles. Suppose $A$ is the app to be analyzed and $B$ is an app satisfying condition A for $A$. The similarity of the network behavior profiles of $A$ and $B$ is calculated as

$$d(A, B) = \sqrt{\left(nBP(A) - nBP(B)\right) \left(nBP(A) - nBP(B)\right)^{T}}, \tag{8}$$

where $nBP$ is the abbreviation of $netBehaviorProfile$ defined in (1).

Once the similarities between $A$ and the recommended apps are calculated by (8), the results are sorted in order of increasing distance (i.e., decreasing similarity). The first $g$ apps are selected as the peer group members of $A$, as shown in (9). The optimal value of $g$ can be determined by experiments, as discussed in the next section. Note that the peer group members can be updated. For example, if one of the peer group members has been flagged as malicious, it will be removed from the peer group and a new one will be added. AppFA can also reselect the peer groups every time interval $T$.

$$peerGroup(A) := \{app_1, \ldots, app_g\}, \quad \text{where } d(A, app_i) \le d(A, app_j) \text{ if } i < j. \tag{9}$$
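The peer group selection can be illustrated with a short sketch: compute the Euclidean distance of (8) between the target profile and each candidate, sort by increasing distance, and keep the first $g$ candidates. The function names are hypothetical, the profiles are assumed to have equal length, and the sample numbers are the illustrative profiles from Figure 5 rather than measured data.

```python
import math


def profile_distance(p, q):
    """Euclidean distance between two equal-length profiles, as in (8)."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def select_peer_group(target_profile, candidates, g):
    """Return the names of the g candidates closest to the target profile.

    candidates maps a candidate app name to its profile; the result is
    ordered by increasing distance, i.e., decreasing similarity.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: profile_distance(target_profile, item[1]),
    )
    return [name for name, _ in ranked[:g]]


# Illustrative profiles taken from Figure 5.
app1 = [12, 100, 5000]
candidates = {
    "App2": [20, 400, 6000],
    "App3": [30, 500, 7000],
    "App4": [25, 450, 6500],
    "App5": [19, 370, 5580],
}
peers = select_peer_group(app1, candidates, g=2)  # the two closest candidates
```

Removing a flagged member and calling `select_peer_group` again on the remaining candidates is one simple way to realize the peer group update described above.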

As illustrated in Figures 2 and 5, for an identified app, AppFA first constructs its network behavior profile and then compares the profile with both its historical profiles and the profiles of its peer group for malicious app detection. Denote by $o$ the profile of the identified app $A$ and by $G$ the matrix of the compared profiles (historical profiles or peer group profiles). Each column of $G$ is one profile, and the column length is $l$ ($l = 1 + 2 * (f_o + f_i) + 2 * (k_1 + k_2)$). AppFA makes the feature vectors the same length by padding zeros.

For peer group analysis, $o$ and $G$ are first normalized by the min-max normalization procedure defined in

$$o_i = \frac{o_i}{\max\left(o_i, G_{i1}, G_{i2}, \ldots, G_{ig}\right)}, \qquad G_{ij} = \frac{G_{ij}}{\max\left(o_i, G_{i1}, G_{i2}, \ldots, G_{ig}\right)}, \qquad i = 1, 2, \ldots, l, \; j = 1, 2, \ldots, g. \tag{10}$$
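The zero padding and the normalization in (10) can be sketched as follows. This is an illustrative reading, not the authors' code: the function names are hypothetical, each peer profile is stored as a list (one column of $G$), features are assumed to be nonnegative, and a feature whose maximum is zero is left at zero, which is a choice of this sketch.

```python
def pad_profiles(o, peers):
    """Zero-pad the target profile and all peer profiles to a common length l."""
    l = max([len(o)] + [len(p) for p in peers])

    def pad(v):
        return list(v) + [0.0] * (l - len(v))

    return pad(o), [pad(p) for p in peers]


def max_normalize(o, peers):
    """Divide feature i of every profile by max(o_i, G_i1, ..., G_ig), as in (10)."""
    o, peers = pad_profiles(o, peers)
    o_norm = []
    peers_norm = [[] for _ in peers]
    for i in range(len(o)):
        denom = max([o[i]] + [p[i] for p in peers])
        if denom == 0:
            denom = 1.0  # all values are zero for this feature; keep them at zero
        o_norm.append(o[i] / denom)
        for k, p in enumerate(peers):
            peers_norm[k].append(p[i] / denom)
    return o_norm, peers_norm
```

After this step every feature lies in [0, 1], so features with large raw magnitudes (for example, byte counts) do not dominate the distance computed next.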

Then the distance between $o$ and $G$ is calculated by the Mahalanobis distance

$$\tau = \sqrt{(o - m)^{T} S^{-1} (o - m)}, \tag{11}$$

where $m$ is the weighted mean vector of profiles,

$$m_{i1} = \sum_{j=1}^{g} w_{Aj} G_{ij}, \quad 1 \le i \le l, \tag{12}$$

and $S$ is the covariance matrix, calculated as

$$S = E\left[(G - E[G])(G - E[G])^{T}\right]. \tag{13}$$

In (12), $w_{Aj}$ is the weight of the $j$th closest peer group member of the analyzed app $A$. The weights of the peer group members are obtained from their proximity to $A$. In detail, we define the proximity of the $j$th closest peer group member of the target app $A$ as

$$prox_{Aj} = \exp\left(-d(A, j)\right), \tag{14}$$

where $d(A, j)$ is the Euclidean distance, defined in (8), between app $A$ and its $j$th closest peer group member. Based on this proximity measure, the weight of the $j$th closest peer group member of app $A$ is defined as

$$w_{Aj} = \frac{prox_{Aj}}{\sum_{j=1}^{g} prox_{Aj}}. \tag{15}$$
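A brief sketch shows how the proximity weights of (14) and (15) and the weighted mean of (12) fit together. The function names are hypothetical, and the distances in the usage lines are made-up values chosen only to show how quickly the weights decay.

```python
import math


def peer_weights(distances):
    """Weights w_Aj = prox_Aj / sum_j prox_Aj with prox_Aj = exp(-d(A, j))."""
    prox = [math.exp(-d) for d in distances]
    total = sum(prox)
    return [p / total for p in prox]


def weighted_mean(peers, weights):
    """Weighted mean vector m with m_i = sum_j w_Aj * G_ij, as in (12).

    peers is a list of equal-length profiles (the columns of G).
    """
    length = len(peers[0])
    return [sum(w * p[i] for w, p in zip(weights, peers)) for i in range(length)]


# Made-up distances from app A to its three closest peers.
weights = peer_weights([0.0, 1.0, 2.0])
mean_profile = weighted_mean([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], weights)
```

Because the weights fall off exponentially with distance, the closest peer dominates the mean and each additional, more distant peer contributes little, which matches the weak dependence of the results on the peer group size $g$ reported in the evaluation.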

Finally, the state of the app is judged as follows:

$$\begin{cases} \tau > t_s, & \text{malicious app}, \\ \tau \le t_s, & \text{normal app}, \end{cases} \tag{16}$$

where $t_s$ is the threshold, which network administrators can set to different values based on the actual network conditions. We evaluated different values of $t_s$ in our experiments.
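The decision rule can be sketched end to end with the standard library: estimate the covariance of the compared profiles as in (13), compute the Mahalanobis distance of (11) by solving a linear system instead of inverting $S$, and apply the threshold of (16). The function names are hypothetical. The small `ridge` term added to the diagonal is an assumption of this sketch, introduced because the covariance estimated from only $g$ profiles can be singular; the paper does not state how that case is handled.

```python
import math


def covariance(profiles, ridge=1e-6):
    """Covariance matrix of the profiles (rows), plus a small ridge term."""
    g, l = len(profiles), len(profiles[0])
    mean = [sum(p[i] for p in profiles) / g for i in range(l)]
    S = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in profiles) / g
          for j in range(l)] for i in range(l)]
    for i in range(l):
        S[i][i] += ridge  # assumed regularization so that S is invertible
    return S


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        if abs(M[col][col]) < 1e-12:
            raise ValueError("matrix is singular")
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def mahalanobis(o, m, S):
    """tau = sqrt((o - m)^T S^{-1} (o - m)), as in (11)."""
    d = [a - b for a, b in zip(o, m)]
    x = solve(S, d)  # x = S^{-1} (o - m)
    return math.sqrt(max(0.0, sum(di * xi for di, xi in zip(d, x))))


def is_malicious(o, m, S, t_s=2.0):
    """Decision rule (16): malicious if tau > t_s, normal otherwise."""
    return mahalanobis(o, m, S) > t_s
```

Here `m` would be the weighted mean vector of (12) and `o` the normalized profile of the app under test; the default `t_s=2.0` mirrors the value used in most of the reported experiments.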

Figure 7: Experimental setup for the detection of mobile malicious apps. (Diagram: testing mobile devices (HUAWEI and MI phones) running the app samples connect to the Internet through an Ubuntu 16.04 server, whose functions are (i) access point, (ii) traffic collection, and (iii) data analysis.)

6. Evaluation

6.1. Data Collection

Experimental Setup. The experimental setup used for AppFA is shown in Figure 7. We have implemented a prototype of AppFA with the help of ourmon [39] and CCCG [40]. Ourmon is an open-source network monitoring and anomaly detection system, and CCCG is a general framework for constrained clustering. An Ubuntu 16.04 computer has been configured as the access point. For packet capturing, the smartphone is connected to the Internet by WiFi and network traffic is collected at the access point by tcpdump (http://www.tcpdump.org). Each pcap file is limited to 100 MB. When packet capture is complete, the TCP and UDP flows are split from the pcap files by SplitCap (https://www.netresec.com/?page=SplitCap). For the apps' traffic clustering and network behavior profile construction, we use tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to extract IP addresses, packet sizes, and packet interval times from network flows. After that, a Python program calculates the statistical features listed in Tables 3 and 4. The KPCA transformation is accomplished with the help of scikit-learn (http://scikit-learn.org). The data analysis is also completed on the Ubuntu 16.04 computer (with 4 GB memory and a Pentium Dual-Core CPU T4500).
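The feature extraction step can be sketched as a small parser over tshark's tab-separated field output. This is a hedged illustration only: it assumes each flow was exported with a command along the lines of `tshark -r flow.pcap -T fields -e ip.src -e ip.dst -e frame.len -e frame.time_delta`, and the statistics computed here (packet count, total bytes, mean and standard deviation of packet size, mean inter-packet gap) are stand-ins chosen for the example, not the feature set of Tables 3 and 4. The function name `flow_features` is hypothetical.

```python
import statistics


def flow_features(tshark_lines):
    """Compute simple per-flow statistics from tab-separated tshark output.

    Each line is expected to hold four fields:
    source IP, destination IP, frame length in bytes, time delta in seconds.
    """
    sizes, gaps = [], []
    for line in tshark_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip malformed or empty lines
        _src, _dst, length, delta = parts
        sizes.append(int(length))
        gaps.append(float(delta))
    if not sizes:
        raise ValueError("no packets found")
    return {
        "packets": len(sizes),
        "bytes": sum(sizes),
        "mean_size": statistics.fmean(sizes),
        "std_size": statistics.pstdev(sizes),
        "mean_gap": statistics.fmean(gaps),
    }
```

In a pipeline like the one described above, the resulting dictionaries would be turned into fixed-order feature vectors before clustering and profile construction.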

Traffic Generation. We use the public MalGenome [19] dataset in our experiments. Since we mainly focus on repackaging and updating malware, 93 typical information-collection malware samples, covering the repackaging and updating attack types, are selected from MalGenome to test the detection rate of AppFA (the malicious apps dataset was downloaded from http://www.malgenomeproject.org in 2015; the selected samples are the ones that run without any errors). We have installed all these malicious apps on a HUAWEI Honor 8 phone and a MI Note 4 phone and run these apps one by one 50 times to collect network traffic. In detail, these malware samples are analyzed and run by GroddDroid [41] to make sure that the malicious code is triggered. We also use GroddDroid to run the other malware samples besides the selected ones; their traffic is mainly used in the local detection described in Section 6.3. In particular, background traffic, including weather forecasting, email checking, QQ, and Webchat (QQ and Webchat are the most famous apps in China), is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

For testing the false positive rates of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware, the benign apps are run by droidbot [42] to make sure that enough traffic is generated. From the captured traffic, we take all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the Packet filter model is set to $-1$.

The data used in the experiments is summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware samples; thus, it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is exploited to get the ground truth of the originator of network traffic. Recall that, in Algorithm 1, app traffic clustering must be started with signature seeds. In order to obtain apps' signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of the cluster number $p$ in Algorithm 2 is set to 97 ($97 = 93 + 4$), since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and Webchat. The experimental results are shown in Table 6.

As shown in Table 6, when 5 HTTP flows are chosen for each app to generate signature seeds, only 601 of the 223,220 flows (about 0.27%) are misclustered. This demonstrates the accuracy of our proposed method for app traffic clustering.

Experimental Results of Malicious App Detection. Similar to previous work, in order to measure the effectiveness of malicious app detection, the accuracy (detection rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively.

$$\text{Accuracy} = \frac{TP}{TP + FN} \tag{17}$$

$$\text{False positive rate} = \frac{FP}{FP + TN} \tag{18}$$
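The two metrics in (17) and (18) can be written as a short Python sketch. The function names are illustrative; the first example uses the case reported for $p = 99$ in this section (82 of the 93 selected malicious apps detected), while the benign counts in the second example are hypothetical.

```python
def detection_rate(tp, fn):
    """Eq. (17): share of malicious apps that are flagged as malicious."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Eq. (18): share of benign apps that are wrongly flagged as malicious."""
    return fp / (fp + tn)

# 82 of 93 malicious apps detected, hence 11 missed.
rate_p99 = detection_rate(82, 11)
print(round(rate_p99, 3))  # 0.882

# Hypothetical counts: 2 false alarms among 500 benign test cases.
fpr_example = false_positive_rate(2, 498)
print(fpr_example)
```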

We first examine the detection rates and false positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). For the selected malicious apps, the apps with the same functionality (satisfying condition A) are determined by Google Play. We first analyze the malicious apps' functionalities manually and choose a keyword for each app. Then we search the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8 and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are good enough for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). [Plot: detection rate (0.5 to 1) against $t_s$ (0.4 to 2), with one curve each for $k_1 = k_2 = 1$, $4$, and $7$.]

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, when $g$ increases, the detection rates and the false positive rates change only slightly. This is because the Mahalanobis distance $\tau$ defined in (11) is calculated from the weighted mean vector of profiles: the most similar app has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.

Further, AppFA is also tested with different values of the cluster number $p$. Note that the actual cluster number is 97 for TrafficSet 1. In order to evaluate how app traffic clustering impacts the detection of repackaged malware, we slightly change the value of $p$; the experimental results are shown in Figure 12. In the experiments, we set $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. As shown in Figure 12, the cluster number indeed impacts the detection rate of AppFA. In particular, when $p = 99$, the detection rate is only 88%; namely, only 82 malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and $96$ in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA can cope with a few unknown apps when carrying out app traffic clustering.

Table 5: Data summary.

| Traffic set | Apps | Quantity | Number of collected flows (TCP and UDP) | Amount of data (bytes) |
| --- | --- | --- | --- | --- |
| TrafficSet 1 | Selected malicious apps (for testing accuracy of traffic clustering and detection rates) | 93 | 223220 | 5.38G |
| TrafficSet 2 | Benign apps (for testing false positive rates of AppFA) | 100 | 305103 | 6.40G |
| TrafficSet 3 | All malicious apps (for malicious apps detection in local networks) | 1260 | ≈3.1 million | 18.72G |

Table 6: Experimental results of traffic clustering.

| Number of HTTP flows for generating signature seeds | Number of flows correctly clustered | Number of misclustered flows |
| --- | --- | --- |
| 2 | 201357 | 21863 |
| 3 | 218946 | 4274 |
| 4 | 221384 | 1836 |
| 5 | 222619 | 601 |

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). [Plot: false positive rate (0.0005 to 0.004) against $t_s$ (0.4 to 2), with one curve each for $k_1 = k_2 = 1$, $4$, and $7$.]

Finally, we evaluate AppFA with the remaining malicious apps besides the selected 93 samples. The rest of the malicious apps contain repackaged and other types of malware. These samples were found to have some errors when run on our phones; however, they also produced considerable network traffic. We test AppFA with these apps to examine its general capability. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons are discussed in Section 6.4.

Figure 10: Detection rates of AppFA with different values of $g$. [Plot: detection rate against $g$ = 1, 3, 5, 7, 10.]

Figure 11: False positive rates of AppFA with different values of $g$. [Plot: false positive rate (axis scaled by $10^{-3}$) against $g$ = 1, 3, 5, 7, 10.]

Note that AppFA is designed to perform nearly real-time malicious app detection; thus, efficiency is also a big concern. Table 8 presents the computational performance of the major steps in terms of average running time using the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, such as cloud resources, can further be used to speed up malicious app detection and to scale to more Internet traffic data.

Figure 12: Detection rates of AppFA with different values of $p$. [Plot: detection rate against the value of $p$ (99, 98, 96, 95).]

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

| Detection rate | False positive rate |
| --- | --- |
| 73.4% | 3.5% |

Table 8: The computational performance.

| Procedure | Time |
| --- | --- |
| Constrained flow clustering (Algorithm 1) | 3 min |
| App identification (Algorithm 2) | 5.2 min |
| Network behavior profile construction | 1.7 min |
| Peer group analysis | 45 sec |

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses statistical traffic features to detect malicious Android apps. Thus, AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we have created a new Android repackaged malware that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot's source code and make sure its network connections are encrypted by the TLS protocol. Then we graft it into a popular game app, air.com.aceviral.motox3. Finally, we collect the network traffic of the repackaged air.com.aceviral.motox3 and its peer group. In the experiments, the selected peer group of the repackaged air.com.aceviral.motox3 is com.wordmobiles.bikeRacing, com.topfreegames.bikeracefreeworld, com.skgames.trafficrider, com.tomico.wheeliechallenge, and madmotoracingbike. We repeat the detection of the repackaged air.com.aceviral.motox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

Table 9: Experimental results when searching similar apps locally.

| Detection rate | False positive rate |
| --- | --- |
| 69.2% | 4.7% |

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps' peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed and shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities and that their context (keywords) may therefore be similar. Therefore, we use web-page searching technologies to match similar apps. The $Matching$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) with the results returned by Algorithm 3.
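One possible realization of Algorithm 3 is sketched below in Python, assuming that the $Matching$ function is the cosine similarity of TF-IDF vectors built from each app's keyword list. The smoothed IDF formula, the function names, and the sample keywords are assumptions made for illustration; the paper only states that TF-IDF can be used.

```python
import math
from collections import Counter

def tfidf_vectors(keyword_lists):
    """Build one TF-IDF vector (dict: term -> weight) per keyword list."""
    n = len(keyword_lists)
    df = Counter(term for kws in keyword_lists for term in set(kws))
    vectors = []
    for kws in keyword_lists:
        tf = Counter(kws)
        # Smoothed IDF (an assumption) keeps terms shared by all apps non-zero.
        vectors.append({t: (c / len(kws)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def search_similar_apps(tested_keywords, identified, m):
    """Return the m identified apps whose keywords best match the tested app."""
    names = list(identified)
    vectors = tfidf_vectors([tested_keywords] + [identified[name] for name in names])
    query, rest = vectors[0], vectors[1:]
    scores = [(cosine(query, vec), name) for name, vec in zip(names, rest)]
    scores.sort(key=lambda s: s[0], reverse=True)  # decreasing score, as in line (8)
    return [name for _, name in scores[:m]]

# Hypothetical keyword sets extracted from HTTP headers.
identified = {
    "bike_game": ["bike", "race", "score", "ads"],
    "weather": ["forecast", "city", "temperature"],
    "moto_game": ["moto", "race", "score", "level"],
}
result = search_similar_apps(["race", "score", "bike", "level"], identified, 2)
print(result)
```

In this toy example the weather app shares no keyword with the tested app, so its score is zero and only the two game apps are returned as peer group candidates.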

We have implemented the local similar app searching and evaluated AppFA's performance in local networks. All apps, including all malicious and benign apps, are considered as identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that the choice of proper peer groups does affect the final detection significantly.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be due to the fact that other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to the packet sizes and the number of flows, so the detection rate may decline if the apps' network behaviors are only slightly changed.

Compared to the latest work done in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9% as reported in [6]). However, Garg et al.'s method [6] considered only 18 malware apps and 14 genuine apps and assumed that the traffic generated by apps was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account and constrained clustering is exploited to determine to which app the flows belong. The accuracy of clustering is also a factor affecting the detection rate. To confirm this, we further make the same assumption as Garg et al.'s method [6]; namely, we assume that the accuracy of Algorithm 1 is 100% and that we know exactly which traffic is generated by which app (in our experiments, we use the app Packet Capture to obtain perfect ground truth of which flows came from which app). Under this assumption, the experimental results are given in Table 10. The detection rates are clearly improved.

Table 10: Experimental results when it is exactly known which traffic is generated by which app. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

| | Selected malicious apps | Remaining malicious apps |
| --- | --- | --- |
| Detection rate | 98.3% | 85% |
| False positive rate | 0.03% | 2.7% |

Algorithm 3: Searching similar apps locally.

(1) extract all plaintext keywords from the HTTP headers of the tested app $A$; denote the set of keywords as $W(A)$
(2) extract all plaintext keywords from the HTTP headers of the other identified apps; denote the set of keywords of identified app $i$ as $W(i)$, $i = 0, 1, \ldots, n - 1$
(3) $i = 0$; while $i < n$ do
(4)     $Scores[i].value = Matching(W(A), W(i))$
(5)     $Scores[i].app = i$
(6)     $i{+}{+}$
(7) end while
(8) sort $Scores[\,]$ by $Scores[i].value$ in decreasing order
(9) return the first $m$ entries $Scores[i].app$, $i = 0, 1, \ldots, m - 1$

In practice, one can deploy both our method and other methods such as [6, 15] simultaneously. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps, whereas [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. So our work can complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features (refer to Table 4) of network traffic to detect malicious repackaged Android applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. But this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus, the attackers must remove some normal network connections to keep the network behaviors the same and evade detection. However, the removal of normal network connections will impact the functionalities of apps and may cause errors. This gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may be abnormal as well. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then Kernel Principal Component Analysis is employed to construct the app network behavior profile and to distinguish minor traffic variations from significant differences. At last, we take advantage of peer group analysis to detect malicious apps, which avoids time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus, it is convenient to use and its cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The apps' network behaviors may be significantly changed by version updates and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and to detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61702282 and 61502250, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016, CX2016026).


References

[1] Android statistics, Number of android applications, September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, "Assessing Privacy Risks in Android: A User-Centric Approach," in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] Does your mobile anti-virus app protect or infect you? The truth behind DU Antivirus Security, September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, "A survey of mobile malware in the wild," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, "The evolution of android malware and android analysis techniques," ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, "Network-based detection of Android malicious apps," International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, "Mobile malware detection through analysis of deviations in application network behavior," Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, "PBMDS: a behavior-based malware detection system for cellphone devices," in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec '10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: behavior-based malware detection system for android," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, "MADAM: a multi-level anomaly detector for android malware," in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., "Vetting undesirable behaviors in android apps with permission use analysis," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, "Android Malware Clustering Analysis on Network-Level Behavior," in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, "Malware detection on mobile devices using distributed machine learning," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, "Bring your own device security issues and challenges," in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, "Simple and effective method for detecting abnormal internet behaviors of mobile devices," Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] Global mobile OS market share in sales to end users from 1st quarter 2009 to 1st quarter 2017, 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] "IT threat evolution Q3 2017 statistics," November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., "Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting," IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, "Dissecting android malware: characterization and evolution," in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, "AndroSimilar: Robust signature for detecting variants of Android malware," Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the AnserverBot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., "Android security: a survey of issues, malware penetration, and defenses," IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, "A survey on dynamic mobile malware detection," Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, "Behavior-based malware detection on mobile phone," in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, "Exposing mobile malware from the inside (or what is your mobile app really doing?)," Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, "The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones," in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, "Identifying diverse usage behaviors of smartphone apps," in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC '11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, "NetworkProfiler: Towards automatic fingerprinting of Android apps," in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., "Flowr: A self-learning system for classifying mobile application traffic," in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., "Automatic generation of mobile app signatures from traffic observations," in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., "Automatically identifying apps in mobile traffic," Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, "Identifying Mobile Applications for Encrypted Network Traffic," in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, "A novel active website fingerprinting attack against Tor anonymous system," in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, "A novel application classification attack against Tor," Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, "I know why you went to the clinic: Risks and realization of HTTPS traffic analysis," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, "Internet traffic classification using constrained clustering," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, "A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine," Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, "Stock fraud detection using peer group analysis," Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] "Ourmon network monitoring and anomaly detection system," April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, "Constrained Clustering Using Column Generation," in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, "GroddDroid: A gorilla for triggering malicious behaviors," in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, "DroidBot: A lightweight UI-guided test input generator for android," in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, "An information-theoretic perspective of tf-idf measures," Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.



Figure 5: Illustration of the procedure of malicious app detection. The numbers in angle brackets are examples of the features listed in Table 4. [Diagram: the profile of App1, $\langle 12, 100, 5000 \rangle$, is compared with its historical data ($\langle 11, 110, 5020 \rangle$, $\langle 10, 130, 4600 \rangle$, $\langle 12, 100, 5000 \rangle$, $\langle 12, 101, 5010 \rangle$) and with its peer group (App2 $\langle 20, 400, 6000 \rangle$, App3 $\langle 30, 500, 7000 \rangle$, App4 $\langle 25, 450, 6500 \rangle$, App5 $\langle 19, 370, 5580 \rangle$). If the profile is similar to the history, App1 has not been injected with malicious code; if the profile is significantly different from its peer group (behavior-similar apps), App1 may be a fake app. The peer group of App1 is updated after detection.]

Figure 6: Example of similar apps recommended by Google Play.

In practice, one can resort to app stores such as Google Play to find candidates that meet condition A. Figure 6 shows that, when viewing the details of an app, Google Play will recommend similar ones. These similar apps are determined by several features, such as the category of the apps, keywords in the title and description, and the size of the apk (https://www.quora.com/How-does-the-Google-Play-Store-determine-similar-apps). Generally, these recommended apps belong to the same category and have similar functionality. Therefore, in order to determine the peer group of an app, we first use the similar apps recommended by app stores such as Google Play as candidates and later filter them by condition B.

For condition B, the Euclidean distance is used to measure the similarity of network behavior profiles. Suppose $A$ is the app to be analyzed and $B$ is an app satisfying condition A for $A$. The similarity of the network behavior profiles of $A$ and $B$ is calculated as

$$d(A, B) = \sqrt{(nBP(A) - nBP(B))\,(nBP(A) - nBP(B))^{T}} \tag{8}$$

where $nBP$ is the abbreviation of $netBehaviorProfile$ defined in (1).

Once the similarities between $A$ and the recommended apps are calculated by (8), the results are sorted in the order of increasing distance (i.e., decreasing similarity). The first $g$ apps are selected as the peer group members of $A$, as shown in (9). The optimal value of $g$ can be determined by experiments, which is discussed in the next section. Note that the peer group members can be updated. For example, if one of the peer group members has been flagged as malicious, it is removed from the peer group and a new one is added. AppFA can also reselect the peer groups every time interval $T$.

$$\mathrm{peerGroup}(A) := \{\mathrm{app}_1, \ldots, \mathrm{app}_g\}, \quad \text{where } d(A, \mathrm{app}_i) \le d(A, \mathrm{app}_j) \text{ if } i < j \tag{9}$$
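The selection in (8) and (9) can be sketched in a few lines of Python. The function name is illustrative, and the profiles are the example feature vectors shown in Figure 5, used here without normalization purely for illustration.

```python
import math

def select_peer_group(profile_a, candidate_profiles, g):
    """Rank candidates by Euclidean distance (Eq. 8) and keep the g closest (Eq. 9)."""
    ranked = sorted(candidate_profiles.items(),
                    key=lambda item: math.dist(profile_a, item[1]))
    return [name for name, _ in ranked[:g]]

# Example profiles from Figure 5; App1 is the app under analysis.
candidates = {
    "App2": (20, 400, 6000),
    "App3": (30, 500, 7000),
    "App4": (25, 450, 6500),
    "App5": (19, 370, 5580),
}
peer_group = select_peer_group((12, 100, 5000), candidates, 2)
print(peer_group)  # ['App5', 'App2']
```

Note that the third feature dominates the raw distance in this example because of its larger scale, which is one reason a normalization step is applied before the profiles are compared in peer group analysis.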

As illustrated in Figures 2 and 5, for an identified app, AppFA first constructs its network behavior profile and then compares the profile with both its historical profiles and the profiles of its peer group for malicious app detection. Denote $o$ as the profile of the identified app $A$ and $G$ as the matrix of the compared profiles (historical profiles or peer group profiles). Each column of $G$ is one profile, and the column length is $l$ ($l = 1 + 2(f_o + f_i) + 2(k_1 + k_2)$). AppFA makes the feature vectors the same length by padding zeros.

For peer group analysis, $o$ and $G$ are first normalized by the min-max normalization procedure defined in

$$o_i = \frac{o_i}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \qquad G_{ij} = \frac{G_{ij}}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \qquad i = 1, 2, \ldots, l, \; j = 1, 2, \ldots, g. \tag{10}$$
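The normalization in (10) can be sketched as below. It divides each feature row by the largest value found in that row across $o$ and all $g$ peer profiles. The handling of a row whose maximum is zero (for example, a zero-padded feature) is an assumption of this sketch, as is the premise that feature values are nonnegative; the source does not state either.

```python
def normalize_by_row_max(o, G):
    """Scale o (length l) and G (l rows x g columns) feature by feature.

    Each feature i is divided by max(o[i], G[i][0], ..., G[i][g-1]),
    following (10). Rows whose maximum is 0 are left as zeros (assumption).
    """
    o_norm, G_norm = [], []
    for o_i, row in zip(o, G):
        row_max = max([o_i] + list(row))
        if row_max == 0:
            o_norm.append(0.0)
            G_norm.append([0.0] * len(row))
        else:
            o_norm.append(o_i / row_max)
            G_norm.append([x / row_max for x in row])
    return o_norm, G_norm


# Two features, two peer profiles (columns of G); the second feature is padding.
o_norm, G_norm = normalize_by_row_max([2.0, 0.0], [[4.0, 1.0], [0.0, 0.0]])
```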

Then the distance between $o$ and $G$ is calculated by the Mahalanobis distance

$$\tau = \sqrt{(o - m)^{T} S^{-1} (o - m)}, \tag{11}$$

where $m$ is the weighted mean vector of profiles,

$$m_{i1} = \sum_{j=1}^{g} w_{Aj} G_{ij}, \quad 1 \le i \le l, \tag{12}$$

and $S$ is the covariance matrix, calculated in

$$S = E\left[(G - E[G])(G - E[G])^{T}\right]. \tag{13}$$

In (12), $w_{Aj}$ is the weight of the $j$th closest peer group member of the analyzed app $A$. The weights of peer group members are obtained from their proximity to $A$. In detail, we define the proximity of the $j$th closest peer group member of the target app $A$ in

$$\mathrm{prox}_{Aj} = \exp(-d(A, j)), \tag{14}$$

where $d(A, j)$ is the Euclidean distance between app $A$ and its $j$th closest peer group member, defined in (8). Based on this proximity measure, the weight of the $j$th closest peer group member of app $A$ is defined in

$$w_{Aj} = \frac{\mathrm{prox}_{Aj}}{\sum_{j=1}^{g} \mathrm{prox}_{Aj}}. \tag{15}$$
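The weighting in (14) and (15) can be sketched as follows. The distances used here are hypothetical; the point of the example is that weights sum to one and decay exponentially with distance, so nearer peers dominate the weighted mean in (12).

```python
import math


def peer_weights(distances):
    """Equations (14)-(15): proximity exp(-d), normalized to sum to 1."""
    prox = [math.exp(-d) for d in distances]
    total = sum(prox)
    return [p / total for p in prox]


# Hypothetical distances d(A, j) for the three closest peers of app A.
weights = peer_weights([0.5, 1.0, 2.0])
```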

Finally, the state of apps is judged as follows:

$$\tau > t_s: \text{malicious app}, \qquad \tau \le t_s: \text{normal app}, \tag{16}$$

where $t_s$ is the threshold, which network administrators can set to different values based on the actual network conditions. We evaluated different values of $t_s$ in our experiment.
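A minimal pure-Python sketch of (11)-(13) and (16) is given below. Two choices are assumptions of this sketch and are not stated in the source: the covariance is normalized by $g$ (population form), and a small ridge term is added to the diagonal of $S$ so that it stays invertible when there are fewer peers than features. The linear solver is a plain Gauss-Jordan elimination used only to avoid external dependencies.

```python
import math


def weighted_mean(G, weights):
    """Equation (12): m_i = sum_j w_j * G[i][j]."""
    return [sum(w * x for w, x in zip(weights, row)) for row in G]


def covariance(G):
    """Equation (13): covariance of the l features over the g columns of G."""
    g = len(G[0])
    means = [sum(row) / g for row in G]
    centered = [[x - mu for x in row] for row, mu in zip(G, means)]
    return [[sum(a * b for a, b in zip(ri, rj)) / g for rj in centered]
            for ri in centered]


def solve(S, v):
    """Solve S x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [list(S[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def mahalanobis(o, G, weights, ridge=1e-6):
    """Equation (11), with a ridge term on S (assumption, see lead-in)."""
    m = weighted_mean(G, weights)
    S = covariance(G)
    for i in range(len(S)):
        S[i][i] += ridge
    diff = [a - b for a, b in zip(o, m)]
    x = solve(S, diff)
    return math.sqrt(max(0.0, sum(d * xi for d, xi in zip(diff, x))))


def classify(tau, t_s):
    """Equation (16): flag the app when tau exceeds the threshold."""
    return "malicious" if tau > t_s else "normal"


# Hypothetical data: 2 features (rows), 3 peer profiles (columns).
G = [[1.0, 2.0, 3.0], [1.0, 3.0, 2.0]]
w = [1 / 3, 1 / 3, 1 / 3]
tau_far = mahalanobis([3.0, 3.0], G, w)   # profile away from the peer mean
tau_near = mahalanobis([2.0, 2.0], G, w)  # profile at the peer mean
```

For these values the peer mean is (2, 2), so the second profile yields a distance near zero while the first lies about 1.41 away and exceeds a threshold of 1.2.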

Figure 7: Experimental setup for the detection of mobile malicious apps. The samples (apps) run on the testing mobile devices (HUAWEI and MI phones), which reach the Internet through an Ubuntu 16.04 server that functions as (i) access point, (ii) traffic collection, and (iii) data analysis.

6. Evaluation

6.1. Data Collection

Experimental Setup. The experimental setup used for AppFA is shown in Figure 7. We have implemented a prototype of AppFA with the help of ourmon [39] and CCCG [40]. Ourmon is an open-source network monitoring and anomaly detection system, and CCCG is a general framework for constrained clustering. An Ubuntu 16.04 computer has been configured as the access point. For packet capturing, the smartphone is connected to the Internet by WiFi, and network traffic is collected at the access point by tcpdump (http://www.tcpdump.org). Each pcap file is limited to 100 MB. When packet capture is complete, the TCP and UDP flows are split from the pcap files by SplitCap (https://www.netresec.com/?page=SplitCap). For apps’ traffic clustering and network behavior profile construction, we use tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to extract IP addresses, packet sizes, and packet interval times from network flows. After that, a Python program is written to calculate the statistical features listed in Tables 3 and 4. The KPCA transformation is accomplished with the help of scikit-learn (http://scikit-learn.org). The data analysis is also completed on the Ubuntu 16.04 computer (with 4 GB memory and a Pentium Dual-Core CPU T4500).
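The tshark extraction step might look roughly like the sketch below. The options `-r`, `-T fields`, and `-e` and the field names `ip.src`, `ip.dst`, `frame.len`, and `frame.time_delta` are standard tshark/Wireshark names, but the exact fields and the summary statistics used by the authors are not given in this section, so both are assumptions here. The sample output is hypothetical, and the command is only built, not executed.

```python
import statistics


def tshark_command(pcap_path):
    """Build a tshark call that prints one tab-separated line per packet."""
    return [
        "tshark", "-r", pcap_path,
        "-T", "fields",
        "-e", "ip.src", "-e", "ip.dst",
        "-e", "frame.len", "-e", "frame.time_delta",
    ]


def parse_packets(tshark_output):
    """Parse tab-separated tshark output into (src, dst, size, delta) tuples."""
    packets = []
    for line in tshark_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 4 or not fields[0]:
            continue  # skip non-IP or malformed lines
        src, dst, size, delta = fields
        packets.append((src, dst, int(size), float(delta)))
    return packets


# Hypothetical output for one flow; the third line has no IP layer.
sample = (
    "10.0.0.2\t93.184.216.34\t60\t0.000000\n"
    "10.0.0.2\t93.184.216.34\t1500\t0.012000\n"
    "\t\t42\t0.500000\n"
)
packets = parse_packets(sample)
mean_size = statistics.mean(p[2] for p in packets)
mean_interval = statistics.mean(p[3] for p in packets)
```

In a real run, the list returned by `tshark_command` could be passed to `subprocess.run(..., capture_output=True, text=True)` and its standard output fed to `parse_packets`.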

Traffic Generation. We use the public MalGenome [19] dataset in our experiments. Since we mainly focus on repackaging and updating malware, 93 typical information collection malware samples, including repackaging and updating attack types, are selected from MalGenome to test the detection rate of AppFA (the malicious apps dataset was downloaded from http://www.malgenomeproject.org in 2015; the selected samples are the ones that run without any errors). We have installed all these malicious apps on a HUAWEI Honor 8 and a MI Note 4 phone and run them one by one 50 times to collect network traffic. In detail, these malware samples are analyzed and run by GroddDroid [41] to make sure that the malicious code is triggered. We also use GroddDroid to run other malware besides the selected samples; their traffic is mainly used in the local detection described in Section 6.3. In particular, background traffic, including weather forecasting, email checking, QQ, and Webchat (QQ and Webchat are the most famous apps in China), is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

For testing the false positive rates of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware, the benign apps are run by droidbot [42] to make sure that enough traffic is generated. From the captured traffic, we take all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the packet filter model is set to $-1$.

The data used in the experiments is summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware; thus it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is exploited to get the ground truth of the originator of network traffic. Recall that, in Algorithm 1, app traffic clustering must be started with signature seeds. In order to obtain apps’ signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of the cluster number $p$ in Algorithm 2 is set to 97 (97 = 93 + 4), since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and Webchat. The experimental results are shown in Table 6.
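The seed extraction can be illustrated with the sketch below. It parses the header block of raw HTTP requests into key-value pairs and, as an assumed selection rule that is not taken from the source, keeps the pairs shared by all sampled flows of one app. The requests, host, and app name are hypothetical.

```python
def header_pairs(raw_request):
    """Return the set of (key, value) pairs in one HTTP request header block."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    pairs = set()
    for line in head.split("\r\n")[1:]:  # skip the request line
        key, sep, value = line.partition(":")
        if sep:
            pairs.add((key.strip().lower(), value.strip()))
    return pairs


def signature_seeds(raw_requests):
    """Pairs present in every sampled flow (assumed rule, for illustration)."""
    pair_sets = [header_pairs(r) for r in raw_requests]
    return set.intersection(*pair_sets) if pair_sets else set()


# Two hypothetical HTTP flows sampled from the same app.
req1 = ("GET /a HTTP/1.1\r\nHost: api.example.com\r\n"
        "User-Agent: DemoApp/1.0\r\nX-Session: 111\r\n\r\n")
req2 = ("GET /b HTTP/1.1\r\nHost: api.example.com\r\n"
        "User-Agent: DemoApp/1.0\r\nX-Session: 222\r\n\r\n")
seeds = signature_seeds([req1, req2])
```

Here the per-flow session header drops out, while the host and user agent remain as candidate seeds.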

As shown in Table 6, when 5 HTTP flows per app are chosen to generate signature seeds, only 601 of the 223220 flows are misclustered. This demonstrates the efficiency of our proposed method for app traffic clustering.
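As a quick check on these counts (a worked calculation from the figures quoted above, not an additional result), the misclustered share and the correctly clustered share with 5 seed flows are

$$\frac{601}{223220} \approx 0.0027 = 0.27\%, \qquad \frac{223220 - 601}{223220} = \frac{222619}{223220} \approx 0.9973 = 99.73\%.$$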

Experimental Results of Malicious App Detection. Similar to previous work, in order to measure the effectiveness of malicious app detection, the accuracy (detection rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively:

$$\text{Accuracy} = \frac{TP}{TP + FN}, \tag{17}$$

$$\text{False positive rate} = \frac{FP}{FP + TN}. \tag{18}$$
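The two metrics can be computed as in the short sketch below. The detection example uses the count reported later in this section for $p = 99$ (82 of 93 malicious apps detected); the false positive counts are hypothetical and serve only to exercise the formula.

```python
def detection_rate(tp, fn):
    """Equation (17): share of malicious apps that are flagged."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Equation (18): share of benign apps that are wrongly flagged."""
    return fp / (fp + tn)


# 82 of the 93 selected malicious apps detected, so 11 are missed.
rate_p99 = detection_rate(tp=82, fn=11)
# Hypothetical counts: 2 benign observations flagged out of 500.
fpr_example = false_positive_rate(fp=2, tn=498)
```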

We first examine the detection rates and false positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). For the selected malicious apps, the apps with the same functionality (satisfying condition A) are determined by Google Play. We first analyze the malicious apps’ functionalities manually and choose a keyword for each app. Then we search the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of the member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8, and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are good enough for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). The plot shows the detection rate (0.5 to 1) against $t_s$ (0.4 to 2) for $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, when $g$ increases, the detection rates and the false positive rates change only slightly. This can be explained by the fact that the Mahalanobis distance $\tau$ defined in (11) is calculated with the weighted mean vector of profiles. The most similar app has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.

Further, AppFA is also tested with different values of the cluster number $p$. Note that the actual cluster number is 97 for TrafficSet 1. In order to evaluate how app traffic clustering impacts the detection of repackaged malware, we slightly change the value of $p$; the experimental results are shown in Figure 12. In the experiments, we set $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. As shown in Figure 12, the cluster number indeed impacts the detection rate of AppFA. In particular, when $p = 99$, the detection rate is only 88%; namely, only 82 malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and 96 in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA remains competent when a few unknown apps are present during app traffic clustering.

Table 5: Data summary.

| Traffic set | Apps | Quantity | Number of collected flows (TCP and UDP) | Amount of data (bytes) |
|---|---|---|---|---|
| TrafficSet 1 | Selected malicious apps (for testing accuracy of traffic clustering and detection rates) | 93 | 223220 | 538G |
| TrafficSet 2 | Benign apps (for testing false positive rates of AppFA) | 100 | 305103 | 640G |
| TrafficSet 3 | All malicious apps (for malicious apps detection in local networks) | 1260 | ≈3.1 million | 1872G |

Table 6: Experimental results of traffic clustering.

| Number of HTTP flows for generating signature seeds | Number of flows correctly clustered | Number of misclustered flows |
|---|---|---|
| 2 | 201357 | 21863 |
| 3 | 218946 | 4274 |
| 4 | 221384 | 1836 |
| 5 | 222619 | 601 |

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). The plot shows the false positive rate (0.0005 to 0.004) against $t_s$ (0.4 to 2) for $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.

Finally, we evaluate AppFA with the remaining malicious apps besides the selected 93 samples. The rest of the malicious apps contain repackaged and other types of malware. These samples were found to have some errors when run on our phones; however, they still produced considerable network traffic. We test AppFA with these apps to examine its general capability. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons are discussed in Section 6.4.

Figure 10: Detection rates of AppFA with different values of $g$ ($g$ = 1, 3, 5, 7, 10).

Figure 11: False positive rates of AppFA with different values of $g$ ($g$ = 1, 3, 5, 7, 10).

Note that AppFA is designed to perform nearly real-time malicious app detection; thus efficiency is also a major concern. Table 8 presents the computational performance of the major steps in terms of average running time on the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, such as cloud resources, can be further used to speed up malicious app detection and scale up to more Internet traffic data.

Figure 12: Detection rates of AppFA with different values of $p$ ($p$ = 95, 96, 98, 99).

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

| Detection rate | False positive rate |
|---|---|
| 73.4% | 3.5% |

Table 8: The computational performance.

| Procedures | Time |
|---|---|
| Constrained flow clustering (Algorithm 1) | 3 min |
| App identification (Algorithm 2) | 5.2 min |
| Network behavior profile construction | 1.7 min |
| Peer group analysis | 45 sec |

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses traffic statistical features to detect malicious Android apps. Thus AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we have created a new Android repackaged malware that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot’s source code and make sure its network connections are encrypted by the TLS protocol. Then we graft it into a popular game app, air.com.aceviral.motox3. Finally, we collect the network traffic of the repackaged air.com.aceviral.motox3 and its peer group. In the experiments, the selected peer group of the repackaged air.com.aceviral.motox3 is com.wordmobiles.bikeRacing, com.topfreegames.bikeracefreeworld, com.skgames.trafficrider, com.tomico.wheeliechallenge, and mad.moto.racing.bike. We repeat the detection of the repackaged air.com.aceviral.motox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

Table 9: Experimental results when searching similar apps locally.

| Detection rate | False positive rate |
|---|---|
| 69.2% | 4.7% |

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps’ peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed and shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities, and their context (keywords) may be similar. Therefore, we use web-page searching technologies to match similar apps. The $Matching$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) from the results returned by Algorithm 3.
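One way to realize the matching step is sketched below in plain Python. The source only states that TF-IDF can be used; the smoothed IDF formula, the cosine similarity score, and the keyword lists are assumptions of this sketch, and the app names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a name to a keyword list; returns name -> {term: weight}."""
    n = len(docs)
    df = Counter(term for words in docs.values() for term in set(words))
    vectors = {}
    for name, words in docs.items():
        tf = Counter(words)
        vectors[name] = {
            t: (c / len(words)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def similar_apps(target_keywords, identified, m):
    """Score every identified app against the target and return the top m."""
    docs = dict(identified)
    docs["__target__"] = target_keywords
    vecs = tfidf_vectors(docs)
    scores = [(cosine(vecs["__target__"], vecs[name]), name)
              for name in identified]
    scores.sort(key=lambda s: s[0], reverse=True)
    return [name for _, name in scores[:m]]


# Hypothetical keyword sets W(i) extracted from HTTP headers.
identified = {
    "racer": ["bike", "race", "moto"],
    "weather": ["forecast", "rain"],
    "mail": ["inbox", "smtp", "game"],
}
top = similar_apps(["bike", "race", "game"], identified, m=2)
```

The returned names would then be filtered by the profile distance in (8) to form the peer group.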

We have implemented the local similar app searching and evaluated AppFA’s performance in local networks. All apps, including all malicious and benign apps, are considered as identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, for the 93 typical repackaged malware samples, the experimental results are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that proper peer groups do affect the final detection significantly.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be due to the fact that other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to the packet sizes and the number of flows. So the detection rate may decline if the apps’ network behaviors are only slightly changed.

Compared to the latest work done in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9% as reported in [6]). However, Garg et al.’s method [6] considered only 18 malware apps and 14 genuine apps and assumed that the traffic generated by apps was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account, and constrained clustering is exploited to determine to which app the flows belong. The accuracy of clustering is also a factor affecting the detection rate. To confirm this, we further make the same assumption as Garg et al.’s method [6]; namely, we assume the accuracy of Algorithm 1 is 100% and know exactly what traffic is generated by what app (in our experiments, we use the app Packet Capture to obtain perfect ground truth of which flows came from which app). Under this assumption, the experimental results are given in Table 10. Apparently, the detection rates are much improved.

(1) extract all plaintext keywords from HTTP headers of the tested app $A$. Denote the set of keywords as $W(A)$
(2) extract all plaintext keywords from HTTP headers of other identified apps. Denote the set of keywords of identified app $i$ as $W(i)$, $i = 0, 1, \ldots, n - 1$
(3) $i = 0$; while $i < n$ do
(4)   $Scores[i].value = Matching(W(A), W(i))$
(5)   $Scores[i].app = i$
(6)   $i{+}{+}$
(7) end while
(8) sort $Scores[]$ by $Scores[i].value$ in decreasing order
(9) return the first $m$ $Scores[i].app$, $i = 0, 1, \ldots, m - 1$

Algorithm 3: Searching similar apps locally.

Table 10: Experimental results when it is exactly known what traffic is generated by what app. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

| | Selected malicious apps | Remaining malicious apps |
|---|---|---|
| Detection rate | 98.3% | 85% |
| False positive rate | 0.03% | 2.7% |

In practice, one can deploy our method together with other methods such as [6, 15]. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps, whereas [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. So our work can complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features (refer to Table 4) of network traffic to detect Android malicious repackaged applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetectable. But this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus the attackers must remove some normal network connections to keep network behaviors the same and evade detection. However, the removal of normal network connections will impact the functionalities of apps and may cause errors. This gives additional clues to detect repackaged malware. Meanwhile, the disappearance of some network connections may be abnormal as well. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then Kernel Principal Component Analysis is employed to construct the app network behavior profile and distinguish minor traffic variations from significant differences. At last, we take advantage of peer group analysis to detect malicious apps and avoid time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus it is convenient to use and the cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The apps’ network behaviors may be significantly changed by version updates and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by National Natural Science Foundation of China under Grants 61702282 and 61502250, Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016, CX2016026).


References

[1] Android statistics, Number of android applications, September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, “Assessing Privacy Risks in Android: A User-Centric Approach,” in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] Does your mobile anti-virus app protect or infect you? The truth behind DU Antivirus Security, September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, “A survey of mobile malware in the wild,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, “The evolution of android malware and android analysis techniques,” ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, “Network-based detection of Android malicious apps,” International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, “Mobile malware detection through analysis of deviations in application network behavior,” Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, “PBMDS: a behavior-based malware detection system for cellphone devices,” in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec ’10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, “Crowdroid: behavior-based malware detection system for android,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, “MADAM: a multi-level anomaly detector for android malware,” in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., “Vetting undesirable behaviors in android apps with permission use analysis,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, “Android Malware Clustering Analysis on Network-Level Behavior,” in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, “Malware detection on mobile devices using distributed machine learning,” in Proceedings of the 20th International Conference on Pattern Recognition (ICPR ’10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, “Bring your own device security issues and challenges,” in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, “Simple and effective method for detecting abnormal internet behaviors of mobile devices,” Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] Global mobile OS market share in sales to end users from 1st quarter 2009 to 1st quarter 2017, 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] “IT threat evolution Q3 2017. Statistics,” November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., “Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, “Dissecting android malware: characterization and evolution,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, “AndroSimilar: Robust signature for detecting variants of Android malware,” Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the anserverbot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., “Android security: a survey of issues, malware penetration, and defenses,” IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, “A survey on dynamic mobile malware detection,” Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, “Behavior-based malware detection on mobile phone,” in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, “Exposing mobile malware from the inside (or what is your mobile app really doing?),” Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, “The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones,” in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, “Identifying diverse usage behaviors of smartphone apps,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC ’11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, “NetworkProfiler: Towards automatic fingerprinting of Android apps,” in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., “Flowr: A self-learning system for classifying mobile application traffic,” in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., “Automatic generation of mobile app signatures from traffic observations,” in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., “Automatically identifying apps in mobile traffic,” Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, “Identifying Mobile Applications for Encrypted Network Traffic,” in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, “A novel active website fingerprinting attack against Tor anonymous system,” in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, “A novel application classification attack against Tor,” Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, “I know why you went to the clinic: Risks and realization of HTTPS traffic analysis,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, “Internet traffic classification using constrained clustering,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, “A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine,” Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, “Stock fraud detection using peer group analysis,” Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] “Ourmon: network monitoring and anomaly detection system,” April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, “Constrained Clustering Using Column Generation,” in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, “GroddDroid: A gorilla for triggering malicious behaviors,” in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI-guided test input generator for android,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.

Security and Communication Networks 9

compares the profile with both its historical profiles and the profiles of its peer groups for malicious app detection. Denote $o$ as the profile of the identified app $A$ and $G$ as the matrix of the compared profiles (historical profiles or peer group profiles). Each column of $G$ is one profile, and the column length is $l$, where $l = 1 + 2(f_o + f_i) + 2(k_1 + k_2)$. AppFA makes the feature vectors the same length by padding zeros.

For peer group analysis, $o$ and $G$ are first normalized by the min-max normalization procedure defined in (10):

$$o_i = \frac{o_i}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \qquad G_{ij} = \frac{G_{ij}}{\max(o_i, G_{i1}, G_{i2}, \ldots, G_{ig})}, \qquad i = 1, 2, \ldots, l;\ j = 1, 2, \ldots, g. \tag{10}$$
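The normalization in (10) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `normalize_profiles` and the list-of-rows layout of $G$ are hypothetical, and leaving an all-zero feature row unchanged (which can occur after zero padding) is a choice made here to avoid division by zero.

```python
def normalize_profiles(o, G):
    """Scale each feature i of o and G by the row maximum, as in (10).

    o: list of l feature values for the analyzed app.
    G: list of l rows, each holding the g values of that feature
       for the compared profiles (one column per profile).
    """
    o_norm, G_norm = [], []
    for i in range(len(G)):
        row_max = max([o[i]] + list(G[i]))
        # Assumption: a zero maximum means the feature is all zeros
        # (for example after padding), so the row is left as it is.
        scale = row_max if row_max != 0 else 1.0
        o_norm.append(o[i] / scale)
        G_norm.append([value / scale for value in G[i]])
    return o_norm, G_norm


# Two features, two compared profiles.
o_n, G_n = normalize_profiles([2.0, 0.0], [[4.0, 1.0], [0.0, 0.0]])
```

After the call, every feature row of the target profile and of the compared profiles shares the same scale, which is what the distance computation that follows relies on.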

Then the distance between $o$ and $G$ is calculated by the Mahalanobis distance:

$$\tau = \sqrt{(o - m)^{T} S^{-1} (o - m)}, \tag{11}$$

where $m$ is the weighted mean vector of profiles,

$$m_{i1} = \sum_{j=1}^{g} w_{Aj} G_{ij}, \qquad 1 \le i \le l, \tag{12}$$

and $S$ is the covariance matrix, calculated in (13):

$$S = E\left[(G - E[G])(G - E[G])^{T}\right]. \tag{13}$$

In (12), $w_{Aj}$ is the weight of the $j$th closest peer group member of the analyzed app $A$. The weights of peer group members are obtained from their proximity to $A$. In detail, we define the proximity of the $j$th closest peer group member of the target app $A$ in (14):

$$\mathrm{prox}_{Aj} = \exp(-d(A, j)), \tag{14}$$

where $d(A, j)$ is the Euclidean distance between app $A$ and its $j$th closest peer group member, defined in (8). Based on the proximity measure defined above, the weight of the $j$th closest peer group member of app $A$ is defined in (15):

$$w_{Aj} = \frac{\mathrm{prox}_{Aj}}{\sum_{j=1}^{g} \mathrm{prox}_{Aj}}. \tag{15}$$
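As a rough illustration of (12), (14), and (15), the following Python sketch turns peer distances into normalized weights and then forms the weighted mean profile. The function names are hypothetical, and the distances are assumed to have been computed beforehand with (8).

```python
import math


def proximity_weights(distances):
    """Weights of the g peers from their distances to app A, as in (14)-(15)."""
    prox = [math.exp(-d) for d in distances]   # (14)
    total = sum(prox)
    return [p / total for p in prox]           # (15)


def weighted_mean(G, weights):
    """Weighted mean vector m of the peer profiles, as in (12).

    G holds one row per feature and one column per peer profile.
    """
    return [sum(w * value for w, value in zip(weights, row)) for row in G]


# A peer at distance 0 gets three times the weight of a peer at distance ln 3.
w = proximity_weights([0.0, math.log(3)])
m = weighted_mean([[1.0, 3.0], [2.0, 2.0]], w)
```

Because the proximity decays exponentially with distance, the closest peer dominates $m$, and peers added further away change the mean only a little.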

Finally, the state of apps is judged as follows:

$$\begin{cases} \tau > t_s, & \text{malicious app}, \\ \tau \le t_s, & \text{normal app}, \end{cases} \tag{16}$$

where $t_s$ is the threshold, which can be set to different values by network administrators based on the actual network conditions. We evaluated different values of $t_s$ in our experiments.
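The distance in (11), the covariance in (13), and the decision rule in (16) can be put together as in the sketch below. Several details are assumptions made for illustration only: the covariance is estimated as the plain sample covariance over the columns of $G$, a small ridge term is added to the diagonal because that matrix is singular whenever there are fewer profiles than features, and the linear system is solved with a simple Gaussian elimination instead of an explicit inverse. All function names are hypothetical.

```python
import math


def covariance(G):
    """Sample covariance of the profiles in G (rows: features, columns: profiles)."""
    l, g = len(G), len(G[0])
    mean = [sum(row) / g for row in G]
    return [[sum((G[a][j] - mean[a]) * (G[b][j] - mean[b]) for j in range(g)) / g
             for b in range(l)] for a in range(l)]


def solve(S, v):
    """Solve S x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [list(S[r]) + [v[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("singular matrix; use a larger ridge term")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def mahalanobis(o, m, S, ridge=1e-6):
    """Distance tau of (11); `ridge` is an assumption, not part of the paper."""
    diff = [a - b for a, b in zip(o, m)]
    S_reg = [[S[a][b] + (ridge if a == b else 0.0) for b in range(len(S))]
             for a in range(len(S))]
    x = solve(S_reg, diff)                      # x = S^{-1} (o - m)
    return math.sqrt(sum(d * xi for d, xi in zip(diff, x)))


def is_malicious(tau, ts):
    """Decision rule (16): flag the app when tau exceeds the threshold."""
    return tau > ts


tau = mahalanobis([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], ridge=0.0)
flagged = is_malicious(tau, ts=2.0)
```

With an identity covariance the Mahalanobis distance reduces to the Euclidean distance, which makes the example easy to check by hand.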

Figure 7: Experimental setup for the detection of mobile malicious apps. [Diagram: testing mobile devices (HUAWEI and MI phones) running the sample apps reach the Internet through an Ubuntu 16.04 server whose functions are (i) access point, (ii) traffic collection, and (iii) data analysis.]

6. Evaluation

6.1. Data Collection

Experimental Setup. The experimental setup used for AppFA is shown in Figure 7. We have implemented a prototype of AppFA with the help of ourmon [39] and CCCG [40]. Ourmon is an open-source network monitoring and anomaly detection system, and CCCG is a general framework for constrained clustering. An Ubuntu 16.04 computer has been configured as the access point. For packet capturing, the smartphone is connected to the Internet by Wi-Fi, and network traffic is collected at the access point by tcpdump (http://www.tcpdump.org). Each pcap file is limited to 100 MB. When packet capture is complete, the TCP and UDP flows are split from the pcap files by SplitCap (https://www.netresec.com/?page=SplitCap). For apps’ traffic clustering and network behavior profile construction, we use tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to extract IP addresses, packet sizes, and packet interval times from network flows. After that, a Python program calculates the statistical features listed in Tables 3 and 4. The KPCA transformation is accomplished with the help of scikit-learn (http://scikit-learn.org). The data analysis is also completed on the Ubuntu 16.04 computer (with 4 GB memory and a Pentium Dual-Core T4500 CPU).
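The tshark extraction step described above might look like the following sketch. It is not the authors' script: the chosen field list, the comma separator, and the helper names are assumptions, and the example assumes one pcap file per flow (as produced by a flow splitter) so that `frame.time_delta` gives the packet interval within that flow.

```python
import subprocess

# Fields assumed here: addresses, packet size, and time since the previous packet.
TSHARK_FIELDS = ["ip.src", "ip.dst", "frame.len", "frame.time_delta"]


def tshark_command(pcap_path):
    """Build a tshark call that prints one comma-separated line per packet."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields", "-E", "separator=,"]
    for field in TSHARK_FIELDS:
        cmd += ["-e", field]
    return cmd


def parse_fields(text):
    """Turn tshark field output into a list of per-packet dictionaries."""
    packets = []
    for line in text.splitlines():
        parts = line.strip().split(",")
        if len(parts) != 4 or not parts[0]:
            continue  # skip non-IP packets and malformed lines
        src, dst, size, delta = parts
        packets.append({"src": src, "dst": dst,
                        "size": int(size), "interval": float(delta)})
    return packets


def extract_packets(pcap_path):
    """Run tshark on one flow file and parse its output (requires tshark)."""
    result = subprocess.run(tshark_command(pcap_path),
                            capture_output=True, text=True, check=True)
    return parse_fields(result.stdout)


sample = "10.0.0.2,8.8.8.8,60,0.000000000\n8.8.8.8,10.0.0.2,1500,0.012\n,,42,0.5\n"
packets = parse_fields(sample)
```

Keeping the parsing separate from the tshark call makes it possible to check the feature extraction on recorded output without capturing traffic again.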

Traffic Generation. We use the public MalGenome [19] dataset in our experiments. Since we mainly focus on repackaging and updating malware, 93 typical information-collecting malware samples, including repackaging and updating attack types, are selected from MalGenome to test the detection rate of AppFA (the malicious apps dataset was downloaded from http://www.malgenomeproject.org in 2015; the selected samples are the ones that run without any errors). We installed all these malicious apps on a HUAWEI Honor 8 phone and a MI Note 4 phone and ran them one by one 50 times to collect network traffic. In detail, these samples are analyzed and run by GroddDroid [41] to make sure that the malicious code is triggered. We also use GroddDroid to run other malware besides the selected samples, and their traffic is mainly used in the local detection described in Section 6.3. In particular, background traffic, including weather forecasting, email checking, QQ, and Webchat (QQ and Webchat are the most famous apps in China), is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

For testing the false positive rates of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware, the benign apps are run by DroidBot [42] to make sure that enough traffic is generated. From the captured traffic, we take all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the Packet filter model is set to $-1$.

The data used in the experiments is summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware; thus it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is exploited to get the ground truth of the originator of network traffic. Recall that, in Algorithm 1, app traffic clustering must be started with signature seeds. In order to obtain apps’ signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of the cluster number $p$ in Algorithm 2 is set to 97 (97 = 93 + 4), since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and Webchat. The experimental results are shown in Table 6.

As shown in Table 6, for each app, when 5 HTTP flows are chosen to generate signature seeds, only 601 flows (out of 223,220 flows in total, about 0.27%) are misclustered. This demonstrates the effectiveness of our proposed method for app traffic clustering.
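The clustering accuracy implied by Table 6 can be recomputed directly from the reported counts. The short sketch below uses the misclustered-flow counts from Table 6 and the total of 223,220 flows in TrafficSet 1; the variable and function names are hypothetical.

```python
TOTAL_FLOWS = 223220  # flows in TrafficSet 1

# Number of HTTP flows used as signature seeds -> misclustered flows (Table 6).
MISCLUSTERED = {2: 21863, 3: 4274, 4: 1836, 5: 601}


def clustering_accuracy(misclustered, total=TOTAL_FLOWS):
    """Fraction of flows assigned to the correct app."""
    return (total - misclustered) / total


accuracy_by_seed_count = {seeds: clustering_accuracy(count)
                          for seeds, count in MISCLUSTERED.items()}
```

The accuracy rises with every additional seed flow, from roughly 90% with 2 seed flows to above 99.7% with 5.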

Experimental Results of Malicious App Detection. Similar to previous work, in order to measure the effectiveness of malicious app detection, the accuracy (detection rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively:

$$\text{Accuracy} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{17}$$

$$\text{False positive rate} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}. \tag{18}$$
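For reference, (17) and (18) translate into the two small Python functions below. The function names are hypothetical; the example figure of 82 detected samples out of 93 is the case this section reports for $p = 99$.

```python
def detection_rate(tp, fn):
    """Accuracy (detection rate) of (17): detected malicious apps over all malicious apps."""
    if tp + fn == 0:
        raise ValueError("no malicious samples")
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """False positive rate of (18): benign apps flagged over all benign apps."""
    if fp + tn == 0:
        raise ValueError("no benign samples")
    return fp / (fp + tn)


# 82 of the 93 selected malicious apps detected, 11 missed.
rate = detection_rate(tp=82, fn=11)
```

Note that the accuracy defined in (17) is the true positive rate, so it is computed over the malicious samples only, while (18) is computed over the benign samples only.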

We first examine the detection rates and false positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). For the selected malicious apps, the apps with the same functionality (satisfying condition A) are determined by Google Play. We first analyze the malicious apps’ functionalities manually and choose a keyword for each app. Then we search the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of the member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8, and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are good enough for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). [Plot: detection rate (0.5 to 1) versus $t_s$ (0.4 to 2), with one curve each for $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.]

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, when $g$ increases, the detection rates and the false positive rates change only slightly. This can be explained by the fact that the Mahalanobis distance $\tau$ defined in (11) is calculated with the weighted mean vector of profiles: the most similar app has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.

Further, AppFA with different values of the cluster number $p$ is also tested. Note that the actual cluster number is 97 for TrafficSet 1. In order to evaluate how app traffic clustering impacts the detection of repackaged malware, we slightly change the value of $p$; the experimental results are shown in Figure 12. In the experiments, we set $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. As shown in Figure 12, the cluster number indeed impacts the detection rate of AppFA. In particular, when $p = 99$, the detection rate is only 88%; namely, only 82 malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and $96$ in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA remains competent in the presence of a few unknown apps when carrying out app traffic clustering.

Table 5: Data summary.

| Traffic set | Apps | Quantity | Number of collected flows (TCP and UDP) | Amount of data (bytes) |
| --- | --- | --- | --- | --- |
| TrafficSet 1 | Selected malicious apps (for testing accuracy of traffic clustering and detection rates) | 93 | 223220 | 538G |
| TrafficSet 2 | Benign apps (for testing false positive rates of AppFA) | 100 | 305103 | 640G |
| TrafficSet 3 | All malicious apps (for malicious apps detection in local networks) | 1260 | ≈3.1 million | 1872G |

Table 6: Experimental results of traffic clustering.

| Number of HTTP flows for generating signature seeds | Number of flows correctly clustered | Number of misclustered flows |
| --- | --- | --- |
| 2 | 201357 | 21863 |
| 3 | 218946 | 4274 |
| 4 | 221384 | 1836 |
| 5 | 222619 | 601 |

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). [Plot: false positive rate (0.0005 to 0.004) versus $t_s$ (0.4 to 2), with one curve each for $k_1 = k_2 = 1$, $k_1 = k_2 = 4$, and $k_1 = k_2 = 7$.]

Finally, we evaluate AppFA with the remaining malicious apps besides the selected 93 samples. The rest of the malicious apps contain repackaged and other types of malware. These samples were found to have some errors when run on our phones; however, they also produced considerable network traffic. We test AppFA with these apps to examine the general capability of AppFA. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons will be discussed in Section 6.4.

Figure 10: Detection rates of AppFA with different values of $g$. [Plot: detection rate (0 to 1) for $g$ = 1, 3, 5, 7, and 10.]

Figure 11: False positive rates of AppFA with different values of $g$. [Plot: false positive rate (0 to 7 × 10⁻³) for $g$ = 1, 3, 5, 7, and 10.]

Note that AppFA is designed to perform nearly real-time malicious app detection; thus efficiency is also a big concern. Table 8 presents the computational performance of the major steps in terms of average running time using the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, such as cloud resources, can be further used to speed up malicious app detection and scale up to more Internet traffic data.

Figure 12: Detection rates of AppFA with different values of $p$. [Plot: detection rate (0 to 1) for $p$ = 95, 96, 98, and 99.]

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

| Detection rate | False positive rate |
| --- | --- |
| 73.4% | 3.5% |

Table 8: The computational performance.

| Procedures | Time |
| --- | --- |
| Constrained flow clustering (Algorithm 1) | 3min |
| App identification (Algorithm 2) | 52min |
| Network behavior profile construction | 17min |
| Peer group analysis | 45 sec |

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses traffic statistical features to detect malicious Android apps. Thus AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we have created a new Android repackaged malware sample that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot’s source code and make sure its network connections are encrypted by the TLS protocol. Then we graft it into a popular game app, aircomaceviralmotox3. Finally, we collect network traffic of the repackaged aircomaceviralmotox3 and its peer group. In the experiments, the selected peer group of the repackaged aircomaceviralmotox3 is comwordmobilesbikeRacing, comtopfreegamesbikeracefreeworld, comskgamestrafficrider, comtomicowheel-iechallenge, and madmotoracingbike. We repeat the detection of the repackaged aircomaceviralmotox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

Table 9: Experimental results when searching similar apps locally.

| Detection rate | False positive rate |
| --- | --- |
| 69.2% | 4.7% |

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps’ peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed, as shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities, so their context (keywords) may be similar as well. Therefore, we use web-page searching technologies to match similar apps. The $Matching$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) with the results returned by Algorithm 3.
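One possible way to realize the matching step of Algorithm 3 is sketched below in Python. It is an illustration under assumptions rather than the authors' code: the smoothed inverse document frequency, the cosine similarity used as the matching score, and the names `tfidf_vectors`, `cosine`, and `search_similar` are all choices made here, and the keyword lists in the example are invented.

```python
import math
from collections import Counter


def tfidf_vectors(keywords):
    """Map each app to a sparse TF-IDF vector built from its keyword list."""
    n = len(keywords)
    df = Counter(term for words in keywords.values() for term in set(words))
    vectors = {}
    for app, words in keywords.items():
        tf = Counter(words)
        vectors[app] = {
            # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1.
            term: (count / len(words)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search_similar(target, keywords, m):
    """Return the m identified apps whose keywords best match the target app."""
    vectors = tfidf_vectors(keywords)
    scores = [(cosine(vectors[target], vec), app)
              for app, vec in vectors.items() if app != target]
    scores.sort(key=lambda item: item[0], reverse=True)  # decreasing score
    return [app for _, app in scores[:m]]


# Hypothetical keyword sets extracted from HTTP headers.
identified = {
    "A": ["bike", "race", "game"],
    "B": ["bike", "race", "moto"],
    "C": ["weather", "forecast", "city"],
    "D": ["mail", "inbox", "race"],
}
peers = search_similar("A", identified, m=2)
```

The returned candidates would then be filtered by (8) to form the peer group, as in the online setting.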

We have implemented the local similar app searching and evaluated AppFA’s performance in local networks. All apps, including all malicious and benign apps, are considered as identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that proper peer groups do affect the final detection significantly.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be because other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to the packet sizes and the number of flows. So the detection rate may decline if the apps’ network behaviors are only slightly changed.

Algorithm 3: Searching similar apps locally.

(1) extract all plaintext keywords from HTTP headers of the tested app $A$. Denote the set of keywords as $W(A)$
(2) extract all plaintext keywords from HTTP headers of other identified apps. Denote the set of keywords of identified app $i$ as $W(i)$, $i = 0, 1, \ldots, n - 1$
(3) $i = 0$; while $i < n$ do
(4) $Scores[i].value = Matching(W(A), W(i))$
(5) $Scores[i].app = i$
(6) $i{+}{+}$
(7) end while
(8) sort $Scores[]$ by $Scores[i].value$ in decreasing order
(9) return the first $m$ entries $Scores[i].app$, $i = 0, 1, \ldots, m - 1$

Compared to the latest work done in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9% as reported in [6]). However, Garg et al.’s method [6] considered only 18 malware apps and 14 genuine apps and assumed that the traffic generated by apps was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account, and constrained clustering is exploited to determine to which app the flows belong. The accuracy of clustering is also a factor affecting the detection rate. To confirm this, we further make the same assumption as Garg et al.’s method [6]; namely, we assume the accuracy of Algorithm 1 is 100% and know exactly what traffic is generated by what app (in our experiments, we use the Packet Capture app to obtain perfect ground truth of which flows came from which app). The experimental results under this assumption are given in Table 10. The detection rates are clearly improved.

Table 10: Experimental results when it is known exactly what traffic is generated by what app. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

| | Selected malicious apps | Remaining malicious apps |
| --- | --- | --- |
| Detection rate | 98.3% | 85% |
| False positive rate | 0.03% | 2.7% |

In practice, one can deploy both our method and other methods such as [6, 15] simultaneously. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps, whereas [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. So our work can complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features (refer to Table 4) of network traffic to detect malicious repackaged Android applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected, but this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus the attackers must remove some normal network connections to keep the network behaviors the same and evade detection. However, the removal of normal network connections will impact the functionalities of apps and may cause errors. This gives additional clues to detect repackaged malware. Meanwhile, the disappearance of some network connections may be abnormal as well. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then Kernel Principal Component Analysis is employed to construct the app network behavior profile and distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps and avoid time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus it is convenient to use and the cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The apps’ network behaviors may be significantly changed by version updates and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61702282 and 61502250, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016, CX2016026).

References

[1] Android statistics: Number of Android applications, September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, “Assessing Privacy Risks in Android: A User-Centric Approach,” in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] Does your mobile anti-virus app protect or infect you? The truth behind DU Antivirus Security, September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, “A survey of mobile malware in the wild,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, “The evolution of android malware and android analysis techniques,” ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, “Network-based detection of Android malicious apps,” International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, “Mobile malware detection through analysis of deviations in application network behavior,” Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, “PBMDS: a behavior-based malware detection system for cellphone devices,” in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec ’10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, “Crowdroid: behavior-based malware detection system for android,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, “MADAM: a multi-level anomaly detector for android malware,” in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., “Vetting undesirable behaviors in android apps with permission use analysis,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, “Android Malware Clustering Analysis on Network-Level Behavior,” in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, “Malware detection on mobile devices using distributed machine learning,” in Proceedings of the 20th International Conference on Pattern Recognition (ICPR ’10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, “Bring your own device security issues and challenges,” in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, “Simple and effective method for detecting abnormal internet behaviors of mobile devices,” Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] Global mobile OS market share in sales to end users from 1st quarter 2009 to 1st quarter 2017, 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] “IT threat evolution Q3 2017 statistics,” November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., “Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, “Dissecting android malware: characterization and evolution,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, “AndroSimilar: Robust signature for detecting variants of Android malware,” Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the AnserverBot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., “Android security: a survey of issues, malware penetration, and defenses,” IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, “A survey on dynamic mobile malware detection,” Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, “Behavior-based malware detection on mobile phone,” in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, “Exposing mobile malware from the inside (or what is your mobile app really doing?),” Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, “The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones,” in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, “Identifying diverse usage behaviors of smartphone apps,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC ’11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, “NetworkProfiler: Towards automatic fingerprinting of Android apps,” in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., “Flowr: A self-learning system for classifying mobile application traffic,” in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., “Automatic generation of mobile app signatures from traffic observations,” in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., “Automatically identifying apps in mobile traffic,” Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, “Identifying Mobile Applications for Encrypted Network Traffic,” in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, “A novel active website fingerprinting attack against Tor anonymous system,” in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, “A novel application classification attack against Tor,” Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, “I know why you went to the clinic: Risks and realization of HTTPS traffic analysis,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, “Internet traffic classification using constrained clustering,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, “A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine,” Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, “Stock fraud detection using peer group analysis,” Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] “Ourmon: network monitoring and anomaly detection system,” April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, “Constrained Clustering Using Column Generation,” in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, “GroddDroid: A gorilla for triggering malicious behaviors,” in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI-guided test input generator for android,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.


Section 6.3. In particular, background traffic, including weather forecasting, email checking, QQ, and Webchat (QQ and Webchat are the most famous apps in China), is allowed in our experiments. The ground truth of the originator of network traffic is determined by Packet Capture (https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture&hl=zh) to test the accuracy of app traffic clustering.

For testing the false positive rates of AppFA, we rely on the GooglePlayAppsCrawler.py project to identify the 100 most popular free apps. These 100 free apps have also been run one by one 50 times. Unlike the malware samples, the benign apps are driven by DroidBot [42] to make sure that they generate enough traffic. From the captured traffic, we take all nonzero packets for app traffic clustering and network behavior profile construction. Namely, in our experiments, the value of $n$ in the packet filter model is set to $-1$.

The data used in the experiments is summarized in Table 5. TrafficSet 1 consists of network traffic generated by the selected 93 malware samples, and TrafficSet 2 consists of network traffic of benign apps. TrafficSet 3 is the traffic of all malware samples; thus it includes TrafficSet 1. TrafficSet 3 is mainly used for malicious app detection in local networks.

6.2. Experimental Results

Accuracy of Traffic Clustering. TrafficSet 1 is used for testing the accuracy of app traffic clustering, and Packet Capture is used to get the ground truth of the originator of network traffic. Recall that, in Algorithm 1, app traffic clustering must be started with signature seeds. In order to obtain the apps' signature seeds, we randomly choose several HTTP flows for each app and extract the key-value pairs in their HTTP headers as signature seeds, as described in Section 4. The value of the cluster number $p$ in Algorithm 2 is set to 97 (97 = 93 + 4) since there are 93 malware samples and 4 background apps, that is, weather forecasting, email checking, QQ, and Webchat. The experimental results are shown in Table 6.
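The seed extraction step can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function names are hypothetical, and keeping only the key-value pairs shared by all sampled flows of an app is our simplifying assumption, since the exact seed selection rule is given in Section 4.

```python
def header_key_values(raw_request):
    """Collect (key, value) pairs from the header lines of one HTTP request."""
    pairs = set()
    lines = raw_request.split("\r\n")
    for line in lines[1:]:            # skip the request line (e.g. "GET / HTTP/1.1")
        if not line:                  # a blank line ends the header block
            break
        key, sep, value = line.partition(":")
        if sep:                       # ignore malformed lines without a colon
            pairs.add((key.strip().lower(), value.strip()))
    return pairs


def signature_seeds(sampled_flows):
    """Assumed rule: keep the pairs that occur in every sampled flow of an app."""
    seeds = None
    for raw in sampled_flows:
        pairs = header_key_values(raw)
        seeds = pairs if seeds is None else seeds & pairs
    return seeds or set()
```

With two sampled requests that share the same Host and User-Agent values but differ in another header, only the two shared pairs are kept as seeds.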

As shown in Table 6, when 5 HTTP flows are chosen for each app to generate signature seeds, only 601 flows (out of 223220 flows in all) are misclustered. This shows the effectiveness of our proposed method for app traffic clustering.
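The figures in Table 6 can be turned into misclustering rates with a few lines of Python. The sketch below only restates the table; the row values are read from the flattened table under the assumption that each row sums to the 223220 flows of TrafficSet 1, which the assertion checks. With 5 seed flows per app, the rate is about 0.27%.

```python
TOTAL_FLOWS = 223220

# seed flows per app -> (correctly clustered flows, misclustered flows), from Table 6
TABLE_6 = {
    2: (201357, 21863),
    3: (218946, 4274),
    4: (221384, 1836),
    5: (222619, 601),
}


def misclustering_rate(correct, wrong):
    """Fraction of flows assigned to the wrong app."""
    return wrong / (correct + wrong)


for seed_flows, (correct, wrong) in TABLE_6.items():
    # every row must account for all flows of TrafficSet 1
    assert correct + wrong == TOTAL_FLOWS
    print(seed_flows, f"{100 * misclustering_rate(correct, wrong):.2f}%")
```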

Experimental Results of Malicious App Detection. Similar to previous work, in order to measure the effectiveness of malicious app detection, the accuracy (detection rate) and false positive rate metrics are defined in (17) and (18), where TP, FN, FP, and TN stand for true positive, false negative, false positive, and true negative, respectively:

$$\text{Accuracy} = \frac{TP}{TP + FN}, \tag{17}$$

$$\text{False positive rate} = \frac{FP}{FP + TN}. \tag{18}$$
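As a small illustration of (17) and (18), the following Python sketch computes both metrics from raw counts. The function names are ours. The first example uses the case reported for $p = 99$ (82 of the 93 selected malicious apps detected); the benign counts in the second example are hypothetical and serve only to show the formula.

```python
def detection_rate(tp, fn):
    """Equation (17): share of malicious apps that are detected."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Equation (18): share of benign apps that are wrongly flagged."""
    return fp / (fp + tn)


# 82 of 93 malicious apps detected, so 11 are missed
print(f"detection rate: {detection_rate(82, 11):.2f}")

# hypothetical counts: 1 benign run flagged out of 250 benign runs
print(f"false positive rate: {false_positive_rate(1, 249):.4f}")
```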

We first examine the detection rates and false positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). For the selected malicious apps, the apps with the same functionality (satisfying condition A) are determined by Google Play. We first analyze the malicious apps' functionalities manually and choose a keyword for each app. Then we search for the keyword in Google Play and use the returned apps as candidates. After that, the peer group is determined by (8) and (9). In our experiments, the largest value of the member count $g$ is set to 10. Note that $t_s$ is the threshold defined in (16) and $k_i$ is the number of KPCA transformations defined in Table 4. We set the member count $g$ in peer group analysis to 5; namely, for each tested app, its top 5 similar apps are selected and analyzed. The detection rate is shown in Figure 8 and the false positive rate is depicted in Figure 9. As illustrated in Figures 8 and 9, the first 4 KPCA transformations of packet sizes and intervals are good enough for constructing the network behavior profile. When $k_1 = k_2 = 4$ and $t_s > 1.2$, the detection rate is higher than 90% and the false positive rate is lower than 0.4%. The detection rate is as high as 97% when $t_s = 2$. The experimental results demonstrate the effectiveness of our proposed approach.

Figure 8: Detection rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). Plot of detection rate (0.5 to 1) against $t_s$ (0.4 to 2) for $k_1 = k_2 = 1$, 4, and 7.

We then test AppFA with different values of $g$. The values of $t_s$ and $k_i$ ($i = 1, 2$) are set to 2 and 4, respectively. The experimental results are shown in Figures 10 and 11. As illustrated in these figures, the detection rates and the false positive rates change only slightly when $g$ increases. This can be explained by the fact that the Mahalanobis distance $\tau$ defined in (11) is calculated from the weighted mean vector of the profiles. The most similar app has the largest weight, and the added peers have smaller weights and thus contribute little to the final detection. As observed in our experiments, $g = 5$ is suitable for practical deployment.
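The weak dependence on $g$ can be illustrated with a toy Python sketch. It is not a reproduction of (11): the profiles, weights, and variances below are hypothetical, and the covariance matrix is assumed to be diagonal so that the Mahalanobis distance reduces to a variance-scaled Euclidean distance.

```python
import math


def weighted_mean(profiles, weights):
    """Weighted mean vector of the peer profiles."""
    total = sum(weights)
    dim = len(profiles[0])
    return [sum(w * p[d] for p, w in zip(profiles, weights)) / total
            for d in range(dim)]


def diagonal_mahalanobis(x, mean, variances):
    """Mahalanobis distance under the assumption of a diagonal covariance."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(x, mean, variances)))


tested = [4.0, 1.0]                               # hypothetical profile of the tested app
peers = [[1.0, 1.0], [1.2, 0.8], [3.0, 3.0]]      # hypothetical peers, most similar first
weights = [0.80, 0.15, 0.05]                      # hypothetical similarity weights
variances = [1.0, 1.0]                            # hypothetical per-feature variances

tau_top1 = diagonal_mahalanobis(tested, weighted_mean(peers[:1], weights[:1]), variances)
tau_all = diagonal_mahalanobis(tested, weighted_mean(peers, weights), variances)
print(tau_top1, tau_all)
```

Because the two extra peers carry small weights, the weighted mean moves little and the distance of the tested app changes only marginally, which matches the behavior observed in Figures 10 and 11.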

Further, AppFA is also tested with different values of the cluster number $p$. Note that the actual cluster number is 97 for TrafficSet 1. In order to evaluate how app traffic clustering impacts the detection of repackaged malware, we slightly change the value of $p$; the experimental results are shown in Figure 12. In the experiments, we set $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. As shown in Figure 12, the cluster number indeed impacts the detection rate of AppFA. In particular, when $p = 99$, the detection rate is only 88%; namely, only 82 malicious apps are correctly detected. However, when $p < 97$ ($p = 95$ and 96 in Figure 12), the detection rates are almost the same, namely, 97%. Therefore, AppFA can cope with a few unknown apps when carrying out app traffic clustering.

Table 5: Data summary.

Traffic set | Apps | Quantity | Number of collected flows (TCP and UDP) | Amount of data (bytes)
TrafficSet 1 | Selected malicious apps (for testing accuracy of traffic clustering and detection rates) | 93 | 223220 | 538G
TrafficSet 2 | Benign apps (for testing false positive rates of AppFA) | 100 | 305103 | 640G
TrafficSet 3 | All malicious apps (for malicious apps detection in local networks) | 1260 | ≈3.1 million | 1872G

Table 6: Experimental results of traffic clustering.

Number of HTTP flows for generating signature seeds | Number of flows correctly clustered | Number of misclustered flows
2 | 201357 | 21863
3 | 218946 | 4274
4 | 221384 | 1836
5 | 222619 | 601

Figure 9: False positive rates of AppFA with different values of $t_s$ and $k_i$ ($i = 1, 2$). Plot of false positive rate (0.0005 to 0.004) against $t_s$ (0.4 to 2) for $k_1 = k_2 = 1$, 4, and 7.

Finally, we evaluate AppFA with the remaining malicious apps besides the selected 93 samples. The rest of the malicious apps contain repackaged and other types of malware. These samples were found to produce some errors when run on our phones; however, they still generated considerable network traffic. We test AppFA with these apps to examine its general capability. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons are discussed in Section 6.4.

Figure 10: Detection rates of AppFA with different values of $g$. Plot of detection rate (0 to 1) for $g = 1$, 3, 5, 7, and 10.

Figure 11: False positive rates of AppFA with different values of $g$. Plot of false positive rate (0 to 7, in units of $10^{-3}$) for $g = 1$, 3, 5, 7, and 10.

Note that AppFA is designed to perform nearly real-time malicious app detection; thus efficiency is also a major concern. Table 8 presents the computational performance of the major steps in terms of average running time on the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, such as cloud resources, can be further used to speed up malicious app detection and to scale to more Internet traffic data.

Figure 12: Detection rates of AppFA with different values of $p$. Plot of detection rate (0 to 1) for $p = 95$, 96, 98, and 99.

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

Detection rate | False positive rate
73.4% | 3.5%

Table 8: The computational performance.

Procedure | Time
Constrained flow clustering (Algorithm 1) | 3 min
App identification (Algorithm 2) | 5.2 min
Network behavior profile construction | 1.7 min
Peer group analysis | 45 sec

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses statistical traffic features to detect malicious Android apps. Thus, AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we have created a new Android repackaged malware sample that communicates with all malicious servers over encrypted connections. In detail, we first modify AnserverBot's source code and make sure that its network connections are encrypted with the TLS protocol. Then we graft it into a popular game app, air.com.aceviral.motox3. Finally, we collect the network traffic of the repackaged air.com.aceviral.motox3 and its peer group. In the experiments, the selected peer group of the repackaged air.com.aceviral.motox3 is com.wordmobiles.bikeRacing, com.topfreegames.bikeracefreeworld, com.skgames.trafficrider, com.tomico.wheeliechallenge, and madmotoracingbike. We repeat the detection of the repackaged air.com.aceviral.motox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

Table 9: Experimental results when searching similar apps locally.

Detection rate | False positive rate
69.2% | 4.7%

6.3. Detection in Local Networks

In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps' peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed and is shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities, and thus their context (keywords) may be similar. Therefore, we use web-page searching technologies to match similar apps. The Matching function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) from the results returned by Algorithm 3.
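One possible realization of Algorithm 3 is sketched below in Python. It is an illustration under stated assumptions rather than the implementation evaluated in this paper: the Matching function is taken to be the cosine similarity of TF-IDF vectors with the plain logarithmic IDF $\log(n / df)$, and the app names and keywords in the usage note are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps an app name to its keyword list; returns name -> {term: weight}."""
    n = len(docs)
    df = Counter(term for words in docs.values() for term in set(words))
    vectors = {}
    for name, words in docs.items():
        tf = Counter(words)
        length = len(words) or 1                      # guard against empty keyword lists
        vectors[name] = {t: (c / length) * math.log(n / df[t]) for t, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_apps(tested_keywords, identified, m):
    """Return the m identified apps whose keywords best match the tested app."""
    docs = dict(identified)
    docs["__tested__"] = tested_keywords              # reserved key for the tested app
    vectors = tfidf_vectors(docs)
    scores = [(cosine(vectors["__tested__"], vectors[name]), name) for name in identified]
    scores.sort(key=lambda s: s[0], reverse=True)     # line (8): decreasing score
    return [name for _, name in scores[:m]]           # line (9): first m apps
```

For example, with identified apps `{"racing_a": ["race", "bike", "score", "ads"], "weather_b": ["forecast", "city", "ads"], "racing_c": ["race", "track", "ads"]}` and tested keywords `["race", "bike", "ads"]`, calling `similar_apps(..., m=2)` ranks the two racing apps ahead of the weather app. A keyword shared by every app, such as "ads" here, receives zero weight and does not influence the ranking.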

We have implemented the local similar app searching and evaluated AppFA's performance in local networks. All apps, including all malicious and benign apps, are considered as identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that the choice of proper peer groups affects the final detection significantly.

6.4. Discussion

Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be due to the fact that other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to the packet sizes and the number of flows. So the detection rate may decline if the apps' network behaviors are only slightly changed.

Compared to the latest work done in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9% as reported in [6]). However, Garg et al.'s method [6] considered only 18 malware apps and 14 genuine apps and assumed that the traffic generated by apps was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account, and constrained clustering is exploited to determine to which app the flows belong. The accuracy of clustering is also a factor affecting the detection rate. To confirm this, we further make the same assumption as Garg et al.'s method [6]; namely, we assume that the accuracy of Algorithm 1 is 100% and that we know exactly what traffic is generated by what app (in our experiments, we use the app Packet Capture to obtain perfect ground truth of which flows came from which app). Under this assumption, the experimental results are given in Table 10. Apparently, the detection rates are much improved.

Algorithm 3: Searching similar apps locally.

(1) extract all plaintext keywords from the HTTP headers of the tested app A; denote the set of keywords as W(A)
(2) extract all plaintext keywords from the HTTP headers of the other identified apps; denote the set of keywords of identified app i as W(i), i = 0, 1, ..., n − 1
(3) while i < n do (starting from i = 0)
(4)     Scores[i].value = Matching(W(A), W(i))
(5)     Scores[i].app = i
(6)     i++
(7) end while
(8) sort Scores[] by Scores[i].value in decreasing order
(9) return the first m entries Scores[i].app, i = 0, 1, ..., m − 1

Table 10: Experimental results when it is exactly known which traffic is generated by which app. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

 | Selected malicious apps | Remaining malicious apps
Detection rate | 98.3% | 85%
False positive rate | 0.03% | 2.7%

In practice, one can deploy both our method and other methods such as [6, 15] simultaneously. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps, whereas [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. So our work can complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features (refer to Table 4) of network traffic to detect malicious repackaged Android applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. However, this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus the attackers must remove some normal network connections to keep the network behaviors the same and evade detection. The removal of normal network connections, however, will impact the functionalities of apps and may cause errors. This gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may be abnormal as well. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then Kernel Principal Component Analysis is employed to construct app network behavior profiles and distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps, which avoids time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus, it is convenient to use and the cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The apps' network behaviors may be significantly changed by version updates and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by National Natural Science Foundation of China under Grants 61702282 and 61502250, Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016, CX2016026).


References

[1] Android statistics, Number of android applications, September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, "Assessing Privacy Risks in Android: A User-Centric Approach," in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] Does your mobile anti-virus app protect or infect you? The truth behind DU Antivirus Security, September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, "A survey of mobile malware in the wild," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, "The evolution of android malware and android analysis techniques," ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, "Network-based detection of Android malicious apps," International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, "Mobile malware detection through analysis of deviations in application network behavior," Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, "PBMDS: a behavior-based malware detection system for cellphone devices," in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec '10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: behavior-based malware detection system for android," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM '11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS '11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, "MADAM: a multi-level anomaly detector for android malware," in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., "Vetting undesirable behaviors in android apps with permission use analysis," in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, "Android Malware Clustering Analysis on Network-Level Behavior," in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, "Malware detection on mobile devices using distributed machine learning," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, "Bring your own device security issues and challenges," in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, "Simple and effective method for detecting abnormal internet behaviors of mobile devices," Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] Global mobile os market share in sales to end users from 1st quarter 2009 to 1st quarter 2017, 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] "It threat evolution q3 2017 statistics," November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., "Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting," IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, "Dissecting android malware: characterization and evolution," in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, "AndroSimilar: Robust signature for detecting variants of Android malware," Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the anserverbot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., "Android security: a survey of issues, malware penetration, and defenses," IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, "A survey on dynamic mobile malware detection," Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, "Behavior-based malware detection on mobile phone," in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, "Exposing mobile malware from the inside (or what is your mobile app really doing?)," Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, "The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones," in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, "Identifying diverse usage behaviors of smartphone apps," in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC '11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, "NetworkProfiler: Towards automatic fingerprinting of Android apps," in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., "Flowr: A self-learning system for classifying mobile application traffic," in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., "Automatic generation of mobile app signatures from traffic observations," in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., "Automatically identifying apps in mobile traffic," Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, "Identifying Mobile Applications for Encrypted Network Traffic," in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, "A novel active website fingerprinting attack against Tor anonymous system," in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, "A novel application classification attack against Tor," Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, "I know why you went to the clinic: Risks and realization of HTTPS traffic analysis," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, "Internet traffic classification using constrained clustering," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, "A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine," Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, "Stock fraud detection using peer group analysis," Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] "Ourmon network monitoring and anomaly detection system," April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, "Constrained Clustering Using Column Generation," in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, "GroddDroid: A gorilla for triggering malicious behaviors," in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, "DroidBot: A lightweight UI-guided test input generator for android," in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, "An information-theoretic perspective of tf-idf measures," Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.


Security and Communication Networks 11

Table 5 Data summary

Traffic set Apps Quantity Number of collectedflows (TCP and UPD) Amount of data (bytes)

TrafficSet 1 Selected malicious apps(for testing accuracy of traffic clustering and detection rates) 93 223220 538G

TrafficSet 2 Benign apps(for testing false positive rates of AppFA) 100 305103 640G

TrafficSet 3 All malicious apps(for malicious apps detection in local networks) 1260 asymp31 million 1872G

Table 6 Experimental results of traffic clustering

Number of HTTP flows forgenerating signature seeds

Number of flowscorrectly clustered

Number ofmisclustered

flows2 201357 218633 218946 42744 221384 18365 222619 601

k1 = k2 = 1

k1 = k2 = 4

k1 = k2 = 7

00005

0001

00015

0002

00025

0003

00035

0004

False

pos

itive

rate

06 08 1 12 14 16 18 2 04ts

Figure 9 False positive rates of AppFA with different values of 119905119904and 119896119894(119894 = 1 2)

malicious apps are correctly detected However when 119901 lt 97(119901 = 95 and 96 in Figure 12) the detection rates are almost thesame namely 97Therefore AppFAwill be competent witha few unknown apps when carrying out app traffic clustering

Finally, we evaluate AppFA with the remaining malicious apps besides the selected 93 samples. The rest of the malicious apps contain repackaged and other types of malware. These samples were found to have some errors when run on our phones; however, they still produced considerable network traffic. We test AppFA with these apps to examine its general capability. The parameters are set as follows: $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$. The experimental results are shown in Table 7. The detection rate is 73.4% and the false positive rate is 3.5%. Compared to the results displayed in Figures 8, 9, 10, and 11, the detection rate is significantly reduced and the false positive rate is increased. The possible reasons will be discussed in Section 6.4.

Figure 10: Detection rates of AppFA with different values of $g$. [Plot: detection rate versus $g$ for $g$ = 1, 3, 5, 7, and 10.]

Figure 11: False positive rates of AppFA with different values of $g$. [Plot: false positive rate ($\times 10^{-3}$) versus $g$ for $g$ = 1, 3, 5, 7, and 10.]

Figure 12: Detection rates of AppFA with different values of $p$. [Plot: detection rate versus $p$ for $p$ = 99, 98, 96, and 95.]

Table 7: Experimental results for the malicious apps besides the selected 93 samples.

Detection rate | False positive rate
73.4% | 3.5%

Table 8: The computational performance.

Procedures | Time
Constrained flow clustering (Algorithm 1) | 3 min
App identification (Algorithm 2) | 5.2 min
Network behavior profile construction | 1.7 min
Peer group analysis | 45 sec

Note that AppFA is designed to perform nearly real-time malicious app detection; thus, efficiency is also a big concern. Table 8 presents the computational performance of the major steps in terms of average running time on the experimental data. Generally, the proposed method can be deployed online to detect malicious apps. Specifically, peer group analysis is fast, typically taking less than one minute to perform malicious app detection. The app identification procedure is the most time-consuming step because signature matching and constrained flow clustering are carried out on each flow. Parallel computing, such as cloud resources, can be further used to speed up malicious app detection and scale up to more Internet traffic data.

Detection of Malicious Repackaged Apps for Encrypted Traffic. As shown in Table 4, AppFA does not consider traffic content and mainly uses statistical traffic features to detect malicious Android apps. Thus, AppFA can deal with encrypted traffic. Since MalGenome contains no app whose connections to attack servers are all encrypted, we created a new Android repackaged malware that communicates with all malicious servers over encrypted connections. In detail, we first modified AnserverBot’s source code to ensure that its network connections are encrypted with the TLS protocol. Then, we grafted it into a popular game app, air.com.aceviral.motox3. Finally, we collected the network traffic of the repackaged air.com.aceviral.motox3 and its peer group. In the experiments, the selected peer group of the repackaged air.com.aceviral.motox3 is com.wordmobiles.bikeRacing, com.topfreegames.bikeracefreeworld, com.skgames.trafficrider, com.tomico.wheeliechallenge, and madmotoracingbike. We repeated the detection of the repackaged air.com.aceviral.motox3 10 times, and the detection rate is 100%. The results show that AppFA can handle encrypted network traffic.

Table 9: Experimental results when searching similar apps locally.

Detection rate | False positive rate
69.2% | 4.7%

6.3. Detection in Local Networks. In the above experiments, the peer group for each app is determined with the help of an app store such as Google Play. That means AppFA has to access the Internet when performing malicious app detection. In fact, AppFA can also be enhanced to choose the apps’ peer groups locally, namely, from the set of already identified apps. Note that AppFA uses signature matching to identify apps on the network, and the identified ones can be treated as a mini app store. In this way, AppFA can work on local networks. To find similar apps among the already identified ones, an efficient method based on information retrieval technologies is proposed and shown in Algorithm 3.

The basic idea of Algorithm 3 is that similar apps may have similar functionalities, so their context (keywords) may be similar as well. Therefore, we use web-page searching technologies to match similar apps. The $Matching$ function in line (4) can be realized by TF-IDF (Term Frequency-Inverse Document Frequency) [43]. Again, peer groups can be chosen by (8) with the results returned by Algorithm 3.
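The keyword matching of Algorithm 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the paper says that $Matching$ can be realized by TF-IDF but does not fix the weighting or the similarity measure, so the smoothed IDF and the cosine similarity used here are assumptions, and the function names, app names, and keyword lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one TF-IDF weight dict per keyword list in `docs`."""
    n_docs = len(docs)
    # Document frequency: number of keyword lists containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight positive.
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search_similar_apps(tested_keywords, identified, m):
    """Return the names of the m identified apps most similar to the tested app.

    `identified` maps an app name to the keywords from its HTTP headers.
    """
    names = list(identified)
    vectors = tfidf_vectors([tested_keywords] + [identified[n] for n in names])
    query, others = vectors[0], vectors[1:]
    # Lines (3)-(7) of Algorithm 3: score every identified app.
    scores = [(cosine(query, vec), name) for name, vec in zip(names, others)]
    # Line (8): sort by score in decreasing order.
    scores.sort(key=lambda item: item[0], reverse=True)
    # Line (9): keep the first m apps.
    return [name for _, name in scores[:m]]


# Hypothetical keyword sets, for illustration only.
identified = {
    "bike_race": ["bike", "race", "game", "score"],
    "traffic_rider": ["bike", "traffic", "game"],
    "weather": ["forecast", "city", "temperature"],
}
tested = ["bike", "race", "game", "ads"]
print(search_similar_apps(tested, identified, 2))
```

With these toy keyword sets, the two racing games are ranked ahead of the weather app, which shares no keyword with the tested app. In a deployment, the returned names would be the candidate set from which the peer group is then selected.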

We have implemented the local similar-app searching and evaluated AppFA’s performance in local networks. All apps, including all malicious and benign apps, are considered as identified apps, and Algorithm 3 is carried out to find similar apps. When $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$, the experimental results for the 93 typical repackaged malware samples are listed in Table 9. The detection rate is 69.2%, lower than the 97% in Figure 8. The experimental results indicate that AppFA can work in local networks and that the choice of proper peer groups affects the final detection significantly.

6.4. Discussion. Figures 8, 9, 10, and 11 show that AppFA achieves high detection rates and low false positive rates when detecting repackaged apps. However, the detection rate may become lower when it is applied to other types of malicious apps, such as those sending SMS without notification. This may be due to the fact that other types of mobile malware transmit less data through the network than repackaged apps. Note that the detection features used in AppFA are mainly related to packet sizes and the number of flows. So the detection rate may decline if the apps’ network behaviors are only slightly changed.

Algorithm 3: Searching similar apps locally.

(1) extract all plaintext keywords from the HTTP headers of the tested app A; denote the set of keywords as W(A)
(2) extract all plaintext keywords from the HTTP headers of the other identified apps; denote the set of keywords of identified app i as W(i), i = 0, 1, ..., n − 1
(3) i = 0; while i < n do
(4) Scores[i].value = Matching(W(A), W(i))
(5) Scores[i].app = i
(6) i++
(7) end while
(8) sort Scores[] by Scores[i].value in decreasing order
(9) return the first m entries Scores[i].app, i = 0, 1, ..., m − 1

Compared to the latest work in [6], AppFA has a slightly lower detection rate (the detection rate is 95–99.9%, as reported in [6]). However, Garg et al.’s method [6] considered only 18 malware apps and 14 genuine apps and assumed that the traffic generated by each app was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account, and constrained clustering is exploited to determine to which app the flows belong. The accuracy of clustering is therefore also a factor affecting the detection rate. To confirm this, we further adopt the same assumption as Garg et al.’s method [6]; namely, we assume that the accuracy of Algorithm 1 is 100% and that we know exactly what traffic is generated by what app (in our experiments, we use the app Packet Capture to obtain perfect ground truth of which flows came from which app). Under this assumption, the experimental results are given in Table 10. Apparently, the detection rates are much improved.

Table 10: Experimental results when it is exactly known which traffic is generated by which app. In the experiments, $k_1 = k_2 = 4$, $t_s = 2$, and $g = 5$.

 | Selected malicious apps | Remaining malicious apps
Detection rate | 98.3% | 85%
False positive rate | 0.03% | 2.7%

In practice, one can deploy both our method and other methods such as [6, 15] simultaneously. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps. In fact, [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. So our work can complement existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features (refer to Table 4) of network traffic to detect malicious repackaged Android applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. But this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus, the attackers must remove some normal network connections to keep the network behaviors the same and evade detection. However, the removal of normal network connections will impact the functionalities of the apps and may cause errors. This gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may be abnormal as well. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then, Kernel Principal Component Analysis is employed to construct app network behavior profiles and to distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps, which avoids time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus, it is convenient to use and its cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The apps’ network behaviors may be significantly changed by version updates and thus cause false positives. This needs to be further investigated. In future work, AppFA will be extended to include more network traffic features and to detect more types of malicious apps. We will also make AppFA an open-source project to facilitate further research on malicious app detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61702282 and 61502250, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 17KJB520023, NUPTSF under Grant NY217143, and Nanjing Forestry University (GXL016, CX2016026).


References

[1] “Android statistics: Number of android applications,” September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, “Assessing Privacy Risks in Android: A User-Centric Approach,” in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] “Does your mobile anti-virus app protect or infect you? the truth behind du antivirus security,” September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, “A survey of mobile malware in the wild,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, “The evolution of android malware and android analysis techniques,” ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, “Network-based detection of Android malicious apps,” International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, “Mobile malware detection through analysis of deviations in application network behavior,” Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, “PBMDS: a behavior-based malware detection system for cellphone devices,” in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec ’10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, “Crowdroid: behavior-based malware detection system for android,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, “MADAM: a multi-level anomaly detector for android malware,” in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., “Vetting undesirable behaviors in android apps with permission use analysis,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, “Android Malware Clustering Analysis on Network-Level Behavior,” in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, “Malware detection on mobile devices using distributed machine learning,” in Proceedings of the 20th International Conference on Pattern Recognition (ICPR ’10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, “Bring your own device security issues and challenges,” in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, “Simple and effective method for detecting abnormal internet behaviors of mobile devices,” Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] “Global mobile os market share in sales to end users from 1st quarter 2009 to 1st quarter 2017,” 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems.

[17] “It threat evolution q3 2017 statistics,” November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131.

[18] L. Li, D. Li, T. F. Bissyande et al., “Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, “Dissecting android malware: characterization and evolution,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, “AndroSimilar: Robust signature for detecting variants of Android malware,” Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the anserverbot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., “Android security: a survey of issues, malware penetration, and defenses,” IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, “A survey on dynamic mobile malware detection,” Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, “Behavior-based malware detection on mobile phone,” in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, “Exposing mobile malware from the inside (or what is your mobile app really doing?),” Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, “The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones,” in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, “Identifying diverse usage behaviors of smartphone apps,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC ’11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, “NetworkProfiler: Towards automatic fingerprinting of Android apps,” in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., “Flowr: A self-learning system for classifying mobile application traffic,” in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569-570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., “Automatic generation of mobile app signatures from traffic observations,” in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., “Automatically identifying apps in mobile traffic,” Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, “Identifying Mobile Applications for Encrypted Network Traffic,” in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, “A novel active website fingerprinting attack against Tor anonymous system,” in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, “A novel application classification attack against Tor,” Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, “I know why you went to the clinic: Risks and realization of HTTPS traffic analysis,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, “Internet traffic classification using constrained clustering,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, “A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine,” Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, “Stock fraud detection using peer group analysis,” Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] “Ourmon network monitoring and anomaly detection system,” April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, “Constrained Clustering Using Column Generation,” in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, “GroddDroid: A gorilla for triggering malicious behaviors,” in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI-guided test input generator for android,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.


12 Security and Communication Networks

99989695value of p

0

01

02

03

04

05

06

07

08

09

1

dete

ctio

n ra

te (

)

Figure 12 Detection rates of AppFA with different values of 119901

Table 7 Experimental results for the malicious apps besides theselected 93 samples

Detection rate False positive rate734 35

Table 8 The computational performance

Procedures TimeConstrained flow clustering (Algorithm 1) 3minApp identification (Algorithm 2) 52minNetwork behavior profile construction 17minPeer group analysis 45 sec

steps in terms of average running time using the experimentaldata Generally the proposedmethod can be deployed onlineto detect malicious apps Specifically it is fast typically lessthan one minute to exploit peer group analysis to performmalicious app detection The app identification procedureis the most time-consuming step because the signaturematching and constrained flow clustering are carried out oneach flow Parallel computing such as cloud resources can befurther used to speed the malicious apps detection and scaleup to more Internet traffic data

Detection of Malicious Repackaged Apps for Encrypted TrafficAs shown in Table 4 AppFA does not consider traffic contentand mainly uses traffic statistical features to detect maliciousAndroid apps Thus AppFA can deal with encrypted trafficSince in MalGenome there is no app whose all connectionsto attack servers are encrypted we have created a newAndroid repackaged malware that communicates with allmalicious servers by encrypted connections In detail wefirst modify AnserverBotrsquos source code and make sure itsnetwork connectionswill be encrypted byTLS protocolThenwe graft it into a popular game app aircomaceviralmotox3Finally we collect network traffic of the repackaged aircomaceviralmotox3 and its peer group In the experiments

Table 9 Experimental results when searching similar apps locally

Detection rate False positive rate692 47

the selected peer group of the repackaged aircomaceviralmotox3 is comwordmobilesbikeRacing comtopfreegamesbikeracefreeworld comskgamestrafficrider comtomicowheel-iechallenge and madmotoracingbike We repeat to detectthe repackaged aircomaceviralmotox3 10 times and thedetection rate is 100 The results show that AppFA canhandle encrypted network traffic

63 Detection in Local Networks In the above experimentsthe peer group for each app is determinedwith the help of appstore such as theGoogle PlayThatmeansAppFAhas to accessthe Internet when performing malicious apps detection Infact AppFA can also be enhanced to choose the appsrsquo peergroups locally namely from the set of already identified appsNote that AppFA uses signature matching to identify appson the network and the identified ones can be treated as amini app store By this AppFA can work on local networksTo find out similar apps from the already identified ones anefficient method based on information retrieval technologiesis proposed and is shown in Algorithm 3

The basic idea of Algorithm 3 is that similar apps mayhave similar functionalities and the context (keywords) maybe similarTherefore we useweb-page searching technologiesto match similar app The 119872119886119905119888ℎ119894119899119892 function in line (4) canbe realized by TF-IDF (Term Frequency-Inverse DocumentFrequency) [43] Again peer groups can be chose by (8) withthe results returned by Algorithm 3

We have implemented the local similar app searching andevaluated AppFArsquos performance in local networks All appsincluding all malicious and benign apps are considered asidentified apps and Algorithm 3 is carried out to find outsimilar apps When 1198961 = 1198962 = 4 119905119904 = 2 and 119892 = 5 for the93 typical repackaged malwares the experimental results arelisted in Table 9 The detection rate is 692 lower than 97in Figure 8The experimental results indicate that AppFA canwork in local networks and the proper peer groups do affectthe final detection significantly

64 Discussion Figures 8 9 10 and 11 show that AppFAgets high detection rates and low false positive rates whendetecting repackaged apps However the detection rate maybecome lower when it is applied to other types of maliciousapps such as sending SMS without notification This may bedue to the fact that other types of mobile malwares transmitless data through network than repackaged apps Note thatthe detection features used inAppFA aremainly related to thepackets sizes and the number of flows So the detection ratemay be declined if the appsrsquo network behaviors are slightlychanged

Compared to the latest work done in [6] AppFA has aslightly lower detection rate (the detection rate is 95ndash999 asreported in [6]) However Garg et alrsquos method [6] consideredonly 18 malware apps and 14 genuine apps and assumed

Security and Communication Networks 13

(1) extract all plaintext keywords from HTTP headers of thetested app 119860 Denote the set of keywords as 119882(119860)

(2) extract all plaintext keywords from HTTP headers of otheridentified apps Denote the set of keywords of identifiedapp 119894 as 119882(119894) 119894 = 0 1 119899 minus 1

(3) while 0 le 119899 do(4) 119878119888119900119903119890119904[119894]V119886119897119906119890 = 119872119886119905119888ℎ119894119899119892(119882(119860) 119882(119894))(5) 119878119888119900119903119890119904[119894]119886119901119901 = 119894(6) 119894 + +(7) end while(8) sorting 119878119888119900119903119890119904[] with 119878119888119900119903119890119904[119894]V119886119897119906119890 by decreasing order(9) return the first 119898119878119888119900119903119890119904[119894]119886119901119901 119894 = 0 1 119898 minus 1

Algorithm 3 Searching similar apps locally

Table 10 Experimental results when the fact that what traffic isgenerated by what app is exactly known In the experiments 1198961 =1198962 = 4 119905119904 = 2 and 119892 = 5

Selected maliciousapps

Remaining maliciousapps

Detection rate 983 85False positiverate 003 27

that the traffic generated by apps was already known (theirdata were collected on the mobile device so their methodis actually a kind of client-side approach) In our workmuch larger samples are taken into account and constrainedclustering is exploited to determine to what app the flowsbelong The accuracy of clustering is also a factor affectingthe detection rate To confirm this we further take the sameassumption as Garg et alrsquos method [6] namely we assumethe accuracy of Algorithm 1 is 100 and know exactly whattraffic is generated by what app (in our experiments we useapp Packet Capture to obtain perfect ground truth of whatflows came from what app) At this time the experimentalresults are given in Table 10 Apparently the detection ratesare much improved

In practice one can deploy both our method and othermethods such as [6 15] simultaneously As indicated by ourexperiments AppFA is efficient in detecting repackaged andself-updatingmalicious apps In fact [15] is suitable forHTTPanalysis and [6] can detect CampC communication efficientlySo our work can be a complement of existing work onmalicious app detection

In this work AppFA mainly uses the statistical features(refer to Table 4) of network traffic to detect Android mali-cious repackaged applications Therefore attackers (repack-aged apps) may change the characteristics of their traffic toremain undetectable But it is quite difficult in practice Asillustrated in Section 2 repackaged apps usually introduceadditional network traffic thus the attackers must removesome normal network connections to keep network behav-iors the same to evade detection However the removal of

normal network connections will impact the functionalitiesof apps and may cause errors This gives additional clues todetect repackaged malware Meanwhile the disappearanceof some network connections may be abnormal as wellTherefore our proposed method is hard to evade

7 Conclusion

In this paper we propose a novel approach AppFA to detectmalicious apps at the network level In AppFA apps are firstidentified from network traffic by signature matching andconstrained clustering Then Kernel Principal ComponentAnalysis is employed to construct app network behavior pro-file and distinguish minor traffic variations from significantdifferences At last we take advantage of peer group analysisto detect malicious apps to avoid time-consuming offlinemodel training Notably AppFA does not need to installprograms or modify operating systems to collect featureinformation Thus it is very convenient to be used and thecost is low The experimental results show that AppFA candetect Android repackaged malware with the detection ratehigher than 90 and a false positive rate lower than 04

The appsrsquo network behaviorsmay be significantly changedby version update and thus cause false positives This needsto be further investigated In the future work AppFA willbe extended to include more network traffic features anddetect more types of malicious apps We will also makeAppFA an open-source project to facilitate further researchesof malicious app detection

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work is supported by National Natural Science Founda-tion of China under Grants 61702282 and 61502250 NaturalScience Foundation of the Jiangsu Higher Education Insti-tutions of China under Grant 17KJB520023 NUPTSF underGrant NY217143 and Nanjing Forestry University (GXL016CX2016026)

14 Security and Communication Networks

References

[1] Android statistics: Number of android applications, September 2017, http://www.appbrain.com/stats/number-of-android-apps.

[2] A. Mylonas, M. Theoharidou, and D. Gritzalis, “Assessing Privacy Risks in Android: A User-Centric Approach,” in Risk Assessment and Risk-Driven Testing, Lecture Notes in Computer Science, pp. 21–37, Springer International Publishing, Cham, 2014.

[3] “Does your mobile anti-virus app protect or infect you? The truth behind DU Antivirus Security,” September 2017, https://research.checkpoint.com/mobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security/.

[4] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, “A survey of mobile malware in the wild,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 3–14, October 2011.

[5] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, “The evolution of android malware and android analysis techniques,” ACM Computing Surveys, vol. 49, no. 4, Article ID 3017427, 2017.

[6] S. Garg, S. K. Peddoju, and A. K. Sarje, “Network-based detection of Android malicious apps,” International Journal of Information Security, pp. 1–16, 2016.

[7] A. Shabtai, L. Tenenboim-Chekina, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici, “Mobile malware detection through analysis of deviations in application network behavior,” Computers & Security, vol. 43, pp. 1–18, 2014.

[8] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, “PBMDS: a behavior-based malware detection system for cellphone devices,” in Proceedings of the 3rd ACM Conference on Wireless Network Security (WiSec ’10), pp. 37–48, Hoboken, NJ, USA, March 2010.

[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, “Crowdroid: behavior-based malware detection system for android,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (SPSM ’11), held in Association with the 18th ACM Conference on Computer and Communications Security (CCS ’11), pp. 15–25, October 2011.

[10] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, “MADAM: a multi-level anomaly detector for android malware,” in Computer Network Security, vol. 7531 of Lecture Notes in Computer Science, pp. 240–253, Springer, Berlin, Germany, 2012.

[11] Y. Zhang, M. Yang, B. Xu et al., “Vetting undesirable behaviors in android apps with permission use analysis,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 611–622, 2013.

[12] S. Wang, Z. Chen, X. Li, L. Wang, K. Ji, and C. Zhao, “Android Malware Clustering Analysis on Network-Level Behavior,” in Intelligent Computing Theories and Application, vol. 10361 of Lecture Notes in Computer Science, pp. 796–807, Springer International Publishing, Cham, Switzerland, 2017.

[13] A. S. Shamili, C. Bauckhage, and T. Alpcan, “Malware detection on mobile devices using distributed machine learning,” in Proceedings of the 20th International Conference on Pattern Recognition (ICPR ’10), pp. 4348–4351, August 2010.

[14] Y. Wang, J. Wei, and K. Vangury, “Bring your own device security issues and challenges,” in Proceedings of the 2014 IEEE 11th Consumer Communications and Networking Conference, CCNC 2014, pp. 80–85, USA, January 2014.

[15] P. S. Chen, S.-C. Lin, and C.-H. Sun, “Simple and effective method for detecting abnormal internet behaviors of mobile devices,” Information Sciences, vol. 321, Article ID 11536, pp. 193–204, 2015.

[16] Global mobile OS market share in sales to end users from 1st quarter 2009 to 1st quarter 2017, 2017, https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems/.

[17] “IT threat evolution Q3 2017 statistics,” November 2017, https://securelist.com/it-threat-evolution-q3-2017-statistics/83131/.

[18] L. Li, D. Li, T. F. Bissyande et al., “Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1269–1284, 2017.

[19] Y. Zhou and X. Jiang, “Dissecting android malware: characterization and evolution,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy, pp. 95–109, San Francisco, Calif, USA, May 2012.

[20] P. Faruki, V. Laxmi, A. Bharmal, M. S. Gaur, and V. Ganmoor, “AndroSimilar: Robust signature for detecting variants of Android malware,” Journal of Information Security and Applications, vol. 22, pp. 66–80, 2015.

[21] Y. Zhou and X. Jiang, An analysis of the AnserverBot trojan, 2011, http://www.csc.ncsu.edu/faculty/jiang/pubs/AnserverBot_Analysis.pdf.

[22] P. Faruki, A. Bharmal, V. Laxmi et al., “Android security: a survey of issues, malware penetration, and defenses,” IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 998–1022, 2015.

[23] P. Yan and Z. Yan, “A survey on dynamic mobile malware detection,” Software Quality Journal, pp. 1–29, 2017.

[24] S. Dai, Y. Liu, T. Wang, T. Wei, and W. Zou, “Behavior-based malware detection on mobile phone,” in Proceedings of the 2010 6th International Conference on Wireless Communications, Networking and Mobile Computing, WiCOM 2010, China, September 2010.

[25] D. Damopoulos, G. Kambourakis, S. Gritzalis, and S. O. Park, “Exposing mobile malware from the inside (or what is your mobile app really doing?),” Peer-to-Peer Networking and Applications, vol. 7, no. 4, pp. 687–697, 2014.

[26] D. Damopoulos, G. Kambourakis, and G. Portokalidis, “The best of both worlds: A framework for the synergistic operation of host and cloud anomaly-based IDS for smartphones,” in Proceedings of the 7th European Workshop on System Security, EuroSec 2014, Netherlands, April 2014.

[27] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman, “Identifying diverse usage behaviors of smartphone apps,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC ’11), pp. 329–344, November 2011.

[28] S. Dai, A. Tongaonkar, X. Wang, A. Nucci, and D. Song, “NetworkProfiler: Towards automatic fingerprinting of Android apps,” in Proceedings of the 32nd IEEE Conference on Computer Communications, IEEE INFOCOM 2013, pp. 809–817, Italy, April 2013.

[29] Q. Xu, T. Andrews, Y. Liao et al., “Flowr: A self-learning system for classifying mobile application traffic,” in Proceedings of the 2014 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 569–570, USA, June 2014.

[30] Q. Xu, Y. Liao, S. Miskovic et al., “Automatic generation of mobile app signatures from traffic observations,” in Proceedings of the 34th IEEE Annual Conference on Computer Communications and Networks, IEEE INFOCOM 2015, pp. 1481–1489, Hong Kong, May 2015.

[31] J. Sun, L. She, H. Chen et al., “Automatically identifying apps in mobile traffic,” Concurrency Computation, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, “Identifying Mobile Applications for Encrypted Network Traffic,” in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, “A novel active website fingerprinting attack against Tor anonymous system,” in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, “A novel application classification attack against Tor,” Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, “I know why you went to the clinic: Risks and realization of HTTPS traffic analysis,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, “Internet traffic classification using constrained clustering,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, “A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine,” Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, “Stock fraud detection using peer group analysis,” Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] “Ourmon: network monitoring and anomaly detection system,” April 2013, http://ourmon.sourceforge.net/.

[40] B. Babaki, T. Guns, and S. Nijssen, “Constrained Clustering Using Column Generation,” in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, “GroddDroid: A gorilla for triggering malicious behaviors,” in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI-guided test input generator for android,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.


(1) extract all plaintext keywords from the HTTP headers of the tested app A. Denote the set of keywords as W(A)
(2) extract all plaintext keywords from the HTTP headers of the other identified apps. Denote the set of keywords of identified app i as W(i), i = 0, 1, ..., n − 1
(3) i = 0; while i < n do
(4)     Scores[i].value = Matching(W(A), W(i))
(5)     Scores[i].app = i
(6)     i++
(7) end while
(8) sort Scores[] by Scores[i].value in decreasing order
(9) return the first m Scores[i].app, i = 0, 1, ..., m − 1

Algorithm 3: Searching similar apps locally.
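The steps of Algorithm 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the Matching function is not specified in this section, so Jaccard similarity between keyword sets is used here purely as an assumed stand-in, and the names `matching`, `search_similar_apps`, and the sample keyword sets are hypothetical.

```python
from typing import Dict, List, Set


def matching(w_a: Set[str], w_i: Set[str]) -> float:
    """Assumed stand-in for Matching(): Jaccard similarity of two keyword sets."""
    union = w_a | w_i
    if not union:
        return 0.0
    return len(w_a & w_i) / len(union)


def search_similar_apps(
    w_a: Set[str], identified: Dict[str, Set[str]], m: int
) -> List[str]:
    """Return the m identified apps whose keyword sets best match W(A)."""
    # Steps (3)-(7): score every identified app against the tested app.
    scores = [(matching(w_a, w_i), app) for app, w_i in identified.items()]
    # Step (8): sort by score in decreasing order (ties broken by app name).
    scores.sort(key=lambda s: (-s[0], s[1]))
    # Step (9): keep the first m apps.
    return [app for _, app in scores[:m]]


# Hypothetical keyword sets extracted from plaintext HTTP headers.
tested = {"user-agent:okhttp", "host:api.example.com", "x-app-id"}
identified = {
    "app_a": {"user-agent:okhttp", "host:api.example.com", "x-app-id"},
    "app_b": {"user-agent:okhttp", "host:cdn.example.net"},
    "app_c": {"host:other.example.org"},
}
print(search_similar_apps(tested, identified, 2))  # ['app_a', 'app_b']
```

Any other set-similarity or weighted keyword score could be substituted for `matching` without changing the rest of the sketch.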

Table 10: Experimental results when it is exactly known which app generates which traffic. In the experiments, k_1 = k_2 = 4, t_s = 2, and g = 5.

                          Selected malicious apps    Remaining malicious apps
Detection rate (%)        98.3                       85
False positive rate (%)   0.03                       2.7

that the traffic generated by apps was already known (their data were collected on the mobile device, so their method is actually a kind of client-side approach). In our work, much larger samples are taken into account, and constrained clustering is exploited to determine to which app the flows belong. The accuracy of clustering is also a factor affecting the detection rate. To confirm this, we further make the same assumption as Garg et al.’s method [6]; namely, we assume that the accuracy of Algorithm 1 is 100% and that we know exactly which app generates which traffic (in our experiments, we use the app Packet Capture to obtain perfect ground truth of which flows came from which app). Under this assumption, the experimental results are given in Table 10. The detection rates are clearly much improved.
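For reference, the two quantities reported in Table 10 are conventionally computed from confusion-matrix counts, as in the short sketch below. It assumes the standard definitions (detection rate as the true positive rate, and false positive rate as the fraction of benign apps flagged as malicious); the function names and the counts in the usage lines are hypothetical and are not taken from the experiments.

```python
def detection_rate(true_positives: int, false_negatives: int) -> float:
    """Fraction of malicious apps that are flagged: TP / (TP + FN)."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Fraction of benign apps that are wrongly flagged: FP / (FP + TN)."""
    total = false_positives + true_negatives
    return false_positives / total if total else 0.0


# Hypothetical counts, for illustration only.
print(detection_rate(true_positives=9, false_negatives=1))
print(false_positive_rate(false_positives=1, true_negatives=99))
```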

In practice, one can deploy both our method and other methods such as [6, 15] simultaneously. As indicated by our experiments, AppFA is efficient in detecting repackaged and self-updating malicious apps. In fact, [15] is suitable for HTTP analysis and [6] can detect C&C communication efficiently. So our work can be a complement to existing work on malicious app detection.

In this work, AppFA mainly uses the statistical features (refer to Table 4) of network traffic to detect repackaged malicious Android applications. Therefore, attackers (repackaged apps) may change the characteristics of their traffic to remain undetected. However, this is quite difficult in practice. As illustrated in Section 2, repackaged apps usually introduce additional network traffic; thus, to keep the network behavior unchanged and evade detection, the attackers must remove some normal network connections. Removing normal network connections, however, will impact the functionalities of the apps and may cause errors, which gives additional clues for detecting repackaged malware. Meanwhile, the disappearance of some network connections may itself be abnormal. Therefore, our proposed method is hard to evade.

7. Conclusion

In this paper, we propose a novel approach, AppFA, to detect malicious apps at the network level. In AppFA, apps are first identified from network traffic by signature matching and constrained clustering. Then, Kernel Principal Component Analysis is employed to construct app network behavior profiles and to distinguish minor traffic variations from significant differences. Finally, we take advantage of peer group analysis to detect malicious apps and avoid time-consuming offline model training. Notably, AppFA does not need to install programs or modify operating systems to collect feature information. Thus, it is convenient to use and its cost is low. The experimental results show that AppFA can detect Android repackaged malware with a detection rate higher than 90% and a false positive rate lower than 0.4%.

The appsrsquo network behaviorsmay be significantly changedby version update and thus cause false positives This needsto be further investigated In the future work AppFA willbe extended to include more network traffic features anddetect more types of malicious apps We will also makeAppFA an open-source project to facilitate further researchesof malicious app detection

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work is supported by National Natural Science Founda-tion of China under Grants 61702282 and 61502250 NaturalScience Foundation of the Jiangsu Higher Education Insti-tutions of China under Grant 17KJB520023 NUPTSF underGrant NY217143 and Nanjing Forestry University (GXL016CX2016026)

14 Security and Communication Networks

References

[1] Android statistics Number of android applications September2017 httpwwwappbraincomstatsnumber-of-android-apps

[2] A Mylonas M Theoharidou and D Gritzalis ldquoAssessingPrivacy Risks in Android A User-Centric Approachrdquo in RiskAssessment and Risk-Driven Testing Lecture Notes in ComputerScience pp 21ndash37 Springer International Publishing Cham2014

[3] Does Does your mobile anti-virus app protect or infect you thetruth behind du antivirus security September 2017 httpsresearchcheckpointcommobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security

[4] A P Felt M Finifter E Chin S Hanna and D Wagner ldquoAsurvey of mobile malware in the wildrdquo in Proceedings of the 1stACM Workshop on Security and Privacy in Smartphones andMobile Devices (SPSM rsquo11) held in Association with the 18th ACMConference on Computer and Communications Security (CCSrsquo11) pp 3ndash14 October 2011

[5] K Tam A Feizollah N B Anuar R Salleh and L Caval-laro ldquoThe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys vol 49 no 4 Article ID3017427 2017

[6] S Garg S K Peddoju and A K Sarje ldquoNetwork-based detec-tion of Android malicious appsrdquo International Journal of Infor-mation Security pp 1ndash16 2016

[7] A Shabtai L Tenenboim-Chekina D Mimran L Rokach BShapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquo Com-puters amp Security vol 43 pp 1ndash18 2014

[8] L Xie X Zhang J-P Seifert and S Zhu ldquoPBMDS a behavior-based malware detection system for cellphone devicesrdquo inProceedings of the 3rd ACM Conference on Wireless NetworkSecurity (WiSec rsquo10) pp 37ndash48 Hoboken NJ USAMarch 2010

[9] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacy inSmartphones and Mobile Devices (SPSM rsquo11) Held in Associationwith the 18th ACM Conference on Computer and Communica-tions Security (CCS rsquo11) pp 15ndash25 October 2011

[10] GDini FMartinelli A Saracino andD Sgandurra ldquoMADAMa multi-level anomaly detector for android malwarerdquo in Com-puter Network Security vol 7531 of Lecture Notes in ComputerScience pp 240ndash253 Springer Berlin Germany 2012

[11] Y Zhang M Yang B Xu et al ldquoVetting undesirable behaviorsin android apps with permission use analysisrdquo in Proceedingsof the in Proceedings of the 2013 ACM SIGSAC conference onComputer amp communications security pp 611ndash622 2013

[12] S Wang Z Chen X Li L Wang K Ji and C Zhao ldquoAndroidMalware Clustering Analysis on Network-Level Behaviorrdquo inIntelligent Computing Theories and Application vol 10361 ofLecture Notes in Computer Science pp 796ndash807 SpringerInternational Publishing Cham Switzerland 2017

[13] A S Shamili C Bauckhage and T Alpcan ldquoMalware detectionon mobile devices using distributed machine learningrdquo inProceedings of the 20th International Conference on PatternRecognition (ICPR rsquo10) pp 4348ndash4351 August 2010

[14] Y Wang J Wei and K Vangury ldquoBring your own device secu-rity issues and challengesrdquo in Proceedings of the 2014 IEEE 11thConsumer Communications and Networking Conference CCNC2014 pp 80ndash85 USA January 2014

[15] P S Chen S-C Lin and C-H Sun ldquoSimple and effectivemethod for detecting abnormal internet behaviors of mobiledevicesrdquo Information Sciences vol 321 Article ID 11536 pp 193ndash204 2015

[16] Global mobile os market share in sales to end users from 1stquarter 2009 to 1st quarter 2017 2017 httpswwwstatistacomstatistics266136global-market-share-held-by-smartphone-op-eratingsystems

[17] ldquoIt threat evolution q3 2017 statisticsrdquo November 2017 httpssecurelistcomit-threat-evolution-q3-2017-statistics83131

[18] L Li D Li T F Bissyande et al ldquoUnderstanding Android AppPiggybacking A Systematic Study of Malicious Code GraftingrdquoIEEE Transactions on Information Forensics and Security vol 12no 6 pp 1269ndash1284 2017

[19] Y Zhou and X Jiang ldquoDissecting android malware characteri-zation and evolutionrdquo in Proceedings of the 33rd IEEE Sympo-sium on Security and Privacy pp 95ndash109 San Francisco CalifUSA May 2012

[20] P Faruki V Laxmi A Bharmal M S Gaur and V Gan-moor ldquoAndroSimilar Robust signature for detecting variantsof Android malwarerdquo Journal of Information Security andApplications vol 22 pp 66ndash80 2015

[21] Y Zhou and X JiangAn analysis of the anserverbot trojan 2011httpwwwcscncsuedufacultyjiangpubsAnserverBot Ana-lysispdf

[22] P Faruki A Bharmal V Laxmi et al ldquoAndroid security a sur-vey of issues malware penetration and defensesrdquo IEEE Com-munications Surveys amp Tutorials vol 17 no 2 pp 998ndash10222015

[23] P Yan and Z Yan ldquoA survey on dynamic mobile malwaredetectionrdquo Software Quality Journal pp 1ndash29 2017

[24] S Dai Y Liu T Wang T Wei and W Zou ldquoBehavior-basedmalware detection on mobile phonerdquo in Proceedings of the 20106th International Conference on Wireless Communications Net-working and Mobile Computing WiCOM 2010 China Septem-ber 2010

[25] D Damopoulos G Kambourakis S Gritzalis and S O ParkldquoExposing mobile malware from the inside (or what is yourmobile app really doing)rdquo Peer-to-Peer Networking and Appli-cations vol 7 no 4 pp 687ndash697 2014

[26] D Damopoulos G Kambourakis and G Portokalidis ldquoThebest of both worlds A framework for the synergistic operationof host and cloud anomaly-based IDS for smartphonesrdquo inProceedings of the 7th European Workshop on System SecurityEuroSec 2014 Netherlands April 2014

[27] Q Xu J Erman A Gerber Z Mao J Pang and S Venkatara-man ldquoIdentifying diverse usage behaviors of smartphone appsrdquoin Proceedings of the ACM SIGCOMM Internet MeasurementConference (IMC rsquo11) pp 329ndash344 November 2011

[28] S Dai A Tongaonkar X Wang A Nucci and D Song ldquoNet-workProfiler Towards automatic fingerprinting of Androidappsrdquo in Proceedings of the 32nd IEEE Conference on ComputerCommunications IEEE INFOCOM 2013 pp 809ndash817 ItalyApril 2013

[29] Q Xu T Andrews Y Liao et al ldquoFlowr A self-learning systemfor classifying mobile application trafficrdquo in Proceedings of the2014 ACM SIGMETRICS International Conference on Measure-ment and Modeling of Computer Systems SIGMETRICS 2014pp 569-570 usa June 2014

[30] Q Xu Y Liao S Miskovic et al ldquoAutomatic generation ofmobile app signatures from traffic observationsrdquo in Proceedings

Security and Communication Networks 15

of the 34th IEEE Annual Conference on Computer Communica-tions and Networks IEEE INFOCOM 2015 pp 1481ndash1489 HongKong May 2015

[31] J Sun L She H Chen et al ldquoAutomatically identifying appsin mobile trafficrdquo Concurrency Computation vol 28 no 14 pp3927ndash3941 2016

[32] G He B Xu and H Zhu ldquoIdentifying Mobile Applications forEncrypted Network Trafficrdquo in Proceedings of the 5th Interna-tional Conference on Advanced Cloud and Big Data CBD 2017pp 279ndash284 China August 2017

[33] G He M Yang X Gu J Luo and Y Ma ldquoA novel activewebsite fingerprinting attack against Tor anonymous systemrdquoin Proceedings of the 2014 18th IEEE International Conference onComputer Supported CooperativeWork inDesign CSCWD2014pp 112ndash117 Taiwan May 2014

[34] G He M Yang J Luo and X Gu ldquoA novel application classi-fication attack against Torrdquo Concurrency and ComputationPractice and Experience vol 27 no 18 pp 5640ndash5661 2015

[35] B Miller L Huang A D Joseph and J D Tygar ldquoI know whyyou went to the clinic Risks and realization of HTTPS trafficanalysisrdquo Lecture Notes in Computer Science (including subseriesLecture Notes in Artificial Intelligence and Lecture Notes in Bio-informatics) Preface vol 8555 pp 143ndash163 2014

[36] Y Wang Y Xiang J Zhang W Zhou G Wei and L T YangldquoInternet traffic classification using constrained clusteringrdquoIEEE Transactions on Parallel and Distributed Systems vol 25no 11 pp 2932ndash2943 2014

[37] L J Cao K S Chua W K Chong H P Lee and Q M Gu ldquoAcomparison of PCA KPCA and ICA for dimensionality reduc-tion in support vector machinerdquo Neurocomputing vol 55 no1-2 pp 321ndash336 2003

[38] Y Kim and S Y Sohn ldquoStock fraud detection using peer groupanalysisrdquo Expert Systems with Applications vol 39 no 10 pp8986ndash8992 2012

[39] ldquoOurmon network monitoring and anomaly detection systemrdquoApril 2013 httpourmonsourceforgenet

[40] B Babaki T Guns and S Nijssen ldquoConstrained ClusteringUsing Column Generationrdquo in Integration of AI and OR Tech-niques in Constraint Programming vol 8451 of Lecture Notes inComputer Science pp 438ndash454 Springer International Publish-ing Cham 2014

[41] A Abraham R Andriatsimandefitra A Brunelat J-F Lalandeand V Viet Triem Tong ldquoGroddDroid A gorilla for triggeringmalicious behaviorsrdquo in Proceedings of the 10th InternationalConference on Malicious and Unwanted Software MALWARE2015 pp 119ndash127 USA October 2015

[42] Y Li Z Yang Y Guo and X Chen ldquoDroidBot A lightweightUI-guided test input generator for androidrdquo inProceedings of the39th IEEEACM International Conference on Software Engineer-ing Companion ICSE-C 2017 pp 23ndash26 Argentina May 2017

[43] A Aizawa ldquoAn information-theoretic perspective of tf-idfmeasuresrdquo Information Processing amp Management vol 39 no1 pp 45ndash65 2003


References

[1] Android statistics Number of android applications September2017 httpwwwappbraincomstatsnumber-of-android-apps

[2] A Mylonas M Theoharidou and D Gritzalis ldquoAssessingPrivacy Risks in Android A User-Centric Approachrdquo in RiskAssessment and Risk-Driven Testing Lecture Notes in ComputerScience pp 21ndash37 Springer International Publishing Cham2014

[3] Does Does your mobile anti-virus app protect or infect you thetruth behind du antivirus security September 2017 httpsresearchcheckpointcommobile-anti-virus-app-protect-infect-truth-behind-du-antivirus-security

[4] A P Felt M Finifter E Chin S Hanna and D Wagner ldquoAsurvey of mobile malware in the wildrdquo in Proceedings of the 1stACM Workshop on Security and Privacy in Smartphones andMobile Devices (SPSM rsquo11) held in Association with the 18th ACMConference on Computer and Communications Security (CCSrsquo11) pp 3ndash14 October 2011

[5] K Tam A Feizollah N B Anuar R Salleh and L Caval-laro ldquoThe evolution of android malware and android analysistechniquesrdquo ACM Computing Surveys vol 49 no 4 Article ID3017427 2017

[6] S Garg S K Peddoju and A K Sarje ldquoNetwork-based detec-tion of Android malicious appsrdquo International Journal of Infor-mation Security pp 1ndash16 2016

[7] A Shabtai L Tenenboim-Chekina D Mimran L Rokach BShapira and Y Elovici ldquoMobile malware detection throughanalysis of deviations in application network behaviorrdquo Com-puters amp Security vol 43 pp 1ndash18 2014

[8] L Xie X Zhang J-P Seifert and S Zhu ldquoPBMDS a behavior-based malware detection system for cellphone devicesrdquo inProceedings of the 3rd ACM Conference on Wireless NetworkSecurity (WiSec rsquo10) pp 37ndash48 Hoboken NJ USAMarch 2010

[9] I Burguera U Zurutuza and S Nadjm-Tehrani ldquoCrowdroidbehavior-based malware detection system for androidrdquo inProceedings of the 1st ACMWorkshop on Security and Privacy inSmartphones and Mobile Devices (SPSM rsquo11) Held in Associationwith the 18th ACM Conference on Computer and Communica-tions Security (CCS rsquo11) pp 15ndash25 October 2011

[10] GDini FMartinelli A Saracino andD Sgandurra ldquoMADAMa multi-level anomaly detector for android malwarerdquo in Com-puter Network Security vol 7531 of Lecture Notes in ComputerScience pp 240ndash253 Springer Berlin Germany 2012

[11] Y Zhang M Yang B Xu et al ldquoVetting undesirable behaviorsin android apps with permission use analysisrdquo in Proceedingsof the in Proceedings of the 2013 ACM SIGSAC conference onComputer amp communications security pp 611ndash622 2013

[12] S Wang Z Chen X Li L Wang K Ji and C Zhao ldquoAndroidMalware Clustering Analysis on Network-Level Behaviorrdquo inIntelligent Computing Theories and Application vol 10361 ofLecture Notes in Computer Science pp 796ndash807 SpringerInternational Publishing Cham Switzerland 2017

[13] A S Shamili C Bauckhage and T Alpcan ldquoMalware detectionon mobile devices using distributed machine learningrdquo inProceedings of the 20th International Conference on PatternRecognition (ICPR rsquo10) pp 4348ndash4351 August 2010

[14] Y Wang J Wei and K Vangury ldquoBring your own device secu-rity issues and challengesrdquo in Proceedings of the 2014 IEEE 11thConsumer Communications and Networking Conference CCNC2014 pp 80ndash85 USA January 2014

[15] P S Chen S-C Lin and C-H Sun ldquoSimple and effectivemethod for detecting abnormal internet behaviors of mobiledevicesrdquo Information Sciences vol 321 Article ID 11536 pp 193ndash204 2015

[16] Global mobile os market share in sales to end users from 1stquarter 2009 to 1st quarter 2017 2017 httpswwwstatistacomstatistics266136global-market-share-held-by-smartphone-op-eratingsystems

[17] ldquoIt threat evolution q3 2017 statisticsrdquo November 2017 httpssecurelistcomit-threat-evolution-q3-2017-statistics83131

[18] L Li D Li T F Bissyande et al ldquoUnderstanding Android AppPiggybacking A Systematic Study of Malicious Code GraftingrdquoIEEE Transactions on Information Forensics and Security vol 12no 6 pp 1269ndash1284 2017

[19] Y Zhou and X Jiang ldquoDissecting android malware characteri-zation and evolutionrdquo in Proceedings of the 33rd IEEE Sympo-sium on Security and Privacy pp 95ndash109 San Francisco CalifUSA May 2012

[20] P Faruki V Laxmi A Bharmal M S Gaur and V Gan-moor ldquoAndroSimilar Robust signature for detecting variantsof Android malwarerdquo Journal of Information Security andApplications vol 22 pp 66ndash80 2015

[21] Y Zhou and X JiangAn analysis of the anserverbot trojan 2011httpwwwcscncsuedufacultyjiangpubsAnserverBot Ana-lysispdf

[22] P Faruki A Bharmal V Laxmi et al ldquoAndroid security a sur-vey of issues malware penetration and defensesrdquo IEEE Com-munications Surveys amp Tutorials vol 17 no 2 pp 998ndash10222015

[23] P Yan and Z Yan ldquoA survey on dynamic mobile malwaredetectionrdquo Software Quality Journal pp 1ndash29 2017

[24] S Dai Y Liu T Wang T Wei and W Zou ldquoBehavior-basedmalware detection on mobile phonerdquo in Proceedings of the 20106th International Conference on Wireless Communications Net-working and Mobile Computing WiCOM 2010 China Septem-ber 2010

[25] D Damopoulos G Kambourakis S Gritzalis and S O ParkldquoExposing mobile malware from the inside (or what is yourmobile app really doing)rdquo Peer-to-Peer Networking and Appli-cations vol 7 no 4 pp 687ndash697 2014

[26] D Damopoulos G Kambourakis and G Portokalidis ldquoThebest of both worlds A framework for the synergistic operationof host and cloud anomaly-based IDS for smartphonesrdquo inProceedings of the 7th European Workshop on System SecurityEuroSec 2014 Netherlands April 2014

[27] Q Xu J Erman A Gerber Z Mao J Pang and S Venkatara-man ldquoIdentifying diverse usage behaviors of smartphone appsrdquoin Proceedings of the ACM SIGCOMM Internet MeasurementConference (IMC rsquo11) pp 329ndash344 November 2011

[28] S Dai A Tongaonkar X Wang A Nucci and D Song ldquoNet-workProfiler Towards automatic fingerprinting of Androidappsrdquo in Proceedings of the 32nd IEEE Conference on ComputerCommunications IEEE INFOCOM 2013 pp 809ndash817 ItalyApril 2013

[29] Q Xu T Andrews Y Liao et al ldquoFlowr A self-learning systemfor classifying mobile application trafficrdquo in Proceedings of the2014 ACM SIGMETRICS International Conference on Measure-ment and Modeling of Computer Systems SIGMETRICS 2014pp 569-570 usa June 2014

[30] Q Xu Y Liao S Miskovic et al ldquoAutomatic generation ofmobile app signatures from traffic observationsrdquo in Proceedings

Security and Communication Networks 15

of the 34th IEEE Annual Conference on Computer Communica-tions and Networks IEEE INFOCOM 2015 pp 1481ndash1489 HongKong May 2015

[31] J. Sun, L. She, H. Chen, et al., “Automatically identifying apps in mobile traffic,” Concurrency and Computation: Practice and Experience, vol. 28, no. 14, pp. 3927–3941, 2016.

[32] G. He, B. Xu, and H. Zhu, “Identifying Mobile Applications for Encrypted Network Traffic,” in Proceedings of the 5th International Conference on Advanced Cloud and Big Data, CBD 2017, pp. 279–284, China, August 2017.

[33] G. He, M. Yang, X. Gu, J. Luo, and Y. Ma, “A novel active website fingerprinting attack against Tor anonymous system,” in Proceedings of the 2014 18th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2014, pp. 112–117, Taiwan, May 2014.

[34] G. He, M. Yang, J. Luo, and X. Gu, “A novel application classification attack against Tor,” Concurrency and Computation: Practice and Experience, vol. 27, no. 18, pp. 5640–5661, 2015.

[35] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, “I know why you went to the clinic: Risks and realization of HTTPS traffic analysis,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 8555, pp. 143–163, 2014.

[36] Y. Wang, Y. Xiang, J. Zhang, W. Zhou, G. Wei, and L. T. Yang, “Internet traffic classification using constrained clustering,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 2932–2943, 2014.

[37] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, “A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine,” Neurocomputing, vol. 55, no. 1-2, pp. 321–336, 2003.

[38] Y. Kim and S. Y. Sohn, “Stock fraud detection using peer group analysis,” Expert Systems with Applications, vol. 39, no. 10, pp. 8986–8992, 2012.

[39] “Ourmon network monitoring and anomaly detection system,” April 2013, http://ourmon.sourceforge.net.

[40] B. Babaki, T. Guns, and S. Nijssen, “Constrained Clustering Using Column Generation,” in Integration of AI and OR Techniques in Constraint Programming, vol. 8451 of Lecture Notes in Computer Science, pp. 438–454, Springer International Publishing, Cham, 2014.

[41] A. Abraham, R. Andriatsimandefitra, A. Brunelat, J.-F. Lalande, and V. Viet Triem Tong, “GroddDroid: A gorilla for triggering malicious behaviors,” in Proceedings of the 10th International Conference on Malicious and Unwanted Software, MALWARE 2015, pp. 119–127, USA, October 2015.

[42] Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI-guided test input generator for android,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering Companion, ICSE-C 2017, pp. 23–26, Argentina, May 2017.

[43] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.
