research openaccess …approach can automatically identify and exploit vulner-abilities. it is also...

16
Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 http://www.journal-bcs.com/content/20/1/4 RESEARCH Open Access An automated black box approach for web vulnerability identification and attack scenario generation Rim Akrout 1,2* , Eric Alata 1,2 , Mohamed Kaaniche 1,2 and Vincent Nicomette 1,2 Abstract Web applications have become increasingly vulnerable and exposed to malicious attacks that could affect essential properties of information systems such as confidentiality, integrity, or availability. To cope with these threats, it is necessary to develop efficient security protection mechanisms and assessment techniques (firewall, intrusion detection system, Web scanner, etc.). This paper presents a new methodology, based on Web page clustering techniques, that is aimed at identifying the vulnerabilities of a Web application following a black box analysis of the target application. Each identified vulnerability is actually exploited to ensure that it does not correspond to a false positive. The proposed approach can also highlight different potential attack scenarios including the exploitation of several successive vulnerabilities, taking into account explicitly the dependencies between these vulnerabilities. We have focused in particular on code injection vulnerabilities, such as SQL injections. The proposed methodology led to the development of a new Web vulnerability scanner that has been validated experimentally on several examples of vulnerable applications. Keywords: Web application; Vulnerabilities; Attacks; Evaluation; Web scanner 1 Background 1.1 Introduction Web application vulnerabilities have become, in the recent years, a major threat to computer systems security. This is illustrated in, e.g., the IBM X-force 2012 mid-year trend and risk report which shows that Web application vulner- abilities including SQL injections and Cross-site scripting occupy the highest positions in computer threats [1]. This situation can be explained by the increase in complexity of Web technologies, by the frequent evolution of these technologies, by the short development cycles of Web applications during which testing and validation activities are limited, and also, in some cases, by the lack of security skills and culture of the developers. In this paper, we propose a novel methodology that allows to automatically identify residual vulnerabilities of a Web application from the analysis of the targeted appli- cation, following a black box approach. The proposed *Correspondence: [email protected] 1 CNRS, LAAS, 7 avenue du Colonel Roche, Toulouse 31400, France 2 Univ de Toulouse, INSA, LAAS, Toulouse 31400, France approach can automatically identify and exploit vulner- abilities. It is also designed to highlight potential attack scenarios including the exploitation of several successive vulnerabilities that are not necessarily independent. The identification of these scenarios is based on the dynamic crawling of the application, resulting in the creation of a navigation graph that describes the different possibilities for a user to activate the application and associated vulner- abilities. This graph explicitly represents the dependen- cies between the vulnerabilities of the site and thereafter various attack scenarios. To validate our approach, we developed a new vulnerability scanner that has been val- idated on different vulnerable applications and compared experimentally with other existing vulnerability scanners. This paper extends the results presented in [2] and [3]. It is structured into seven sections. Section 1.2 discusses related work focusing on the analysis of the vulnerability detection algorithm used by several well- known freeware vulnerability scanners and presents some weaknesses of these algorithms. Section 2.1 presents our clustering algorithm for detecting Web application vul- nerabilities. Section 2.2 presents an overview of our © 2014 Akrout et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Upload: others

Post on 06-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4http://www.journal-bcs.com/content/20/1/4

RESEARCH Open Access

An automated black box approach for webvulnerability identification and attack scenariogenerationRim Akrout1,2*, Eric Alata1,2, Mohamed Kaaniche1,2 and Vincent Nicomette1,2

Abstract

Web applications have become increasingly vulnerable and exposed to malicious attacks that could affect essentialproperties of information systems such as confidentiality, integrity, or availability. To cope with these threats, it isnecessary to develop efficient security protection mechanisms and assessment techniques (firewall, intrusiondetection system, Web scanner, etc.). This paper presents a new methodology, based on Web page clusteringtechniques, that is aimed at identifying the vulnerabilities of a Web application following a black box analysis of thetarget application. Each identified vulnerability is actually exploited to ensure that it does not correspond to a falsepositive. The proposed approach can also highlight different potential attack scenarios including the exploitation ofseveral successive vulnerabilities, taking into account explicitly the dependencies between these vulnerabilities. Wehave focused in particular on code injection vulnerabilities, such as SQL injections. The proposed methodology led tothe development of a new Web vulnerability scanner that has been validated experimentally on several examples ofvulnerable applications.

Keywords: Web application; Vulnerabilities; Attacks; Evaluation; Web scanner

1 Background1.1 IntroductionWeb application vulnerabilities have become, in the recentyears, a major threat to computer systems security. Thisis illustrated in, e.g., the IBM X-force 2012 mid-year trendand risk report which shows that Web application vulner-abilities including SQL injections and Cross-site scriptingoccupy the highest positions in computer threats [1]. Thissituation can be explained by the increase in complexityof Web technologies, by the frequent evolution of thesetechnologies, by the short development cycles of Webapplications during which testing and validation activitiesare limited, and also, in some cases, by the lack of securityskills and culture of the developers.In this paper, we propose a novel methodology that

allows to automatically identify residual vulnerabilities ofa Web application from the analysis of the targeted appli-cation, following a black box approach. The proposed

*Correspondence: [email protected], LAAS, 7 avenue du Colonel Roche, Toulouse 31400, France2Univ de Toulouse, INSA, LAAS, Toulouse 31400, France

approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attackscenarios including the exploitation of several successivevulnerabilities that are not necessarily independent. Theidentification of these scenarios is based on the dynamiccrawling of the application, resulting in the creation of anavigation graph that describes the different possibilitiesfor a user to activate the application and associated vulner-abilities. This graph explicitly represents the dependen-cies between the vulnerabilities of the site and thereaftervarious attack scenarios. To validate our approach, wedeveloped a new vulnerability scanner that has been val-idated on different vulnerable applications and comparedexperimentally with other existing vulnerability scanners.This paper extends the results presented in [2] and

[3]. It is structured into seven sections. Section 1.2discusses related work focusing on the analysis of thevulnerability detection algorithm used by several well-known freeware vulnerability scanners and presents someweaknesses of these algorithms. Section 2.1 presents ourclustering algorithm for detecting Web application vul-nerabilities. Section 2.2 presents an overview of our

© 2014 Akrout et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproductionin any medium, provided the original work is properly cited.

Page 2: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 2 of 16http://www.journal-bcs.com/content/20/1/4

approach for constructing the graph. The details of thealgorithm are outlined in Section 2.3. We present inSection 3 the experiments performed in order to vali-date our approach and assess the efficiency of our scan-ner. Finally, Section 4 concludes this paper and discussesfuture work.

1.2 Background and related workMost frequent attacks on Web servers include SQL injec-tion attacks (for Web servers connected to an SQLdatabase) and code injection attacks (Flash, Javascript,etc., carried out through so-called Cross-site scripting orXSS attacks). These attacks generally correspond to theexploitation of the same kind of vulnerability related to thelack of sanitization of URL parameters or of HTML forminputs. In the following, we will focus on SQL injectionattacks, without loss of generality.To check whether SQL injection attacks are possible,

the vulnerability scanners send specially crafted requestsand analyze the responses returned by the server. A servermay respond with a rejection page or with an executionpage. A rejection page is returned by the server as a con-sequence of an error while processing the request. Anexecution page is returned by the server as a consequenceof a successful execution of the request. This page maycorrespond to the ‘normal’ scenario, i.e., in the case of alegitimate use of the Web site, but may also result from asuccessful exploitation of an injection attack. These latterrequests are those we consider in this paper. In particu-lar, our objective is to identify the vulnerabilities that canbe successfully exploited by the attackers. For instance,the successful exploitation of an SQL injection vulnerabil-ity in a login form may lead to bypass an authentication,and the successful exploitation of a File include vulner-ability on a search form may lead to display extra datalike /etc/passwd file content. In order to identify thevulnerabilities of a Web site, the scanners generally sendspecially crafted requests via the identified injection pointsallowing them to determine whether the input parameterssubmitted to the target system are sanitized or not. Aninjection point is a piece of a Web page into which a codecan be injected: a parameter in the URL or a field of a form,etc. Overall, the identification of potential vulnerabilitiesis generally based on the characterization of responses of aWeb server to crafted requests sent via the injection pointsand the ability to distinguish rejection pages and executionpages.The issue is thus the analysis of the responses to

determine if they actually correspond to rejection orexecution pages. Two main approaches can be identi-fied based on the analysis of related work on this topic.These approaches are discussed in the next section. Inthe following, we denote as false positive the fact thata vulnerability scanner detects a vulnerability in a Web

page while this vulnerability does not exist. A false neg-ative occurs when the vulnerability scanner does notdetect a vulnerability in a Web page while it actuallyexists.Two main approaches exist to detect the presence of a

vulnerability in a Web application. The first one relies onan error pattern matching algorithm and is presented inSection 1.2.1. The second one relies on the analysis of sim-ilarities between the pages returned by the server and ispresented in Section 1.2.2. Lastly, Section 1.2.3 proposes adiscussion regarding the limits of these approaches.

1.2.1 Error patternmatching approachTo identify SQL injections, the error pattern matchingapproach consists in sending specially crafted requeststo the application and looking for specific patterns inthe responses, e.g., database error messages. The basicidea is that the presence of an SQL error message inan HTML response page means that the correspondingrequest has not been sanitized by the application. There-fore, the fact that this request has been sent unchangedto the SQL server reveals the presence of a vulnerabil-ity. Scanners such as W3af (http://w3af.sourceforge.net)(sqli module), Wapiti (http://wapiti.sourceforge.net),and Secubat [4] adopt such approach. As an exam-ple, to detect injection vulnerabilities in authenticationforms, the sqli module of W3af sends three requestsbased on the SQL injection: d’z"0 (or d%2Cz%220encoded in ASCII). The three corresponding responsesare then analyzed. If they include SQL error messages(e.g., Mysql_ and supplied argument is not avalid Mysql), W3af informs the user that the applica-tion is vulnerable.The list of keywords adopted by Secubat for the error

pattern matching approach is presented in [4]. This list,derived by analyzing response pages of vulnerable Websites, is aimed at covering a wide range of error responsesand a variety of database servers. A confidence factor thatmeasures the level of confidence that the attacked Webform is vulnerable is also assigned to each keyword.

1.2.2 Similarity approachThis approach relies on three assumptions: (1) executionand rejection pages are different, (2) it is easy to buildrequests that generate rejection pages (by generating ran-dom or syntactically invalid requests for instance), and (3)it is difficult to build requests including injection attacksthat actually generate execution pages (i.e., requests thatsuccessfully exploit a vulnerability). The principle of theapproach consists in sending different crafted requests tothe Web application and comparing the similarity of thecorresponding responses using a textual distance, in orderto identify rejection pages among the response pages (i.e.,pages that highlight non-sanitized inputs).

Page 3: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 3 of 16http://www.journal-bcs.com/content/20/1/4

Let us consider as an example the approach adoptedby Skipfish (http://code.google.com/p/skipfish) fordetecting SQL injection vulnerabilities. Three requestsare sent to the Web application (A- ’", B- \’\", andC- \\’\\"). The responses are compared two by two.According to Skipfish, a vulnerability is present if bothresponses associated to B and C are not similar to theresponse associated to A. The similarity test uses a dis-tance based on the frequency of the words in the responsepages.The algorithm presented in [5] is also based on the

similarity approach. It differs from other implementa-tions in the additional use of the error pattern matchingapproach at a first step to guide the classification. Thesimilarity approach is used to address the uncertaintythat arises about the presence or absence of a vulner-ability when an injection does not generate an errormessage.

1.2.3 Discussion and contributionsThe assumption used by the error pattern matchingapproach is debatable. Indeed, error messages that areincluded in HTML response pages do not necessarilycome from the database server itself. A database relatederror message may also has been generated by the applica-tion. Moreover, even if the message is actually generatedby the database server, this is not sufficient to concludewhen receiving this message, that an SQL injection ispossible. Indeed, this message means that, for this par-ticular request, the inputs have not been sanitized. How-ever, it does not mean neither that the server does notsanitize all the SQL requests nor that the non-sanitizedinput can be chosen in order to successfully exploit anyvulnerability.Regarding the similarity approach, it is easier to gen-

erate random or syntactically invalid requests to obtainrejection pages than to send valid injection attacks toobtain execution pages. Since the main assumption isbased on the observation that the content of a rejec-tion page is generally different from the content of anexecution page, it is important to ensure a wide cover-age of the different types of rejection pages that couldbe generated by the application. This can be achievedby generating a large number of requests aimed atactivating different types of error pages. However, theexisting implementations of this approach generate toofew requests. For instance, Skipfish uses only threerequests.Also, as in any classification problem, the choice of the

distance is very important. The one used in Skipfishdoes not take into account the order of the words in a text.However, this order generally defines the semantics of thepage. Thus, it is important to take it into account to assessthe similarity, as performed in [5] with a text similarity

distance. As an example, the two following pages use thesame words in a different order, but they have differentsemantics:

- You are authenticated, you have notentered a wrong login.- You are not authenticated, you haveentered a wrong login.

The algorithm presented in Section 2.1 builds on someof the concepts of the similarity approach and addressesthe issues raised in the discussion above. In particular, itallows: (1) the generation of a large number of requeststhat can be tuned by the user, to activate different rejectionpages returned by the application, (2) the automatic gen-eration of various types of specially crafted requests usinga grammar, and (3) the automatic clustering of the cor-responding HTML pages returned by the Web server todistinguish between rejection pages and execution pagesand automatically identify successful injections. This algo-rithm is also designed to identify and successfully exploitvarious types of vulnerabilities. Besides SQL injections,the proposed approach can address XPATH, OS Com-manding, and File include vulnerabilities. While our workshares some of the ideas and objectives presented in [5],we follow different approaches. For example, the cluster-ing approach which is the core of our algorithm is not usedin [5]. They use a different technique combining errorpattern matching and the similarity analysis of applica-tion response pages. Also, their approach is firstly basedon error pattern matching and hence shares the sameconcerns raised above.

2 Methods2.1 Html page clustering for Web vulnerability detectionThe approach presented in this section seeks to achievethe automated detection of different types of Web vulner-abilities, corresponding to SQL injection attacks, OsCom-manding, File Include, and XPatha. It is based on theautomatic classification of responses returned by the Webserver using data clustering techniques and identifiesqueries that are able to successfully exploit vulnerabilities.

2.1.1 PrinciplesIn the following, let us take the example of an SQLinjection in an authentication form. Our goal is to iden-tify, among several SQL injections, those which allow anattacker to bypass the authentication. The main challengeis the automation of this process. In the following, wepresent a method which intends to reduce false negativesand false positives in comparison with existing solutionspresented in the previous section. Our approach relies onthe following assumptions: (a) the content of an execu-tion page is significantly different from the content of a

Page 4: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 4 of 16http://www.journal-bcs.com/content/20/1/4

rejection page, (b) two rejection pages may be differentfrom each other, and (c) two execution pages may also bedifferent from each other even for the same class of inputs.In the login page that we consider, responses to valid

requests include welcome messages and invalid inputerror messages. Responses to invalid requests includePHP or SQL error messages. The essential point is theexistence of differences between execution pages andrejection pages, and, more precisely, between welcomemessages and invalid input error messages, and betweenPHP and SQL error messages. Our approach focuseson the analysis of these differences. The objective is toidentify, among several responses, those which corre-spond to execution pages generated through syntacticallyvalid requests. In other words, we learn the behavior ofthe application based on the clustering of Web serverresponse pages that are similar enough.The entry point of our algorithm is a set of initial

requests that have a common property: it is easy to clas-sify the associated responses (either as execution page orrejection page). Obviously, it is easier to generate requestswhich lead to rejection pages or execution pages corre-sponding to invalid input error messages than requestswhich lead to execution pages associated to welcomemes-sages. To generate the former, one can, for example, userandom usernames and passwords to fill the authentica-tion form. One can also easily generate requests whichlead to rejection pages associated to PHP or SQL errormessages.In our proposal, we distinguish three sets of requests:

Rr is the set of requests generated from words randomlychosen from the list [a-zA-Z0-9]+. They are very likelyto generate rejection pages or execution pagesassociated to invalid input error messages. Forexample:

http://address/directory/page.php?login=ABCDEF&pass=ABCDEF

Rii is the set of syntactically incorrect SQL injectionrequests that are inappropriate for the given injectionpoint. They are constructed to produce a syntax errorin the SQL query sent to the SQL server by the HTTPserver. Usually, these requests are composed of anodd number of quotes. They are also very likely togenerate rejection pages. For example:

http://address/directory/page.php?login=’’’&pass=’’’

Rvi is the set of syntactically correct SQL injectionrequests that are constructed to generate executionpages in the presence of vulnerabilities, but they

might as well generate rejection pages in the absenceof vulnerabilities. For example:

http://address/directory/page.php?login=test&pass=’ or ’1’=’1

The main issue is to determine whether the responseis a rejection page or an execution page. To do so,these responses are compared to those associated tosets Rr and Rii.

Let us note Sr, Sii, and Svi the responses associated to Rr,Rii, and Rvi, respectively. The principle of our algorithm isthen as follows: Rvi requests whose responses are not sim-ilar to any of the responses from Sii and Sr are consideredvalid SQL injections. To assess the similarity between thepages returned by different requests, we use a classifica-tion technique based on the distance presented in the nextsubsection.

2.1.2 DistanceTo analyze the similarity between two HTML pages, weneed a distance for assessing the difference between twostrings. As discussed in Section 1.2.3, the order of wordsin a text could be relevant. In fact, the same words in adifferent order can completely change the semantics ofthe response. Thus, to compute the distance between twopages, we use a normalized version of the Levenshtein dis-tance. Let a and b be two responses of length n and m.We also denote ai and bj the ith character in a and thejth character in b. The distance for clustering is defined inEquation 1.

diff(ai, bj) =

⎧⎪⎪⎨⎪⎪⎩

n − i + m − j i = n or j = mdiff(ai+1, bj+1) ai = bj, i < n, j < m1 + min(diff(ai+1, bj),diff(ai, bj+1))

ai �= bj, i < n, j < m(1)

d(a, b) = diff(a1, b1)n + m

.

Generally, clustering techniques are based on two differ-ent strategies. The first is driven by the number of clusters,if it is known a priori. It starts by considering a singlecluster containing all requests and divide it progressively,relying on distances, until the desired number of clus-ters. The second technique is used in case the numberof clusters is not known a priori. It consists in groupingin a same cluster the requests whose pairwise distance isbelow a threshold. In our approach, the number of clustersis not determined a priori, and so we use the second strat-egy, called hierarchical clustering [11], which requires thechoice of a threshold.

Page 5: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 5 of 16http://www.journal-bcs.com/content/20/1/4

The threshold for grouping queries can vary from aninjection point to another. Indeed, it depends on thesize of the responses and the amount of data that dif-fer between two responses for the same type of requests.Also, this threshold must be adapted to each Web appli-cation. In our approach, the threshold is defined empir-ically by the shortest distance between: i) the maximumdistance between the responses belonging to Sr and ii)the maximum distance between the responses belongingto Sii.

2.1.3 Requests generatorOne important aspect of the proposed algorithm is itsability to identify the presence of a vulnerability in aninjection point based on multiple responses generatedfrom this injection point. To improve the accuracy of theresults, we need to generate a large number of responses,allowing to achieve a high coverage of the responsedomain. Note that other approaches are often based ona small number of responses (for example, three forSkipfish).One possible way to generate different types of

responses (and associated queries) is to record in a staticfile, queries obtained from security experts similar to, e.g.,to SQL sheets [12]). A more flexible approach would beto define a grammar to automate this process. Such anapproach can be compared to some extent to fuzzing tech-niques [13]. In the following, we outline the grammar thatwe have defined to automate the generation of Rr, Rii, andRvi requests.On theWeb server-side, most of the time, an SQL query

is created by concatenating SQL terms and parameterssent by the client. For example, the following PHP scriptdeals with the authentication of a user given the usernameand password sent by the client:

$query = "SELECT id FROM users WHEREname=’$name’ AND pass=’$pass’";

Given a semantically valid (user name, password) pair,the created SQL query is considered syntactically valid.Also, the created query associated to a (user name, pass-word) pair generated based on a dictionary attack isconsidered syntactically valid, even if the authenticationfailed. From this observation, an SQL injection is definedas a string that leads to a syntactically valid SQL querywhile changing the semantics of this generated SQL query.In many situations, an SQL injection is relevant if it leadsto a tautology in the WHERE clause of the forged SQLquery. From the previous example, an example of such anSQL injection is

name="’ OR 1=1 OR string=’"

Therefore, the grammar of SQL injections is just a partof the grammar of SQL queries. The advantage of a gram-mar is that it enables to easily generate as many SQLinjections as needed. We can apply the same reasoning tothe set of randomly generated words (Rr), and to the set ofSQL injections that are inappropriate for the given injec-tion point (Rii). A tiny grammar for the set Rvi (i.e., theset of SQL injection requests that are constructed in orderto generate execution pages) can be expressed using BNFbnotation as follows:

INJECTION := WORD’ POR TAUTAG [’ POR TAUTAG]

| WORD" POR TAUTAG [" POR TAUTAG]

POR := or / ) POR (

TAUTAG := hex(’A’)=’41

| ’1’=’1

| ’[f-m]’ between ’[a-e]’ and ’[n-z]’

WORD := [0-9a-zA-A]*

...

This grammar generates different variations of SQLinjection attacks. It consists in inserting a tautology insidean expression evaluated by a WHERE clause, in such away that this expression becomes a tautology itself. Toinject the tautology, the initial expression is splitted intoseveral pieces. The TAUTAG rules are examples of suchtautologies and the INJECTION rules express how thetautology is included in an initial expression, (1) by clos-ing the expression with delimiter characters (’, ", or )),(2) by inserting the tautology (through a disjunction), and(3) by opening a new expression using the same delimitercharacters.

2.1.4 Extension to other vulnerability classesThe approach proposed for SQL injection vulnerabilitydetection can be generalized. In fact, many attacks havethe same behavior: the client sends a string which changesthe semantics of the forged query. Depending on the con-text, the forged query is sent to a specific componentin Web server-side such as the XPath engine, operatingsystem, etc. The names of the corresponding injectionattacks are derived from the name of this component lead-ing to XPATH injection, Os Commanding, etc. Thus, theclustering algorithm that we have illustrated using theexample of SQL injection in the previous section can bealso used for these types of vulnerabilities.XPATH vulnerabilities consist, like SQL vulnerabilities,

in submitting not sanitized input in an HTML form orURL parameters. The difference is that the vulnerabilitycan be exploited to execute XPath queries and not SQLqueries.In the case of OS Commanding vulnerability, the string

sent by the client is used to create a command executedby the operating system. This command is executed under

Page 6: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 6 of 16http://www.journal-bcs.com/content/20/1/4

the identity of the process corresponding to the Webserver. Exploitation of this vulnerability allows an attackerto execute arbitrary commands on the system and can alsoallow read and/or write access to certain files.As explained previously, the algorithm uses three sets

of queries: Ra (random request), Rii (syntactically invalidinjections), and Rvi (syntactically valid injections). Theadaptation of the algorithm to other types of vulnerabili-ties requires the definition of these three sets for each typeof vulnerability. Once these sets are established, the algo-rithm proceeds in the same way: send those requests, andstore the corresponding results obtained by the clusteringdistance presented in Section 2.1.2. Further details can befound in [9].

2.2 Attack scenarios with multiple vulnerabilitiesThe previous section described a methodology allow-ing the detection of a single vulnerability by automaticHTML page clustering. In this section, we present a globalapproach that aims at exhibiting attack scenarios, result-ing from the exploitation of several vulnerabilities, someof which may be causally dependent on each other.The objective of this approach is to establish these

scenarios from the automatic construction of the graphrepresenting all possible navigations on a site taking intoaccount its vulnerabilities. The approach is composed oftwo steps: (1) a first step aimed at identifying the differ-ent possibilities offered to a client to navigate through theweb site; and (2) a second step aimed at the identificationand exploitation of vulnerabilities in the web pages andinjection points identified at step 1, using the methodol-ogy presented in Section 2.1. The iteration of steps 1 and2 allows the elaboration of attack scenarios including theexploitation of multiple vulnerabilities.We adopt a black box approach since no details of the

source code implementation of the Web site is required.The public address of the Web site is used as a startingpoint for dynamic discovery of the site and its vulnerabil-ities. We begin this section by introducing some defini-tions that are useful for the presentation of our approach.Thereafter, we describe the principles of our approach.

2.2.1 DefinitionsOur approach aims at automatically building a graph thatrepresents the set of all possible navigations on aWeb site,including those that result from exploitations of the Website vulnerabilities. Let us call a navigation a sequenceof requests (a request consisting in the activation of anHTML link). A sequence of requests actually sent by aclient is called a trace. The set of all the possible navi-gations by a client during a visit to the Web site can berepresented by an automata called a navigation graph.A navigation state of a client (i.e., a browser) is com-

posed of (1) the HTML page currently displayed by the

browser and (2) the current values of the cookiesc in thebrowser. A request sent by a browser provokes a change ofthe current navigation state.Each node of the navigation graph corresponds to a nav-

igation state. An edge between two navigation states existsif a request whose execution leads to the transition froman initial state to the another one can be sent by the client.An edge may correspond to a ‘normal’ request or to arequest that exploits a vulnerability of theWeb site. A vul-nerability graph is a particular case of a navigation graphthat includes edges corresponding to the exploitation ofvulnerabilities.It is important to note that a navigation graph is differ-

ent from a traditional graph of HTML pages describingthe structure of a Web site. Each node of an HTML pagegraph generally corresponds to an HTML page of thesite and an edge between two nodes identifies a link thatenables to access the second page from the first one. Thedifference is mainly due to the fact that a navigation statedoes not only depend on the currently accessed HTMLpage. Indeed, a client can access an HTML page sev-eral times, while being in different navigation states. Forinstance, let us consider an e-business Web site. It is pos-sible to browse the purchasing page after having orderedsome products or not. When browsing this page, in thefirst case, the client is authorized to pay for the productsand in the second case, an error message is returned (itmakes no sense to purchase for an empty bag). However,in these two situations, the HTML page accessed is thesame. The difference is due to the content of the cookiesthat indicate that products have been ordered or not, i.e.,that reflect the current navigation state.

2.2.2 PrinciplesThe construction of the navigation graph is progressiveand dynamic by identifying different navigations and vul-nerabilities. In addition, a vulnerability exploitation opensnew possibilities for navigation. The construction is thusdone iteratively. Our approach is composed of a navi-gation step called crawling to identify the various pos-sibilities to navigate through the site and the associatednavigation states and a step for identifying and exploitingvulnerabilities as presented in Section 2.1.Figure 1 presents a high-level view of the proposed

approach. The crawling starts from the initial URL (whichcorresponds mostly to the main page of the Web appli-cation), after by deleting the cookies at the client side.This initialization is important for having different inde-pendent navigations. From this URL, combinatorial sitecrawling identifies the traces navigation list. Our approachis based on an exhaustive search to obtain all possiblenavigations of the site. The site is browsed starting withthe initial query and storing the requests sent to the site.The choice of the request to send is done by analyzing

Page 7: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 7 of 16http://www.journal-bcs.com/content/20/1/4

Figure 1 Injection point extraction algorithm and vulnerabilitiesidentification.

the content of the page displayed. If this HTML page con-tains several links, one of these links is chosen to buildthe following query and the other links are stored for lateranalysis. As the crawling of the Web site may be infinite,a threshold indicating the maximum exploration depth ofthe site has to be specified. The corresponding value is aninput parameter of the algorithm. If the maximum depthis reached or if no requests can be sent from the cur-rent reached state, i.e., the current page no longer containsHTML links, the requests sequence stored from the initialstate corresponds to a site navigation. Then, the processrestarts from the beginning trying new navigations, basedon the stored choices. At the end, the set of sequences ofrequests representing all site navigations is obtained.On some complex Web sites, the number of requests

sent while crawling may be huge. The potential numberof injection points is even more important. Hence, it ismore appropriate to analyze vulnerabilities on a compactrepresentation rather than directly on the individual nav-igations. One way to obtain a compact representation isto minimize the navigation graph built using the set ofrequest sequences representing all site navigations. Thefirst version of the navigation graph does not contain anyedge corresponding to vulnerabilities.The construction of the minimal graph from the set

of navigations and the associated sequence of requests issimilar to a grammatical inference problem whose objec-tive is to find a minimal automaton that represents alanguage from symbol sequences of this language (so-called words). In this analogy, the automaton correspondsto the navigation graph and the symbols correspond torequests. As a language may include an infinite numberof words, the algorithm must be able to run based on a

subset of the words of a language. Two categories of gram-matical inference algorithms exist: (1) those that infer thelanguage only from sequences of words that belong to thelanguage and (2) those that consider all the sequences ofwords. More details can be found in [14]. The RPNI algo-rithm (Regular PositiveNegative Inference) [15] we chosebelongs to the second category. This widely used algo-rithm presents a polynomial time-based complexity and isquite simple to implement.At the end of the crawling step, we note that the only

possibility to enrich the graph is to identify vulnerabili-ties that if successfully exploited, could lead to add newedges and nodes to the navigation graph. Such vulner-abilities can be identified using the approach detailedin Section 2.1. This algorithm allows the identificationand effective exploitation of existing vulnerabilities. Thisexploitation leads to the discovery of new pages that, inturn, may contain new injection points that were not avail-able at the first step. Therefore, new possibilities for theexploitation of vulnerabilities are available. Subsequently,the approach is re-executed iteratively by including newpages, which leads to the construction of several naviga-tion graphs until satisfying the stopping criterion, which isspecified by the maximum navigation depth. The graphsincluding edges corresponding to the exploitation of avulnerability are so-called vulnerability graphs.Figure 2 pictures this iterative approach. Red edges iden-

tify vulnerabilities whose exploitation reveals new statesand edges of the navigation graph that were initially inac-cessible.This algorithm has been implemented in a software tool

using the Python language, which greatly facilitates thehandling of HTTP concepts (cookies, settings, etc.). Thistool is interfaced with the statistical analysis software R(http://www.r-project.org/) which integrates a set of clus-tering programs that we detailed in Section 2.1. They havebeen used to develop our classification algorithm. Thistool is called Wasapy, which stands for Web ApplicationSecurity Assessment in Python.

Figure 2 Iterative construction of the navigation graphs.

Page 8: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 8 of 16http://www.journal-bcs.com/content/20/1/4

2.2.3 ExampleIn order to illustrate our approach, we developed an e-commerce Web site for buying books, using the PHPlanguage and a MySql database. This site is a simple proofof concept but it uses technologies and a structure simi-lar to ‘real’ Web sites. Figure 3 presents the HTML pagegraph describing the structure of the Web site. A page isrepresented by an icon. An edge between two pages cor-responds to the existence of an HTML link in the sourcepage leading to the second page. Let us note that a partic-ular reflexive link exists for the display.php page. Thispage enables to list the available books and includes a fil-tering function in a particular form field. A user may thenenter a regular expression in this field and submit it to thesite, in order to update the list of books.This site includes three vulnerabilities. The pages

including these vulnerabilities are identified by a star onFigure 3. The first vulnerability is associated to the pagelogin.php. The exploitation of this vulnerability allowsan attacker to bypass the authentication thanks to an SQLinjection. The second vulnerability is associated to thepage display.php. It allows an attacker to downloadthe content of the database. The last vulnerability asso-ciated to the page check.php allows an attacker to paythe products he ordered without providing any credit cardnumber. This vulnerability cannot be exploited unlesssome products have been added to the virtual shoppingcart.First, we consider a non-malicious user, who does not

have any account on the site. The only actions that the usercan do are

• access the index.html page• fill in the form with the authentication information in

the login.php page• get information from the about.html page.

The list of navigations that can result from these actionsare as follows:

• P1: index.html• P2: index.html → about.html• P3: index.html → login.php• P4: index.html → login.php →

index.html.

All these navigations can be synthesized by the naviga-tion graph in Figure 4. In this graph, the edges correspondto links of the site (html page or php, etc.), the nodes cor-respond to navigation states. The set of edges leaving anode is the set of links accessible from the correspondingnavigation state. This set is independent from previouslyactivated links used to achieve this navigation state.This navigation graph is very simple because the possi-

ble actions without valid (login/password) are limited.However, if we consider an attacker who is able to exploitsome vulnerabilities, then he is able to perform moreactions than an unregistered benign user. Therefore, theassociated navigation graph is a richer version of the graphin Figure 4, new edges and new nodes may appear.This is precisely what the next step serves for. The edges

of the graph 4 are tested to identify vulnerabilitiesd. Theexploitation of a new vulnerability may change the naviga-tion state. Thus, it can lead to the insertion of a new nodein the graph. At this stage, the only available vulnerabilitycorresponds to an SQL injection for the authentication,i.e., in the login.php page. The result is depicted inFigure 5.During the second iteration, we identify pages that can

be accessible after the exploitation of the vulnerabilityidentified during the previous iteration. To reach thesepages, it is necessary to cross the edges index.html andlogin.php. The set of traces executed for the second

Figure 3 Structure of the Web site.

Page 9: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 9 of 16http://www.journal-bcs.com/content/20/1/4

Figure 4 Navigation graph of a non-authenticated user.

iteration includes 65 traces. These traces reach the fol-lowing files, display.php, add.php, delete.php,buy.php, or check.php. Then, the vulnerability iden-tification phase is re-executed once for each new edgegenerated during of the second iteration.We iterate in thisway until we get the final graph that covers the entire siteas shown in Figure 6. In the case of this example, the algo-rithm stops after six iterations considering a maximumdepth of navigation set to 7. This means that there are nomore vulnerabilities discovered during the sixth iteration.To automate the process, we have implemented algo-

rithms corresponding to the two main steps of ourapproach: the crawling and the vulnerabilities discovery.These algorithms are presented in next section.

2.3 AlgorithmsThis section presents the algorithms that we have devel-oped to implement the approach described in the previoussection.The search_vulns function in Algorithm 1 takes a nav-

igation as input parameter. The latest request of thisnavigation is analyzed to identify vulnerabilities, consider-ing different vulnerability classes (SQL injections, XPATHinjections, OS commanding, etc.). This function returnsa list of navigations and, for each of them, one of theidentified vulnerabilities. Each new navigation includesone more navigation state that results from the exploita-tion of one vulnerability. These new navigations are thenanalyzed by the crwl function.

Algorithm 1 Vulnerability identification algorithm -search_vulnsRequire: path = navigationEnsure: vulns = set of navigations1: vulns ← ∅2: for class ∈ vuln_classes do3: vulns ← vulns ∪ wasapy(path, class)4: end for5: return vulns

The crwl function is presented in Algorithm 2. Thisfunction is used to continue the crawling of the Website starting from a specified navigation provided as aninput parameter. The remain variable holds the set of

Figure 5 Vulnerability graph of first iteration.

navigations that have just been discovered but not beencrawled yet. This algorithm ends when there are no morenavigations to crawl, which means that the remain setis empty, or when the maximum exploration depth isreached. Before starting the crawling of any navigation,the cookies are removed. Then, each request of the nav-igation is executed step by step, from the first one to thelast one. The content of the response associated to this lastrequest is analyzed in order to identify new HTML links.These are provided by the get_response function (line 9 ofthe Algorithm 2). The analysis is only made for this lastrequest because all the requests of the navigation exceptthe last one have already been analyzed during previousiterations of the algorithm. Each HTML link identified inthis response is used to build a new navigation of the site(using the concatenation operator ⊕). This operation isrepeated iteratively until all the HTML links have beendiscovered or until the maximum exploration depth isreached. The set of the new navigations discovered by crwlis saved in the traces variable.

Algorithm 2 Crawling algorithm - crwlRequire: path, dmEnsure: new_paths = traces1: remain ← {path}2: traces ← ∅3: d ← |path|4: while remain �= ∅ ∧ d ≤ dm do5: next ← ∅6: for trace ∈ remain do7: free_cookies()8: for i ∈ 1..(|trace| − 1) do9: get_response(tracei)

10: end for11: links ← get_response(trace|trace|)12: for link ∈ links do13: next ← next ∪ {trace ⊕ link}14: traces ← traces ∪ {trace ⊕ link}15: end for16: end for17: remain ← next18: d ← d + 119: end while20: return traces

Page 10: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 10 of 16http://www.journal-bcs.com/content/20/1/4

Figure 6 Final vulnerability graph.

The main function of Algorithm 3 executes the twoprevious functions. During the first iteration, the crwlfunction discovers the Web site considering only ‘normal’requests. Then, thanks to the RPNI algorithm, a graphis built from the navigations obtained. Each state of thegraph is then analyzed in order to identify vulnerabilities(search_vulns function). At the end of the first iteration,all the navigations including at most one vulnerability areidentified. The exploitation of these vulnerabilities mayenable to discover new parts of the Web site. Then, thenext iteration begins. The beginning of each iteration iof the main function corresponds to the exploration of asub-part of the site, more precisely the part that has beendiscovered thanks to the exploitation of the (i − 1)th vul-nerability of the navigation. At the end of the crawlingof each iteration i, the vulnerability identification step iscarried out, considering navigations including i − 1 vul-nerabilities and ending with a normal request. At the endof the ith iteration, all the navigations including at mosti vulnerabilities are obtained. Finally, the main algorithmstops when the maximum exploration depth is reached orwhen no further vulnerability can be identified.

Algorithm 3 Main algorithmRequire: urlsEnsure: (G = (S,N ,R), vulns)1: G ← RPNI(urls)2: ntraces ← urls3: traces ← urls4: vulns ← ∅5: while |ntraces| �= 0 do6: for nt ∈ ntraces do7: if |nt| < dm then8: traces ← traces ∪ crwl(nt, dm)

9: end if10: end for11: H ← RPNI(traces)12: new_nodes ← H .N \ G.N13: ntraces ← ∅14: for nn ∈ new_nodes do15: ptnn ← shortest_path(H .R,H .S, nn)

16: if |ptnn| < dm then17: nptv ← search_vulns(ptnn)

18: for np ∈ nptv do19: new_vuln ← np|np|20: vulns ← vulns ∪ {new_vuln}21: end for22: ntraces ← ntraces ∪ nptv23: end if24: end for25: G ← H26: end while27: return (G, vulns)

Page 11: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 11 of 16http://www.journal-bcs.com/content/20/1/4

3 Results and discussionThis section presents the experiments that we have car-ried out to validate and assess our algorithm. We haveconsidered several applications using Wasapy and thethree open-source vulnerability scanners discussed in thispaper: W3af 1.1, Skipfish 1.9.6b, and Wapiti2.2.1. The experiments are run on a Gnu/Linux (2.6kernel) host running several virtual machines thanksto the VirtualBox utility. All the virtual machinesrun the Apache Web server 1.3.37 or 2.2.8with PHP 4.0.0 or 5.0.0 and MySQL databaseserver 5.This section is organized as follows. Section 3.1 presents

conventions and abbreviations. Section 3.2 presents thefirst experiments carried out in order to assess ourapproach. Five Web applications including SQL vulnera-bilities are used. We purposely injected these vulnerabili-ties to calibrate Wasapy. Section 3.3 presents the secondset of experiments with vulnerable off-the-shelf applica-tions, without any modification of these applications. Thissubsection compares Wasapy to other vulnerability scan-ners on non-purposely injected vulnerabilities. For someof these applications, evaluation reports based on com-mercial scanners are available in [16]. We reported someof these results in order to compare these scanners withWasapy. Section 3.4 presents the summary of all theseexperiments.

3.1 NotationsThe results of our experiments are presented in differenttables. We use the following conventions and abbrevia-tions:

✓ The vulnerability has been detected by thecorresponding scanner

✗ The vulnerability has not been detected by thecorresponding scanner

− The injection point is not tested by the scannerSQLi stands for SQL InjectionXPa stands for XPath InjectionOsC stands for OS CommandingFIn stands for File Include

CVE reports the CVE reference of the consideredvulnerability if it exists

NR The vulnerability does not have a CVE

A vulnerability is considered as detected if the scan-ner actually sends an alert for this vulnerability, whateverthe method used to detect it. A vulnerability is consid-ered as not detected if the scanner actually tested thecorresponding injection point without sending any alert.A vulnerability is considered as ignored by the scannerif the corresponding injection point is not tested by thescanner.

3.2 Experiments with modified applicationsThe five applications chosen for this first set of experi-ments are described hereafter:

• phpBB-3: This application (http://www.phpbb.com)is a forum manager written in PHP and using aMySQL database. We modified the authenticationform of the application by inserting a vulnerability(v1) that can be exploited by an SQL injection. Thisvulnerability allows an attacker to reach the restrictedadministration area of the forum.

• SecurePage: This application (http://www.01php.com/fiche-scripts-126.html) written in PHP, isdesigned to protect the access of a Web site throughauthentication. Valid pairs for this authentication arestored in a MySQL database. A vulnerability (v2)similar to v1 was purposely injected.

• HardwareStore: We developed this application, inPHP 5.0. This application allows a user to inventorycomputer equipments in a database and tointerrogate this database. The user needs first to beauthenticated. Five SQL vulnerabilities were injectedin this application. v3 allows SQL injection in asearch form and allows an attacker to access thewhole database. v4 allows SQL injection in theauthentication form. v5 allows SQL injection in aparameter of an HTML request. For this vulnerableHTML page, we have purposely disabled the errormessage reporting, in order to compare the behaviorof W3af and Wapiti in such a situation with thebehavior of Wasapy. Vulnerability v6 is similar tov4 but it is used in a different context: the errormessage reporting is deactivated. Vulnerability v7can only be exploited after the successful exploitationof v4. Indeed, this vulnerability is included in a pagethat can only be accessed after successfulauthentication on the application or after a successfulbypass of the authentication mechanism (throughexploitation of v4). XPATH, OS Commanding, andFile Include vulnerabilities were also injected in thisapplication. Vulnerability v10, in the authenticationpage, allows an attacker to bypass the authenticationthrough a XPATH injection. v11 is an OsCommanding vulnerability that can be exploited onlyafter v4 is successfully exploited. Indeed, thisvulnerability is included in a page that is onlyaccessible after authentication (or bypass of theauthentication through successful exploitation ofv4). Vulnerability v12 is a File Include vulnerability,it is inserted in the same page as v11 and can beexploited in the same conditions as v11.

• Insecure: This application was developed in Rubyon Rails in the context of the Dali projecte. It is ane-commerce site, including user sessions through

Page 12: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 12 of 16http://www.journal-bcs.com/content/20/1/4

Table 1 Summary of a list of sets

Vulnerabilities Scanners

Type Application ID Skipfish W3af Wapiti Wasapy

phpBB3 v1 ✗ ✗ ✓ ✓

SecurePages v2 ✗ ✗ ✓ ✓

v3 ✓ ✓ ✓ ✓

v4 ✓ ✓ ✗ ✓

SQLi HardwareStore v5 ✓ ✗ ✗ ✓

v6 ✗ ✗ ✗ ✓

v7 − − − ✓

Insecure v8 ✓ ✓ ✗ ✓

DVWA v9 ✓ ✓ − ✓

XPa HardwareStore v10 ✗ ✗ ✗ ✓

OsC HardwareStore v11 − − − ✓

FIn HardwareStore v12 − − − ✓

Number of detections 5 4 3 12

virtual shopping carts. A vulnerability (v8), whichallows an attacker to inject SQL code, was purposelyincluded in the authentication form of the application.This vulnerability, functionally equivalent to v4, isanyway different because Insecure is implementedin Ruby and the error reporting messages differ fromthe Apache error reporting messages.

• Damn Vulnerable Web Application (DVWA): Thisapplication (http://www.dvwa.co.uk) is written inPHP and uses MySQL server. A vulnerability v9,similar to v3, was introduced in the application.

Table 1 shows that the performances of W3af andWapiti are similar in average, even if the vulnerabilitiesdetected are not the same (Wapiti successfully detectsv1 and v2 whereas W3af does not detect them; on theother hand, W3af detects v4 and v8 whereas Wapitidoes not detect them). This result is consistent with thefact that both scanners use a patternmatching-based algo-rithm. The observed variations are related to the genera-tion of different requests by these tools. Wasapy allowsus to detect all these vulnerabilities. This confirms that

Table 2 Vulnerability detection results for Cyphorapplication

Vulnerability

Type CVE Location Skipfish W3af Wapiti Wasapy

NR search.php ✓ ✓ ✓ ✓

SQLi 2005-3236 lostpwd.php ✓ ✓ ✓ ✓

2005-3236 newmsg.php ✓ ✓ ✓ ✓

2005-3575 show.php ✓ ✓ ✓ ✓

False positive 1 0 0 0

the vulnerability detection clustering algorithm presents abetter coverage than the pattern matching algorithm forthese vulnerability classes.Regarding vulnerabilities v1 and v2, we manu-

ally checked the injections performed by Skipfish(’", \’\" and \\’\\") and stored the correspond-ing responses (respectively A, B et C). As discussed inSection 1.2, Skipfish considers that A and C must bedifferent so that a vulnerability is present. Unfortunately,for these two injection points, this is not the case. Theresponses correspond to SQL error messages that are verysimilar.Regarding vulnerabilities v5 and v6, they are included

in PHP pages for which we purposely deactivated theerror reporting message featuref in the configuration fileof PHP5. In this particular case, none of the three scan-ners (Skipfish, W3af, and Wapiti) is able to detectvulnerabilities.Regarding v7, Wasapy is the only scanner that is able

to detect it. Moreover, it is the only scanner that isable to test the corresponding injection point. Indeed,

Table 3 Vulnerability detection results for Seagullapplication

Vulnerability

Type CVE Location Skipfish W3af Wapiti Wasapy

SQLi 2010-3212 index.php ✗ ✗ ✗ ✓

2010-3209 container.php ✗ ✗ ✗ ✗

FIn 2010-3209 QuickForm.php ✗ ✗ ✗ ✗

2010-3209 NestedSet.php ✗ ✗ ✗ ✗

2010-3209 Output.php ✗ ✗ ✗ ✗

False positive 0 0 0 0

Page 13: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 13 of 16http://www.journal-bcs.com/content/20/1/4

Table 4 Vulnerability detection results for Fttss application

Vulnerability

Type CVE Location Skipfish W3af Wapiti Wasapy AppScan WebInspect Acunetix

OsC NR index.php ✗ ✓ ✗ ✓ ✗ ✗ ✗

False positive 0 0 0 0 0 0 0

this injection point is included in an HTML page thatcan only be accessed after a successful authenticationor after the successful exploitation of vulnerability v4.As Wasapy is the only scanner able to actually exploitv4, it can automatically access the page including vul-nerability v7. For the other scanners, it is necessary tomanually perform the exploitation of v4 so that it is pos-sible to access the page including v7. Vulnerabilities v11and v12 were identified only by our tool for the samereasons: they remain masked until the authentication isbypassed.The purpose of these initial tests was the calibration

of Wasapy. The calibration of our tool consists in defin-ing empirically the number of requests to generate foreach group and injection point. We set this number to 30for all the applications tested (i.e., 90 requests per injec-tion point). We have observed that a higher number doesnot provide significantly higher accuracy, while a lowernumber generates false negatives.These initial tests also allowed us to check the gram-

mars that we presented in the previous section. Of course,the corresponding vulnerabilities have been identifiedfor this purpose. So, these results are not aimed tobe used to make an absolute comparison between thescanners. A more representative comparative assessmentof the different tools should be based on vulnerableapplications in which vulnerabilities have not been delib-erately injected by ourselves. These experiments are pre-sented in the next subsection.

3.3 Experiments with non-modified vulnerableapplications

This second set of experiments allowed us to have amore precise idea of the coverage of our detection

algorithm. For that purpose, we compared it to thedetection algorithms of Skipfish, W3af, and Wapition non-purposely modified vulnerable Web applica-tions. For some of these applications, we could compareour algorithm with some commercial vulnerability scan-ners, considering the results available in [16]. In thisdocument, the author presents the vulnerability detec-tion results obtained with three commercial scanners:WebInspect from HP, AppScan from IBM, and WebVulnerability Scanner from Acunetix. Theseresults provide only some preliminary indications to ana-lyze the performance of our tool on the same set ofapplications and are not meant to be used for a validationpurpose.For our experiments, we selected five Web applications

(most of them tested in [16]), known to include vulnera-bilities. These applications cover different functionalitiesand execution contexts. We installed these applicationsand performed vulnerability detection tests without mod-ifying them.

• Cyphor (http://Webscripts.softpedia.com/script/Snippets/Cyphor-27985.html) is a configurationWebforum, which uses PHP 4.0.0 session capabilitiesto authenticate users and a MySQL database.

• Seagull (http://seagullproject.org/) is an OOPframework for building Web, command line, andGUI applications. This project allows PHP developersto integrate and manage code resources, and buildcomplex applications. This application requires thefollowing configuration: PHP 4.3.0 or newer, MySQL4.0.x or newer, Apache 1.3.x or 2.x.

• Fttss is a research project (http://fttss.sourceforge.net) that implements a Text-To-Speech System basedon PHP (4.3.0 or newer) and MySQL (4.1.2 or newer).

Table 5 Vulnerability detection results for Riotpix application

Vulnerability

Type CVE Location Skipfish W3af Wapiti Wasapy AppScan WebInspect Acunetix

NR edit_post.php ✗ ✗ ✗ ✓ ✗ ✗ ✗

NR edit_post_script.php ✗ ✗ ✗ ✗ ✗ ✗ ✗

SQLi NR index.php ✗ ✗ ✗ ✗ ✗ ✗ ✗

NR message.php ✗ ✗ ✗ ✓ ✗ ✗ ✗

NR reader.php ✓ ✓ ✗ ✓ ✗ ✗ ✗

False positive 0 0 0 0 0 0 0

Page 14: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 14 of 16http://www.journal-bcs.com/content/20/1/4

• Riotpix (http://www.riotpix.com/) is anopen-source discussion forum for the Web based onPHP (4.3.0 or newer) and MySQL (4.1.2 or newer).

• Pligg (http://www.pligg.com/) is a socialnetworking open-source CMS (Content ManagementSystem) that permits visitors to register on theWebsite, submit content and connect with otherusers. This software creates Websites where storiesare created and voted on by members. PHP (4.3.0 ornewer) and MySQL (4.1.2 or newer) are required.

We inspected manually the results provided by eachscanner to have more confidence on the number ofdetected vulnerabilities and false-positives.Table 2 presents the results for Cyphor application. All

the scanners detected all the vulnerabilities because errormessages are reported to the client. Thus, it is easy to dis-tinguish successful vulnerability exploitation from errormessages. The underlined results correspond to detec-tions made possible by supplying a valid (login/password)to the scanners to perform authentication. In otherwords, the corresponding vulnerability is only visiblewhen logged in the site (the authentication page doesnot contain any SQL-injection vulnerability, it is the onlyway for any scanner to access the page including thevulnerability).The results reported for Seagull in Table 3 show that

Wasapy is the only one that reports a vulnerability inthis application. Others are unable to do so because theapplication does not report errors to the client. Regard-ing File include vulnerabilities, the injection points whichallow their exploitation are not directly accessible fromthe client interface. Hence, the source code is necessary toidentify these vulnerabilities. This explains the failure ofall scanners.

Fttss is an application that has been tested in [16].Hence, some results associated to the three commer-cial scanners considered are available (cf. Table 4). Thecommercial scanners do not detect the OS command-ing vulnerability, which is the only vulnerability knownof this application. In contrast, W3af and Wasapy areable to identify this vulnerability. It is noteworthy thatnone of tested scanners reports false positives in thiscase.Regarding Riotpix (cf. Table 5), the results are similar

to those of Cyphor. The vulnerabilities are only accessi-ble to successfully authenticated users. Therefore, we hadto provide a valid login/password to all scanners. Twovulnerabilities have not been found by any scanner. Theycorrespond to code injection into variables that are notvisible to the client and thus cannot be discovered byscanners (their identification would require a source codeanalysis). These results also show that Wasapy is efficientfor this kind of vulnerability.Regarding the Pligg application (cf. Table 6), all vul-

nerabilities but the first two are available on hidden injec-tion points. The scanner must be aware of the presenceof the injection point in order to test the vulnerability. Forthe first two vulnerabilities, Wasapy found them, whereasthe other scanners found only one of these vulnerabili-ties. This is due to the fact that error messages are notforwarded to the client.

3.4 SummaryThe main lessons learned from all our experiments aresummarized in the following:

• Wasapy is an efficient scanner, especially inparticular conditions for which it has been designed:(1) it is more efficient than the other freeware

Table 6 Vulnerability detection results for Pligg application

Vulnerability

Type CVE Location Skipfish W3af Wapiti Wasapy AppScan WebInspect Acunetix

2008-7091 login.php ✗ ✗ ✗ ✓ ✗ ✓ ✗

2008-7091 story.php ✓ ✗ ✓ ✓ ✓ ✓ ✓

NR userrss.php ✗ ✗ ✗ ✗ ✓ ✓ ✓

2008-7091 out.php ✗ ✗ ✗ ✗ ✓ ✗ ✓

2008-7091 trackback.php ✗ ✗ ✗ ✗ ✗ ✗ ✗

SQLi 2008-7091 cloud.php ✗ ✗ ✗ ✗ ✗ ✗ ✗

2008-7091 cvote.php ✗ ✗ ✗ ✗ ✗ ✗ ✗

2008-7091 recommend.php ✗ ✗ ✗ ✗ ✗ ✗ ✗

2008-7091 submit.php ✗ ✗ ✗ ✗ ✗ ✗ ✗

2008-7091 vote.php ✗ ✗ ✗ ✗ ✗ ✗ ✗

2008-7091 edit.php ✗ ✗ ✗ ✗ ✗ ✗ ✗

False positive 0 0 0 2 1 1 0

Page 15: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 15 of 16http://www.journal-bcs.com/content/20/1/4

scanners tested when the error reporting is disabled,(2) it is more efficient than the other scanners todiscover and exploit vulnerabilities that are includedin pages not directly accessible (pages that require thesuccessful exploitation of a vulnerability to beaccessed). Indeed, Wasapy is the only one which iscapable of actually exploiting the vulnerability andsupplying the exact corresponding injection requests.

• Wasapy is globally as efficient as the othervulnerability scanners tested on non-modifiedvulnerable applications.

• Our clustering algorithm can be easily adapted todifferent kinds of vulnerabilities. Besides SQLinjections, the results of the experiments show thatWasapy also detects XPATH, OS Commanding andFile Include vulnerabilities and that it is at least asefficient as the other vulnerability scanners.

4 ConclusionIn this paper, we proposed a new methodology that isdesigned to automatically identify Web applications vul-nerabilities and to exhibit attack scenarios targeting theseapplications. This methodology is based on the dynamicanalysis of the application following a black box approach.It is also aimed at reducing the number of false posi-tives by providing the queries that allow the successfulexploitation of the detected vulnerabilities. This advan-tage is twofold since the effective exploitation of vul-nerabilities also allows us to discover new pages in theWeb application that we could not reach before. Thesenew pages may contain new injection points and pos-sibly new vulnerabilities. To validate and evaluate ourapproach, we have carried out two sets of experimentson different types of applications. This approach led alsoto the development of a new vulnerability scanner calledWasapy.Various directions will be considered for extending

the results obtained so far. First, regarding the pro-posed approach for detecting vulnerabilities and gener-ating attack scenarios based on the elaboration of theWebsite navigation graph, optimisations would be neces-sary to master the size of the graph, especially when itis to be applied to complex Web sites. Another perspec-tive would be to enrich the grammars implemented inWasapy to allow the generation of a larger variety forinjections covering the vulnerabilities included so far, aswell as new vulnerabilities.

EndnotesaSee [9] for a more detailed description of these

vulnerabilities.bBNF stands for Backus Normal Form. It is a notation

for grammar writing.

cThe cookie stores at the client side a variety ofinformation about the running session (keys, contents,user preferences, etc.).

dFor simplicity, only the HTML links activated by therequest appear, without explicitly specifying theparameters associated with these requests.

eFrench funded ANR’s program ARPEGE(2009-2011).fThe configuration file of PHP5 includes - For

production Web sites, you are strongly encouraged toturn this feature off, and use error logging instead.

Competing interestsThe authors declare that they have no competing interests.

Authors’ contributionsAll authors have contributed to the different conceptual and experimentalaspects of the approach presented in this paper. All authors read andapproved the final manuscript.

AcknowledgementsThis work was supported by the Agence Nationale de la Recherche throughproject DALI and by the french project Secured Virtual Cloud.

Received: 6 July 2013 Accepted: 30 October 2013Published: 23 January 2014

References1. IBM X-Force (2012) Mid-year Trend and Risk Report, September 2012.

http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=WGL03014USEN

2. Alata E, Kaaniche M, Nicomette V, Akrout R (2013) An automatedvulnerability-based approach for web applications attack scenariosgeneration. In: LADC-2013: Latin-American Symposium on DependableComputing, Rio De Janeiro, 02–05 April 2013, p 9

3. Dessiatnikoff A, Akrout R, Alata E, Kaaniche M, Nicomette V (2011) Aclustering approach for web vulnerabilities detection. In: IEEE Pacific RimInternational Symposium on Dependable Computing (PRDC 2011),Pasadena, 12–14 December 2011, p 10

4. Stefan K, Kirda E, Kruegel C, Jovanovic N (2006) SecuBat: a Webvulnerability scanner. In: Proc. of the 15th Int. conf. on World Wide Web(WWW ’06), Edinburgh, 23–26 May 2006

5. Huang YW, Huang SK, Lin TP, Tsai CH (2003) Web application securityassessment by fault injection and behavioral monitoring. In: Proc. 12thInt. Conf. on World Wide Web (WWW’03), Budapest, 20–24 May 2003

6. Fonseca J, Vieira M, Madeira H (2007) Testing and Comparing Webvulnerability scanning tools for SQL injections and XSS attacks. In: Proc.2007 IEEE Symposium Pacific Rim Dependable Computing (PRDC 2007),Melbourne, 17-19 December 2007, pp 330–337

7. Bau J, Bursztein E, Gupta D, Mitchell J (2010) State of the art: Automatedblack-box Web application vulnerability testing. In: Proc. 2010 IEEESymposium on Security and Privacy, Oakland, 16–19 May 2010

8. Doupé A, Cova M, Vigna G (2010) Why Johnny can’t pentest: An analysisof black-box Web vulnerability scanners. In: Proc. DIMVA 2010, Bonn,8–9 July 2010

9. Akrout R (2010) Web Applications Vulnerability Analysis and IntrusionDetection Systems Assessment. PhD Thesis, University of Toulouse,October 2012 (in French). http://homepages.laas.fr/rakrout/PhD_Thesis.pdf

10. Levenshtein V (1965) Leveinshtein distance. http://en.wikipedia.org/wiki/Levenshtein_distance. Accessed on 22 February 10

11. Johnson SC (1967) Hierarchical clustering schemes. Psychometrika J 2:241–254

12. Kiezun A, Guo PJ, Jayaraman K, Ernst MD (2009) Automatic creation ofSQL Injection and cross-site scripting attacks. In: Software Engineering,2009. ICSE 2009. IEEE 31st International Conference on Vancouver,29–31 August 2009

Page 16: RESEARCH OpenAccess …approach can automatically identify and exploit vulner-abilities. It is also designed to highlight potential attack scenarios including the exploitation of several

Akrout et al. Journal of the Brazilian Computer Society 2014, 20:4 Page 16 of 16http://www.journal-bcs.com/content/20/1/4

13. Gutesman E (2008) gFuzz: An instrumented web application fuzzingenvironment. In: Hack.Lu ’08, Luxembourg, 22–24 October 2008

14. Dupont P (1994) Regular grammatical inference from positive andnegative samples by genetic search: the GIG method. In: Proc. of the 2ndIntl. Colloquium on Grammatical Inference and Applications (ICGI ’94),Alicante, 21–23 September 1994, pp 236–245

15. Dupont P (1996) Incremental regular inference. In: Proc. of the Fourth Intl.Colloquium on Grammatical Inference and Applications (ICGI ’96),Montpellier, 25–27 September 1996, pp 222–237

16. http://anantasec.blogspot.com/. Accessed 09 December 2010

doi:10.1186/1678-4804-20-4Cite this article as: Akrout et al.: An automated black box approach for webvulnerability identification and attack scenario generation. Journal of theBrazilian Computer Society 2014 20:4.

Submit your manuscript to a journal and benefi t from:

7 Convenient online submission

7 Rigorous peer review

7 Immediate publication on acceptance

7 Open access: articles freely available online

7 High visibility within the fi eld

7 Retaining the copyright to your article

Submit your next manuscript at 7 springeropen.com