A.V. Krishna Prasad et. al. / International Journal of Engineering Science and Technology Vol. 2 (7), 2010, 2667-2677

Data Mining for Secure Software Engineering – Source Code Management Tool Case Study

A.V.Krishna Prasad* Dr.S.Rama Krishna1

* Associate Professor Department of Computer Science MIPGS Hyderabad A.P. India Email: [email protected]

1 Professor Department of Computer Science S.V.University Tirupathi A.P. India Email: [email protected]

Abstract

As data mining for secure software engineering improves software productivity and quality, software engineers are increasingly applying data mining algorithms to various software engineering tasks. However, mining software engineering data poses several challenges, requiring various algorithms to effectively mine sequences, graphs and text from such data. Software engineering data includes code bases, execution traces, historical code changes, mailing lists and bug databases. These sources contain a wealth of information about a project's status, progress and evolution. Using well-established data mining techniques, practitioners and researchers can explore the potential of this valuable data in order to better manage their projects and to produce higher-quality software systems that are delivered on time and within budget. Data mining can be used for gathering and extracting latent security requirements, extracting algorithms and business rules from code, and mining legacy applications for requirements and business rules for new projects. Mining algorithms for software engineering fall into four main categories: frequent pattern mining – finding commonly occurring patterns; pattern matching – finding data instances for given patterns; clustering – grouping data into clusters; and classification – predicting labels of data based on already labeled data. In this paper, we give an overview of strategies for data mining for secure software engineering, together with a case study that implements text mining for a source code management tool.

Keywords: Data Mining, Software Engineering, Source Code Management, Text Mining

1. INTRODUCTION

Data Mining for Software Engineering: To improve software productivity and quality, software engineers are increasingly applying data mining algorithms to various software engineering tasks [1-15]. However, mining software engineering data poses several challenges, requiring various algorithms to effectively mine sequences, graphs and text from such data. Software engineering data includes code bases, execution traces, historical code changes, mailing lists and bug databases. These sources contain a wealth of information about a project's status, progress and evolution. Using well-established data mining techniques, practitioners and researchers can explore the potential of this valuable data in order to better manage their projects and to produce higher-quality software systems that are delivered on time and within budget. Data mining can be used for gathering and extracting latent security requirements, extracting algorithms and business rules from code, and mining legacy applications for requirements and business rules for new projects [1-5].

Mining algorithms for software engineering fall into four main categories:

Frequent pattern mining – finding commonly occurring patterns.

Pattern matching – finding data instances for given patterns.

ISSN : 0975-5462 2667


Clustering – grouping data into clusters.

Classification – predicting labels of data based on already labeled data.

Software engineering data can be broadly categorized into:

Sequences, such as execution traces collected at runtime, static traces extracted from source code, and co-changed code locations. Examples of mining algorithms used here are frequent itemset/sequence/partial-order mining and sequence matching/clustering/classification. Examples of software engineering tasks here are programming, maintenance, bug detection and debugging.

Graphs, such as dynamic call graphs collected at runtime and static call graphs extracted from source code. Examples of mining algorithms used here are frequent subgraph mining and graph matching/clustering/classification. Examples of software engineering tasks here are bug detection and debugging.

Text, such as bug reports, e-mails, code comments and documentation. Examples of mining algorithms used here are text matching/clustering/classification. Examples of software engineering tasks here are maintenance, bug detection and debugging.

2. RESEARCH PROBLEM DEFINITION

The research problem addressed here is the mining of software engineering data by means of program source code mining tools that improve software debugging, together with the related challenges. Strategies for debugging-oriented mining include text mining, sequence mining and graph mining [6-10].

Key research questions addressed are:

1. What types of software engineering data are available to be mined?

2. Which software engineering tasks can be helped using data mining?

3. How are data mining techniques used in software engineering?

4. What are the challenges in applying data mining techniques to software engineering data?

5. Which data mining techniques are most suitable for specific types of software engineering data?

The importance of software is increasing in scientific research and in our daily life. Meanwhile, the cost and consequences of software failures caused by software bugs are becoming more and more serious. This research emphasizes a standard process for data mining based software debugging. The proposed process provides guidelines for software testing engineers and researchers on how to apply data mining techniques and software testing theory to real-life software testing projects. A data mining based software debugging project is a five-step process: establish the software testing project; collect, clean and transform the data; select, train and verify the data mining models; classify, locate and describe the software bugs found in the previous steps; and deploy the knowledge gained into real-life software testing projects.

3. OBJECTIVES OF THE RESEARCH WORK

The objective of the research work is to propose strategic data mining tools for program source code debugging that improve software reliability and quality. The mining algorithms work on software engineering data such as sequences, graphs and text, and each kind of data supports particular software engineering tasks: sequences support programming, maintenance, bug detection and debugging; graphs support bug detection and debugging; and text supports maintenance, bug detection and debugging. Initially, a source code management tool is implemented, and finally data mining tools are implemented for Debugging Open Web API Mining [11-15].

Software engineers can start with either a problem-driven approach (knowing which SE task to assist) or a data-driven approach (knowing which SE data to mine), but in practice they commonly adopt a mixture of the first two steps: collecting/investigating data to mine and determining the SE tasks to assist. The three remaining steps are, in order: preprocessing the data; adopting, adapting or developing a mining algorithm; and postprocessing and applying the mining results.

Preprocessing data involves first extracting relevant data from the raw SE data – for example, static method call sequences or call graphs from source code, dynamic method call sequences or call graphs from execution traces, or word sequences from bug report summaries. This data is further processed by cleaning it and properly formatting it for the mining algorithm. For example, the input format for sequence data can be a sequence database in which each sequence is a series of events.

The next step produces a mining algorithm and its supporting tool, based on the mining requirements derived in the first two steps. In general, mining algorithms fall into four main categories:

Frequent pattern mining: finding commonly occurring patterns.

Pattern matching: finding data instances for given patterns.

Clustering: grouping data into clusters.

Classification: predicting labels of data based on already labeled data.

The final step transforms the mining algorithm results into an appropriate format required to assist the SE task. For example, in the preprocessing step, a software engineer replaces each distinct method call with a unique symbol in the sequence database that is fed to the mining algorithm. The mining algorithm then characterizes a frequent pattern with these symbols. In postprocessing, the engineer changes each symbol back to the corresponding method call. When applying frequent pattern mining, this step also includes finding locations that match a mined pattern – for example, to assist in programming or maintenance – and finding locations that violate a mined pattern – for example, to assist in bug detection.
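
The preprocessing, mining and postprocessing steps described above can be illustrated with a small sketch. It is not the algorithm used in this work: it mines only pairs of consecutive method calls, which is a deliberately simplified stand-in for frequent sequence mining. The names SymbolTable, CallSequence and mineFrequentCallPairs are hypothetical and were chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using CallSequence = std::vector<std::string>;  // one trace of method calls
using SymbolPair = std::pair<int, int>;         // two consecutive symbols

// Preprocessing: replace each distinct method call with a unique symbol.
struct SymbolTable {
    std::map<std::string, int> toSymbol;
    std::vector<std::string> toName;

    int encode(const std::string& call) {
        auto it = toSymbol.find(call);
        if (it != toSymbol.end()) return it->second;
        const int symbol = static_cast<int>(toName.size());
        toSymbol.emplace(call, symbol);
        toName.push_back(call);
        return symbol;
    }
};

// Mining and postprocessing: return every pair of consecutive calls that
// occurs in at least minSupport sequences, mapped back to method names.
std::vector<std::pair<std::string, std::string>>
mineFrequentCallPairs(const std::vector<CallSequence>& database, int minSupport) {
    assert(minSupport > 0);
    SymbolTable table;
    std::map<SymbolPair, int> support;

    for (const CallSequence& sequence : database) {
        std::vector<int> encoded;
        for (const std::string& call : sequence) encoded.push_back(table.encode(call));

        // Count each pair at most once per sequence (sequence-level support).
        std::set<SymbolPair> seen;
        for (std::size_t i = 0; i + 1 < encoded.size(); ++i)
            seen.emplace(encoded[i], encoded[i + 1]);
        for (const SymbolPair& pair : seen) ++support[pair];
    }

    // Postprocessing: change each symbol back to the corresponding call.
    std::vector<std::pair<std::string, std::string>> result;
    for (const auto& entry : support) {
        if (entry.second >= minSupport)
            result.emplace_back(table.toName[entry.first.first],
                                table.toName[entry.first.second]);
    }
    return result;
}
```

For a database of three traces, {open, read, close}, {open, read, read, close} and {open, close}, a minimum support of 2 yields the pairs (open, read) and (read, close). A trace in which open is never followed by read would then be a candidate location that violates the mined pattern.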

Text Mining Tool for Program Source Code Debugging

Neglected conditions are an important but difficult-to-find class of software defects. A novel approach for revealing neglected conditions integrates static program analysis and advanced data mining techniques to discover implicit conditional rules.

Data mining for legacy requirements: As of now, more than half of "new" applications are replacements for aging legacy software applications. Some of these legacy applications may have been in continuous use for more than 25 years. Unfortunately, the software industry is lax in keeping requirements and design documents up to date, so for a majority of legacy applications there is no easy way to find out which requirements need to be transferred to the new replacement. However, some automated tools can examine the source code of legacy applications and extract latent requirements embedded in the code. These hidden requirements can be assembled for use in the replacement application. They can also be used to calculate the size of the legacy application in terms of function points, and can thereby assist in estimating the new replacement application. Latent requirements can also be extracted manually using formal code inspections, but this is much slower than automated data mining.

Most software engineering data mining studies rely on well-known, publicly available tools such as association rule mining and clustering. Such black-box reuse of mining tools may compromise the requirements unique to software engineering by fitting them to the tools' undesirable features. Further, many such tools are general purpose and should be adapted to assist the particular task at hand. However, software engineering researchers may lack the expertise to adapt or develop mining algorithms or tools, while data mining researchers may lack the background to understand mining requirements in the software engineering domain. One promising way to reduce this gap is to foster close collaboration between the software engineering community (requirement providers) and the data mining community (solution providers). This research effort represents one such instance.

Writing requirements is a two-way process: statements from Software Requirements Specification (SRS) documents are classified as Functional Requirements (FR) or Non-Functional Requirements (NFR). These statements are systematically transformed into state charts, taking all relevant information into account. Test cases derived from them can be used for automated or manual software testing at the system level. Mining methods can also be used to reduce the test suite, thereby facilitating mining and knowledge extraction from test cases.

4. IMPLEMENTATIONS AND VALIDATIONS

Text Mining Source Code Management Tool Case Study

The management of source code is one of the greatest challenges facing programmers today [16-21]. As programs become larger and more complex, the need to organize and manage source code increases. Our motivation is to implement source code maintenance routines with which C++ programmers gain control over their source code. These routines parse tokens from an ANSI C++ file, format the file, extract the names of its header files and colorize the file. All of these are valuable routines because they can be adapted for any computer language and can be easily extended, owing to their object-oriented implementation in C++ [16-21].

Source code management involves two operations: analysis and manipulation. The analysis of source code can yield useful information about the program that may not be readily apparent or easily obtained. For example, when files are shared among objects, it is difficult to track which files depend on others. A source code maintenance program can parse the source code and produce documentation that describes each class, its member variables and its functions.

Source code can also be manipulated to conform to standardized styles or to better display program flow. For example, a particular indentation format can facilitate code sharing among team members. Maintaining consistently structured code across a team is extremely difficult and time consuming because programmers must modify their individual styles. One programmer may insist that the curly bracket following an if statement appear on a new line, while another may prefer appending the bracket to the same line as the if statement. A source code formatter offers a convenient solution to this problem.

Our focus is to design and implement a source code management program that scans code and outputs it in a slightly different format. Code maintenance modules receive source code as input, break the code down into tokens and then output the tokens in a new format. This output depends on the specific task performed by the module. The utility is based on three class groups: tokens, scanners and parsers. A scanner reads the code, breaks it down into tokens, identifies the type of each token and returns the tokens to the parser. The parser requests successive tokens from the scanner and takes the appropriate action on each before requesting the next one. The simplest parser action is to write out the token. These modules are generic enough to be adapted to many programming languages with few modifications. One utility scans a C++ file and lists all the filenames that have been included in the file being scanned.
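
The scanner and parser roles, and the include-listing utility, can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the tool's implementation: TokenType, Token, Scanner and listIncludes are hypothetical names, only a handful of token types are recognized, and comments are not skipped, so a directive inside a comment would also be reported.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class TokenType { EndOfFile, EndOfLine, WhiteSpace, String, Word, Punctuation };

struct Token {
    TokenType type;
    std::string text;
};

// Scanner: breaks the source text into tokens and reports each token's type.
class Scanner {
public:
    explicit Scanner(std::string source) : source_(std::move(source)) {}

    Token getNextToken() {
        if (pos_ >= source_.size()) return {TokenType::EndOfFile, ""};
        const std::size_t start = pos_;
        const char c = source_[pos_];
        if (c == '\n') { ++pos_; return {TokenType::EndOfLine, "\n"}; }
        if (isBlank(c)) {
            while (pos_ < source_.size() && isBlank(source_[pos_])) ++pos_;
            return {TokenType::WhiteSpace, source_.substr(start, pos_ - start)};
        }
        if (c == '"') {  // string literal, closed on the same line
            ++pos_;
            while (pos_ < source_.size() && source_[pos_] != '"' && source_[pos_] != '\n') ++pos_;
            if (pos_ < source_.size() && source_[pos_] == '"') ++pos_;
            return {TokenType::String, source_.substr(start, pos_ - start)};
        }
        if (isWordChar(c)) {
            while (pos_ < source_.size() && isWordChar(source_[pos_])) ++pos_;
            return {TokenType::Word, source_.substr(start, pos_ - start)};
        }
        ++pos_;
        return {TokenType::Punctuation, std::string(1, c)};
    }

private:
    static bool isBlank(char c) { return c == ' ' || c == '\t' || c == '\r'; }
    static bool isWordChar(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    }
    std::string source_;
    std::size_t pos_ = 0;
};

// Parser: requests successive tokens and collects the file named by each
// #include directive, for both the "file.h" and the <file> form.
std::vector<std::string> listIncludes(const std::string& source) {
    Scanner scanner(source);
    std::vector<std::string> includes;
    auto nextSignificant = [&scanner]() {
        Token t = scanner.getNextToken();
        while (t.type == TokenType::WhiteSpace) t = scanner.getNextToken();
        return t;
    };
    for (Token t = nextSignificant(); t.type != TokenType::EndOfFile; t = nextSignificant()) {
        if (t.type != TokenType::Punctuation || t.text != "#") continue;
        t = nextSignificant();
        if (t.type != TokenType::Word || t.text != "include") continue;
        t = nextSignificant();
        if (t.type == TokenType::String && t.text.size() >= 2 && t.text.back() == '"') {
            includes.push_back(t.text.substr(1, t.text.size() - 2));
        } else if (t.type == TokenType::Punctuation && t.text == "<") {
            std::string name;
            for (t = scanner.getNextToken();
                 t.type != TokenType::EndOfFile && t.type != TokenType::EndOfLine && t.text != ">";
                 t = scanner.getNextToken()) {
                if (t.type != TokenType::WhiteSpace) name += t.text;
            }
            if (t.text == ">") includes.push_back(name);
        }
    }
    return includes;
}
```

Because the parser sees only tokens, the same loop could serve another language by swapping in a scanner that recognizes that language's token types.
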
The second utility generates a colorized version of the input file in HTML (Hypertext Markup Language). After the input file is parsed, each token is surrounded by HTML tags and written out to a new stream, so that the token types appear in unique colors within a browser. A further utility indents the lines of the file according to rules defined for the language. This tool is indispensable for configuration management in the maintenance phase, as it greatly improves the quality of the application under consideration.
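
The colorizing step can be sketched in a few lines. The code below is an illustrative assumption rather than the tool's code: escapeHtml, colorSpan and colorizeCpp are hypothetical names, the color choices are arbitrary, and only end-of-line comments, string literals and a short list of reserved words receive their own color.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Escape the characters that have a special meaning in HTML text.
std::string escapeHtml(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Surround one token with HTML tags that set its color.
std::string colorSpan(const std::string& color, const std::string& text) {
    return "<span style=\"color:" + color + "\">" + escapeHtml(text) + "</span>";
}

// Colorize C++ source text. Tokens of other types are copied through escaped.
std::string colorizeCpp(const std::string& src) {
    static const std::set<std::string> reserved = {
        "class", "for", "if", "int", "return", "while"};
    std::string html = "<pre>";
    std::size_t i = 0;
    while (i < src.size()) {
        const std::size_t start = i;
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (src.compare(i, 2, "//") == 0) {            // end-of-line comment
            while (i < src.size() && src[i] != '\n') ++i;
            html += colorSpan("green", src.substr(start, i - start));
        } else if (c == '"') {                         // string literal
            ++i;
            while (i < src.size() && src[i] != '"' && src[i] != '\n') {
                if (src[i] == '\\' && i + 1 < src.size()) ++i;  // skip escaped character
                ++i;
            }
            if (i < src.size() && src[i] == '"') ++i;
            html += colorSpan("brown", src.substr(start, i - start));
        } else if (std::isalpha(c) != 0 || c == '_') { // word token
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) != 0 || src[i] == '_'))
                ++i;
            const std::string word = src.substr(start, i - start);
            html += reserved.count(word) != 0 ? colorSpan("blue", word) : escapeHtml(word);
        } else {                                       // anything else
            html += escapeHtml(std::string(1, src[i]));
            ++i;
        }
    }
    return html + "</pre>";
}
```

Escaping each token before it is written matters because C++ source routinely contains the characters < and &, which a browser would otherwise interpret as markup.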

We have implemented significant code maintenance routines, which can easily be customized, expanded and enhanced as further requirements arise. They demonstrate the potential of a source code parser: because this is a purely object-oriented implementation, the routines are generic enough to be adapted to many other programming languages with few modifications. This tool goes a long way toward cutting maintenance costs. In addition, C++ programmers find it desirable to identify the classes, and their invoked member functions, across all of the identified include files taken together.

Refer to Figure 1 which provides the sequence diagram of the overall code maintenance process.

Figure 1: The code maintenance process

Refer to Figure 2, which provides a sample class diagram of the CToken hierarchy.

Figure 2: A sample class Diagram of the CToken hierarchy

Refer to Table 1 which provides Classes derived from CToken

Table 1. Token Classes Derived from CToken

Class Name              Class Description
CEOFToken               End of file.
CEOLToken               End of line.
CWhiteSpaceToken        White space (either a space or a tab character).
CEOLCommentToken        End-of-line comment.
CCommentToken           Inline comment (for example: /* my comment */).
CStringToken            String literal (for example: "this is a string").
CCharacterToken         Character literal (for example: ';').
CNumericToken           Any numeric value.
CPunctuationToken       Punctuation (for example: '{' and '.').
CWordToken              Any string of characters that does not fit any other
                        category defined in this table. Reserved words are a
                        special type of CWordToken, but they do not have their
                        own class. Variable and function names are types of
                        CWordToken objects.
CLineContinuationToken  Represents the characters that cause a statement to
                        continue onto the next line (for example: "\").
CStatementEndToken      Signifies the end of a statement. For C++ this would be
                        a semicolon. Other languages may not have such a
                        character; instead, a linefeed is assumed to indicate
                        the end of the statement.
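
One possible shape for such a hierarchy is sketched below. Only the class names, GetType and eTokenTypeEOF are taken from this paper; the remaining enumerators, the GetText accessor, the constructors and the CountWords helper are assumptions made for the illustration, and only four of the classes in Table 1 are shown.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// eTokenTypeEOF appears in the sample code of this paper; the other
// enumerator names are assumed to follow the same pattern.
enum ETokenType { eTokenTypeEOF, eTokenTypeEOL, eTokenTypeWord, eTokenTypePunctuation };

// Base class: every token carries its text and reports its own type.
class CToken {
public:
    explicit CToken(std::string text = "") : m_text(std::move(text)) {}
    virtual ~CToken() = default;
    virtual ETokenType GetType() const = 0;
    const std::string& GetText() const { return m_text; }

private:
    std::string m_text;
};

class CEOFToken : public CToken {
public:
    ETokenType GetType() const override { return eTokenTypeEOF; }
};

class CEOLToken : public CToken {
public:
    CEOLToken() : CToken("\n") {}
    ETokenType GetType() const override { return eTokenTypeEOL; }
};

// Reserved words, variable names and function names are all word tokens.
class CWordToken : public CToken {
public:
    using CToken::CToken;
    ETokenType GetType() const override { return eTokenTypeWord; }
};

class CPunctuationToken : public CToken {
public:
    using CToken::CToken;
    ETokenType GetType() const override { return eTokenTypePunctuation; }
};

// A parser can treat all tokens uniformly through the base class.
std::size_t CountWords(const std::vector<std::unique_ptr<CToken>>& tokens) {
    std::size_t count = 0;
    for (const auto& token : tokens) {
        if (token->GetType() == eTokenTypeWord) ++count;
    }
    return count;
}
```

Keeping the type test behind a virtual function is what lets a scanner for another language add token classes without changing the parsers that consume them.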

Refer to Table 2 which provides Valid Formatting Flags

Table 2. Valid Formatting Flags

Enum Name                  Enum Description
eIndentNone                Does not cause an indent.
eIndentAll                 Indent all the NEXT lines (until an eIndentDecrement
                           token is encountered).
eIndentIgnore              Do not indent the CURRENT line at all when this token
                           is encountered. This is used for tokens such as
                           #include in C++.
eIndentIgnoreStatementEnd  Do not increase the indent even if the statement on
                           the previous line was not reported as ended.
eIndentDecrement           Decrement the indent count for the CURRENT and
                           FOLLOWING lines.
eIndentStatementEnded      Indicates that the text ends a statement or that the
                           statement has ended.
eIndentLineContinuation    Extend the statement to the next line. The following
                           line should be indented as though the statement on
                           the previous line was not completed.
eIndentNewLineBefore       Put this token onto a new line.
eIndentNewLineAfter        Put a new line after this token.
eIndentIgnoreNewLineAfter  If this token appears before a token with
                           eIndentNewLineAfter, then ignore the
                           eIndentNewLineAfter flag (for example, to prevent a
                           linefeed occurring inside a C++ for statement).

Refer to Table 3 which provides Format Strings for C++

TABLE 3. Format Strings for C++

Format String                     Assigned enum flags
"#"                               eIndentIgnore
"{"                               eIndentAll | eIndentIgnoreStatementEnd |
                                  eIndentStatementEnded | eIndentNewLineBefore |
                                  eIndentNewLineAfter
"}"                               eIndentStatementEnded | eIndentDecrement |
                                  eIndentIgnoreStatementEnd | eIndentNewLineBefore
":"                               eIndentStatementEnded
"case", "default"                 eIndentDecrement | eIndentAll
"for"                             eIndentIgnoreNewLineAfter
"private", "protected", "public"  eIndentIgnore
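
Because the flags are combined with the | operator, they behave as bit flags. The sketch below shows one way the format strings could be stored and queried. The flag names come from this paper, but the numeric values, the FormatTable type and the functions LoadFormatTableSketch, FlagsFor and HasFlag are assumptions made for illustration; the tool's own loader is not reproduced here.

```cpp
#include <map>
#include <string>

// Flag names as listed in this paper; the bit positions are assumed.
enum EIndentFlag : unsigned {
    eIndentNone               = 0,
    eIndentAll                = 1u << 0,
    eIndentIgnore             = 1u << 1,
    eIndentIgnoreStatementEnd = 1u << 2,
    eIndentDecrement          = 1u << 3,
    eIndentStatementEnded     = 1u << 4,
    eIndentLineContinuation   = 1u << 5,
    eIndentNewLineBefore      = 1u << 6,
    eIndentNewLineAfter       = 1u << 7,
    eIndentIgnoreNewLineAfter = 1u << 8
};

using FormatTable = std::map<std::string, unsigned>;

// Build the format-string table for C++ from the entries of Table 3.
FormatTable LoadFormatTableSketch() {
    return {
        {"#", eIndentIgnore},
        {"{", eIndentAll | eIndentIgnoreStatementEnd | eIndentStatementEnded |
                  eIndentNewLineBefore | eIndentNewLineAfter},
        {"}", eIndentStatementEnded | eIndentDecrement |
                  eIndentIgnoreStatementEnd | eIndentNewLineBefore},
        {":", eIndentStatementEnded},
        {"case", eIndentDecrement | eIndentAll},
        {"default", eIndentDecrement | eIndentAll},
        {"for", eIndentIgnoreNewLineAfter},
        {"private", eIndentIgnore},
        {"protected", eIndentIgnore},
        {"public", eIndentIgnore}
    };
}

// Tokens without an entry in the table carry no formatting flags.
unsigned FlagsFor(const FormatTable& table, const std::string& token) {
    const auto it = table.find(token);
    return it == table.end() ? static_cast<unsigned>(eIndentNone) : it->second;
}

bool HasFlag(unsigned flags, EIndentFlag flag) {
    return (flags & flag) != 0;
}
```

A formatter would look up each token with FlagsFor and test the individual bits with HasFlag, so supporting another language amounts to loading a different table.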

Refer to Table 4 which provides Format Strings and Flags

TABLE 4. Format Strings and Flags

String Flags Assigned

"#"  The character used to introduce preprocessor items such as #define and #endif. These lines always start in the first column, so they should not be indented at all. The indent flag, therefore, is set to eIndentIgnore.

    // Library include files
    #include <iostream>

"{"  Used to start a block of code. eIndentNewLineBefore combined with eIndentNewLineAfter causes the bracket to appear on its own line. eIndentIgnoreStatementEnd causes the line starting with this bracket not to be indented even if the previous line did not end. The bracket could be placed following a while statement, for example, where the line containing the while statement does not end in a semicolon. eIndentIgnoreStatementEnd allows the bracket to appear directly below the beginning of the while statement. eIndentStatementEnded causes the line ending with a bracket to be considered a complete statement.

    while ( token.GetType ( ) != eTokenTypeEOF )
    {
        cout << token;
        token = scanner . GetNextToken ( );
    }

"}"  The eIndentDecrement flag decrements the indent of the current line and of all the following lines, which returns the indent level to the level before the opening bracket appeared. eIndentNewLineBefore causes the close bracket to start on a new line. Note that the eIndentNewLineAfter flag is not set; setting it would cause closing comments or a semicolon on the same line to be shifted to the following line. eIndentIgnoreStatementEnd is used so that even if the previous line did not end, the bracket is not indented. This is needed when the bracket is used in an enum declaration, because there are no semicolons to indicate that a statement ended within the enum. The last flag, eIndentStatementEnded, is used so that the line following the close bracket is not indented.

"}"  Consider the following fragment:

    enum EType
    {
        eInteger,
            eFloat,
            eDouble
    };
    while ( bNotFinished )
    {
        bNotFinished = GetNext ( );
    }

Note in the above code that eFloat is indented one more than eInteger. This is because the eInteger line does not end with a semicolon and yet there is no character to assign the eIndentIgnoreStatementEnd flag to. Of course, you can alter this behavior. A second caveat concerns array definition and assignment. Because arrays are assigned using brackets as well, the array will appear on its own line.

    int array [ ] = { 1,2,3,4 };

":"  eIndentStatementEnded is used so that the statement following a case or default will not be indented.

"case", "default"  eIndentDecrement and eIndentAll cause the case and default lines to appear at the same level as the switch statement.

    void myfunc()
    {
        switch ( token . GetType( ) )
        {
        case eTokenTypeEOF:
            sOutput += HTML_EOF;
            break;
        default:
            sOutput += token;
        }
    }

"for"  eIndentIgnoreNewLineAfter is set to prevent a new line from occurring even if a statement end is encountered on the same line. This is necessary because for statements include two semicolons, which are end-of-statement characters.

    void MyFunc ( int repeat )
    {
        int i;
        for(i = 0;i < repeat; i ++ )
        {
            cout << "--" ;
        }
    }

One caveat with the for statement: if the original for statement carries over to two lines, then each part of the auto-indented for statement will also appear on its own line.

    void MyFunc ( int repeat )
    {
        int i;
        for ( i = 0;
            i < repeat; i ++ )
        {
            cout << "--";
        }
    }

"private", "public", "protected"  Lines containing these tokens are placed flush against the margin in the same manner as #define and #include. They therefore have the eIndentIgnore flag.

    class MyClass
    {
    public :
        MyClass ( );
        ~MyClass( );
    protected :
        bool bInitialized;
    };
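
The way a few of these flags interact can be sketched with a much reduced re-indenter. It assumes its input already holds one statement or bracket per line, and it covers only the behavior of "{", "}", "#", the access specifiers and unfinished statements; case and default handling, comments and the new-line flags are left out. Trim, StartsWith and ReindentSketch are hypothetical names, not functions of the tool.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Remove leading and trailing spaces and tabs.
std::string Trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

bool StartsWith(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Re-indent lines that already hold one statement or bracket each.
std::vector<std::string> ReindentSketch(const std::vector<std::string>& lines,
                                        const std::string& indentUnit = "    ") {
    std::vector<std::string> out;
    int level = 0;              // current block depth
    bool previousEnded = true;  // did the previous line end a statement?

    for (const std::string& raw : lines) {
        const std::string line = Trim(raw);
        if (line.empty()) { out.push_back(""); continue; }

        const bool opens = line[0] == '{';
        const bool closes = line[0] == '}';
        const bool accessSpecifier = line.back() == ':' &&
            (StartsWith(line, "public") || StartsWith(line, "private") ||
             StartsWith(line, "protected"));
        const bool flush = line[0] == '#' || accessSpecifier;  // cf. eIndentIgnore

        if (closes && level > 0) --level;                      // cf. eIndentDecrement

        int depth = level;
        if (flush) {
            depth = 0;
        } else if (!previousEnded && !opens && !closes) {
            ++depth;  // previous statement unfinished: indent one more
        }             // brackets skip this step, cf. eIndentIgnoreStatementEnd

        std::string indent;
        for (int i = 0; i < depth; ++i) indent += indentUnit;
        out.push_back(indent + line);

        if (opens) ++level;                                    // cf. eIndentAll
        const char last = line.back();
        previousEnded = flush || last == ';' || last == '{' || last == '}' || last == ':';
    }
    return out;
}
```

Given the lines of the enum fragment discussed in Table 4, this sketch places eFloat one level deeper than eInteger, which is the caveat noted there: the eInteger line does not end a statement.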

Refer to Figure 3 which provides Colorization of tokens by SCodeMnt demo1ai.cpp /html

Figure 3. Colorization of tokens by SCodeMnt demo1ai.cpp /html

Refer to Figure 4 which provides Summarization of tokens by SCodeMnt demo1ai.cpp /html

Figure 4. Summarization of tokens by SCodeMnt demo1ai.cpp /html

For details of the implementation, source code and documentation, please refer to the web site http://sites.google.com/site/kpresearchgroup

5. CONCLUSION & FUTURE WORK

Further work includes the following. In modern integrated software engineering environments, software engineers must be able to collect and mine software engineering data on the fly to provide rapid just-in-time feedback. SE researchers, however, usually conduct offline mining of data already collected and stored. Stream data mining algorithms and tools could be adopted or developed to satisfy such challenging mining requirements.

The tool can also be extended in several ways. We can colorize or auto-indent other languages by defining new load functions similar to LoadScannerCPP( ) and LoadFormatCPP( ), so that additional languages can also be displayed in a browser window or automatically formatted. An even better solution would be to change the program so that it reads a language-definition file. With this method, the language definition would be loaded at run time and be data driven, rather than requiring new code each time a new language is to be parsed. We can also add a function that creates a file containing each unique token found in a program file; this token file could be used, for example, as a custom dictionary by a word processor. Finally, we can create a cross-reference database that indicates each line where a particular function or variable is used.

6. REFERENCES

[1] Tao Xie, Suresh Thummalapenta, David Lo, Chao Liu, "Data Mining for Software Engineering", IEEE Computer, August 2009, pp. 55-62.

[2] Hamid Abdul Basit, Stan Jarzabek, "A Data Mining Approach for Detecting Higher-Level Clones in Software", IEEE Transactions on Software Engineering, Vol. 35, No. 4, July/August 2009, pp. 497-514.

[3] Ivano Malavolta, Henry Muccini, Patrizio Pelliccione, Damien Andrew Tamburri, "Providing Architectural Languages and Tools Interoperability through Model Transformation Technologies", IEEE Transactions on Software Engineering, Vol. 36, No. 1, January/February 2010, pp. 119-140.

[4] Tao Xie, Jian Pei, Ahmed E. Hassan, "Mining Software Engineering Data", IEEE 29th International Conference on Software Engineering, ICSE 07.

[5] Francisco P. Romero, Jose A. Olivas, Marcela Genero, Mario Piattini, "Automatic Extraction of the Main Terminology Used in Empirical Software Engineering through Text Mining Techniques", ACM ESEM 08, pp. 357-358.

[6] Mohammed J. Zaki, Christopher D. Carothers, Boleslaw K. Szymanski, "VOGUE: A Variable Order Hidden Markov Model with Duration Based on Frequent Sequence Mining", ACM Transactions on Knowledge Discovery from Data, Vol. 4, No. 1, Article 5, January 2010.

[7] Francine Berman, "Got Data? A Guide to Data Preservation in the Information Age", Communications of the ACM, Vol. 51, No. 12, December 2008, pp. 50-56.

[8] Nizar R. Mabroukeh, Christie I. Ezeife, "Using Domain Ontology for Semantic Web Usage Mining and Next Page Prediction", ACM CIKM 08, pp. 1677-1680.

[9] Tim Menzies, Gary D. Boetticher, "Smarter Software Engineering: Practical Data Mining Approaches", IEEE/NASA 27th Annual Software Engineering Workshop, 2002.

[10] Josh Eno, Craig W. Thompson, "Generating Synthetic Data to Match Data Mining Patterns", IEEE Internet Computing, May/June 2008, pp. 78-82.

[11] O. Maqbool, A. Karim, H.A. Babri, M. Sarwar, "Reverse Engineering using Association Rules", IEEE INMIC 2004, pp. 389-395.

[12] Gang Kou, Yi Peng, "A Standard for Data Mining based Software Debugging", IEEE 4th International Conference on Networked Computing and Advanced Information Management, pp. 149-152.

[13] Qi Wang, Bo Yu, Jie Zhu, "Extract Rules from Software Engineering Quality Prediction Model based on Neural Networks", ICTAI 2004.

[14] Naouel Moha, Yann-Gaël Guéhéneuc, Laurence Duchien, Anne-Françoise Le Meur, "DECOR: A Method for the Specification and Detection of Code and Design Smells", IEEE Transactions on Software Engineering, Vol. 36, No. 1, January/February 2010, pp. 20-36.

[15] Ray-Yaung Chang, Andy Podgurski, Jiong Yang, "Discovering Neglected Conditions in Software by Mining Dependence Graphs", IEEE Transactions on Software Engineering, Vol. 34, No. 5, September/October 2008, pp. 579-596.

[16] Brian W. Kernighan, Rob Pike, "The Practice of Programming", Addison-Wesley, 1999.

[17] Victor R. Volkman, "C/C++ Programmers Tools and Libraries: A Developers Resource Kit of C/C++ and Source Code", R&D Books, Miller Freeman Inc., 1998.

[18] Herbert Schildt, "The Art of C++", Tata McGraw-Hill, 2004.

[19] Steve McConnell, "After the Gold Rush: Creating a True Profession of Software Engineering", Microsoft Press, 1999.

[20] "C, C++, C#: A Reality Check", Developer IQ Software Technology Magazine, Vol. 5, No. 10, October 2005.

[21] www.cio.com/article/120802/Source_Code_Management_Systems_Trends_Analysis_and_Best_Features (last accessed on 28/06/2010).
