[ieee 2011 26th ieee/acm international conference on automated software engineering (ase) -...

Tracing Requirements to Tests with High Precision

and Recall

Celal Ziftci

Computer Science Department

University of California at San Diego

[email protected]

Ingolf Krueger

Computer Science Department

University of California at San Diego

[email protected]

Abstract—Requirements traceability is linking

requirements to software artifacts, such as source code, test-

cases and configuration files. For stakeholders of software, it is

important to understand which requirements were tested,

whether sufficiently, if at all. Hence tracing requirements in

test-cases is an important problem. In this paper, we build on

existing research and use features, realization of functional

requirements in software [15], to automatically create

requirements traceability links between requirements and test-

cases. We evaluate our approach on a chat system, Apache

Pool [21] and Apache Log4j [11]. We obtain precision/recall

levels of more than 90%, an improvement upon currently

existing Information Retrieval approaches when tested on the

same case studies.

Keywords-requirements traceability; testing; program

understanding; automated analysis

I. INTRODUCTION

Requirements Traceability (RT) is defined as the “ability to describe and follow the life of a requirement, in both a forward and backward direction” [1], by “defining and maintaining relationships to related development artifacts” [2] such as source code, test cases and configuration files.

In this paper, we focus on traceability of requirements to tests. Testing is an important part of the software development lifecycle, employed by many software development teams. Based on empirical studies, in many systems, the amount of test code produced is comparable to the code produced for the system itself, ranging from 50 percent less to 50 percent more [5, 26]. Having many tests increases the cost and effort spent on testing, hence the importance of traceability in tests for several stakeholders including developers, testers and managers [3, 9].

Tool support is available to record, maintain and retrieve trace information manually [16]. However, this is error-prone, time consuming, and labor-intensive [3, 25]. Furthermore, RT links get out of date as software evolves. Therefore, it is important and convenient to create, maintain and find out traceability links in tests via an automated process.

In this paper, we represent functional requirements of software using "features", observable units of behavior of a system that can be triggered by a user [15]. We use features to automatically create traceability links between

requirements and test cases. Our work makes the following contributions: (a) A new method for creating traceability links between functional requirements and any kind of executable tests. Our method improves upon precision/recall values obtainable by recent well-known test traceability methods. In our case studies, we observed precision/recall values higher than 90%. (b) An automated process to create the requirements to test traceability links as a by-product of automated software development processes, such as Test Driven Development (TDD) and continuous integration. Unlike the existing approaches, the traceability links always stay up to date, because the requirements specifications are executable.

II. BACKGROUND AND RELATED WORK

Commonly used metrics to measure the quality of a traceability method, precision and recall, describe the accuracy and completeness, respectively, of the retrieved trace links compared to the relevant trace links (see Fig. 1). For RT purposes, obtaining recall values close to 100% is important, since this represents finding all trace links [18]. However, getting high precision values with high recall is also very important, since this represents low number of false positives [7, 18].

Recent effective automated methods for test traceability rely on the analysis of textual information derived from documentation and test code using Information Retrieval (IR) techniques [1, 2, 6, 7, 8, 18, 19]. A well-studied approach is Latent Semantic Indexing (LSI) [20]. In LSI, the documentation of requirements and the source code of test cases are assumed to be likely to share some common and/or synonymous terms. These commonalities between requirements and tests are exploited to retrieve trace links [7]. However, to achieve high accuracy, these methods require rich requirements descriptions and well documented and well maintained source code with up to date comments and source code conventions. They typically suffer from low precision (many false positives) on high recall values [7, 18].

Our work overcomes the challenges of the IR approaches

by using a different approach: it builds on requirements to

source code traceability approaches that use scenarios,

executable specifications for requirements, to find

requirements traces in source code [4, 12, 13, 15, 17]. We

then use these traces to find requirements traces in tests.

978-1-4577-1639-3/11/$26.00 c© 2011 IEEE ASE 2011, Lawrence, KS, USA

472

Compared to IR approaches, our method uses a more

accurate description of requirements, which results in

accurate traces in source code, which in turn results in more

accurate traces in tests.

III. FORTA: TRACING REQUIREMENTS TO TESTS VIA

FEATURES

In this paper, we use the term feature as the realization of

a functional requirement in a system [15], and use the terms

requirement and feature interchangeably.

Fig. 2 summarizes the inputs, flow and output of our

approach and tool (FORTA: Feature Oriented Requirements

Traceability Analysis).

To use our approach, functional requirements, i.e.

features, need to be identified first (Step 1 in Fig. 2).

Considering a “Chat System” example, sample features are

sign-on and send-message. Then we create scenarios that

exercise each feature (such as signing-in in the chat system

using the graphical user interface, or implementing a unit-

test to perform the behavior).

As each scenario is executed, using a profiler or a similar

technology, our tool gathers execution traces, which contain

execution unit information, such as class and method names

(Step 2 in Fig. 2).

Given the execution traces for a scenario, we then find

execution units that can distinguish features from each

other, i.e. execution units observed in that feature but no

others (Step 3 in Fig. 2). We call these distinguishing

execution units feature markers. To do this, we build upon a

well-known approach [17], probabilistic ranking of the

methods observed in the execution trace of a feature

scenario using the following heuristic: if an execution unit is

observed in a single feature only, it is most likely to

represent that requirement, and not so, otherwise. However,

we take the technique in [17] further by allowing the use of

multiple scenarios for a single feature. This way, we do not

miss trace links for features that have multiple ways of

being triggered.

Next, we run the tests of the system and again gather

execution traces (Step 4 in Fig. 2).

Then we find the feature markers of each feature in the

execution traces of the tests (Step 5 in Fig. 2), which is a

contribution of our method. This reveals which test cases

exercised which features. This way, we find the traceability

links between requirements and tests in the form of a

traceability matrix (Step 6 in Fig. 2).

IV. CASE STUDIES

To assess the validity of our approach, we conducted three case studies: a chat system used in teaching a Software Engineering class at UCSD, the open source libraries Apache Pool [21], and Apache Log4j [11]. All of the systems are implemented in Java, and they already had tests prepared to be run with JUnit [24]. Table I summarizes the statistics relevant to requirements traceability for each project.

To compare our results, we implemented two recent, well-known IR techniques for requirements tracing to tests: “Term Frequency Inverse Document Frequency” (TF-IDF) [10], and “Latent Semantic Indexing” (LSI) [20].

We used precision-recall as the indicator of success for each method. Getting a higher precision value in high recall ranges means finding relevant links and most of them correctly, hence providing better traceability results in tests.

The rest of this section explains the input preparations of the case studies for TF-IDF, LSI and FORTA. Finding Requirements: The requirements were gathered through projects’ webpages, javadocs, and comments manually which took about two hours for each project. This preparation corresponds to Step 1 in Fig. 2. Creating Scenarios: As a preparation to Step 2 in Fig. 2, we created scenarios for the chat system manually using the provided graphical user interface of the system. We created scenarios for Apache Pool [21] and Apache Log4j [11] as executable unit tests themselves using annotations in Java, which took about 1 hour for each project. Collecting Execution Traces: For steps 2 and 4 in Fig 2, the execution traces were collected using AspectJ [13] while the scenarios and tests were running. Inputs to TF-IDF and LSI: Both of these approaches require requirements documentations, which were gathered from projects’ manuals and javadocs manually. They also require

Identify Functional

Requirements/

Features

Run scenarios &

gather execution

traces for each

feature

Run tests & gather

execution traces

for each test

Find feature

markers for

each feature

Tag each test

with features it

exercises

Create

Requirements

Traceability

Matrix

Requirements

Traceability

Matrix

Au

tom

atic

an

aly

sis

to fin

d R

eq

uire

me

nts

Tra

ce

ab

ility

Lin

ks

an

d T

est

In

ten

tsIn

pu

t P

rep

ara

tio

n

Output

1

2

3

4

5 6

Figure 2. Inputs, outputs and flow of our approach. The inputs are

execution traces of scenarios and tests, while the output is the Requirements Traceability Matrix between requirements and tests. The

steps with a star indicate a novel contribution of our approach, while

steps with a + indicate that we use existing research and make a contribution additionally.

relevant

retrieved

Figure 1. Precision and Recall: precision is the correctness of the

retrieved links, while recall is the coverage of the relevant links.

| |

| |

| |

| |

{ | } { | }

473

source code of the tests as text, so the test code was parsed using the Eclipse Java abstract syntax tree parser [14] and the terms were indexed using Apache Lucene [22].

V. DISCUSSION

Table II summarizes the results of our approach, along with the results of using the IR techniques TF-IDF and LSI. First, we were able to reproduce and confirm the precision/recall results reported in [7]. We were also able to confirm that LSI performs better than TF-IDF [10], since it can additionally match terms that are synonymous and distinguish those that are polysomic.

When run on the same case studies, our approach achieves better precision (>90%) on high recall levels (first priority in traceability is retrieving all links [18], so we considered recall values >90%, and included in the results the best precision for the recall values >90%), because it uses a more accurate, executable description of requirements. Although our approach requires some extra effort in the initial preparation of the scenarios, we observed this effort to be negligible because the scenarios were short (2-3 lines) compared to tests. They also have the extra benefit that they stay up to date as software evolves (because they can be implemented as unit-tests), which offsets the initial investment in creating them.

As long as there is a profiler available, our approach is agnostic to the programming language used in the system, and can easily be extended to work for other languages.

One limitation of our approach is that it only applies to functional requirements of software currently. Our approach can be complemented with existing approaches like IR to detect non-functional requirements (robustness, security).

An important lesson we learned is that our technique can provide even better results if some customization is performed to handle cases specific to the type of the programming language used to implement the system (such as polymorphism in object-oriented languages).

Another technique that can help boost the accuracy of our approach is excluding the utility execution units (classes, methods) in the traces. Currently this is partly achieved already if there are a reasonable number of features due to the probabilistic ranking of execution units. However, this can be further complemented by drawing upon existing utility class/method detection techniques in the literature [23].

Another big advantage of our approach is that, scenarios are test cases themselves. After source code refactorings/changes, developers fix tests to keep them passing after the changes, which applies to the scenarios as well. Although this demands some effort from developers,

the RT links will stay up to date as software evolves. This does not hold for the IR approaches, because, to specify requirements, they use documentation which may get out of date as software evolves.

A. Threats to Validity

The first threat to the validity of our results is the number of the case studies and the extent they represent production software systems. We chose them to be from different domains to mitigate this threat, which can be further reduced if we experimented with more software systems of varying size from more domains.

Another threat is the selection of requirements and scenarios to obtain execution traces for FORTA. Since we are not domain experts of the software used in the case studies, we cannot claim that we found all requirements and our scenarios are the best ones to capture them. Similarly, we did not use all of the tests for all projects in our analysis, since some tests were exercising requirements that we were not able to identify. For the sake of time, we opted to use only those requirements we could identify, instead of going back and adding more requirements after analyzing the results of the case studies.

Another threat is the preparation of the ground truth for the traceability results in our case studies, which we performed manually. To mitigate risk, we asked two developers to do these tasks and confirmed the results comparing their responses. However, mistakes might still have happened.

Finally, we could not prepare the complete ground truth for Apache Log4j due to the number of test-cases it had, and used only a randomly chosen subset of them. We chose tests from different test classes and packages to mitigate this factor.

VI. CONCLUSION

Requirements traceability (RT) is an important and active research area with many benefits [3, 9]. This paper focuses

TABLE II. REQUIREMENTS TRACEABILITY RESULTS

Methodology

Project Metric TF-IDF LSI FORTA

Chat System

Precision 23% 27% 99%

Recall 99% 93% 99%

TSPa (sec.) 0.017 0.017 0.620

TSAb (sec.) 1.009 1.023 0.537

SETc (MB) - - 8.5

STCd (MB) 0.376 0.376 -

Apache Pool

[21]


Recall 92% 100% 98%

TSPa (sec.) 0.023 0.023 1.257

TSAb (sec.) 1.333 1.445 0.647

SETc (MB) - - 12.6

STCd (MB) 1.14 1.14 -

Apache Log4j

[11]


Recall 100% 100% 100%

TSPa (sec.) 0.022 0.022 0.495

TSAb (sec.) 1.190 1.237 0.359

SETc (MB) - - 10.65

STCd (MB) 0.780 0.780 -

a. TSP: Time spent on preparation (in seconds)

b. TSA: Time spent on analysis (in seconds)

c. SET: Size of execution trace (in megabytes)

d. STC: Size of source code for tests (in megabytes)

TABLE I. CASE STUDY PROPERTIES

Project

#

Lines

of

Code

# Lines

of Test

Code

Ratio of

test code

to source

code

#

Features

# Test

Cases

Chat

System

6,861 3257 0.47 16 20

Apache

Pool [21] 12,626 8690 0.69 16 77

Apache Log4j [11]

52,886 15952 0.30 10 69

474

on the RT problem in tests because testing is an important step in the software development lifecycle, and increasing amounts of test code in production systems [5, 26] escalates the importance of tracing requirements in tests [3].

In this paper, we draw upon existing research to represent functional requirements of a system with features, observable units of behavior of a system that can be triggered by a user [4, 15]. We build on existing research [17] to find features in source code using scenarios, executable actions that trigger features. We take this approach further by using multiple scenarios for a single feature so that we do not miss trace links for features that have multiple ways of being triggered. We then find requirements traces in tests using the traces found in the source code.

Our approach achieves better precision-recall (>90%) than the currently known approaches [1, 2, 6, 7, 8]. Our approach also has many benefits: it does not require the existence of the source code or documentation of the system, it works for different programming languages and on different levels of abstraction preferred, and it is fully automated with no need for human intervention during the analysis.

Finally, we propose an automated process and provide tool support (FORTA) to create the requirements to test traceability links as a by-product of automated software development processes, such as TDD and continuous integration. Using our approach, the traceability links never get out of date due to code and requirement changes, unlike the IR approaches.

Given accurate and complete RT links, many further research possibilities exist to analyze a software system. Some immediate areas are: to monitor the achieved coverage of requirements with existing tests, to provide test failure analysis prioritization and finally to provide test case prioritization.

ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers who helped improve this paper. This research was supported in part by NSF Grant CNS-0932403.

REFERENCES

[1] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo, “Recovering traceability links between code and documentation,” IEEE Transactions on Software Engineering, vol. 28, no. 10, pp. 970–983, 2002.

[2] A. D. Lucia, F. Fasano, R. Oliveto, and G. Tortora, “Recovering traceability links in software artifact management systems using information retrieval methods,” ACM Trans. Softw. Eng. Methodol., vol. 16, September 2007.

[3] O. C. Z. Gotel and C. W. Finkelstein, “An analysis of the requirements traceability problem,” in Proc. First Int Requirements Engineering Conf, 1994, pp. 94–101.

[4] A. Egyed and P. Grunbacher, “Supporting software understanding with automated requirements traceability,” International Journal of Software Engineering and Knowledge Engineering, vol. 15, p. 783, 2005.

[5] E. M. Maximilien and L. Williams, “Assessing test-driven development at IBM,” in Proc. 25th Int Software Engineering Conf, 2003, pp. 564–569.

[6] J. H. Hayes, A. Dekhtyar, and S. K. Sundaram, “Advancing candidate link generation for requirements tracing: the study of methods,” IEEE Transactionson Software Engineering, vol. 32, no. 1, pp. 4–19, 2006.

[7] M. Lormans and A. van Deursen, “Can LSI help reconstructing requirements traceability in design and test?” in Proc. 10th European Conf. Software Maintenance and Reengineering CSMR 2006, 2006.

[8] A. Marcus, J. I. Maletic, and A. Sergeyev, “Recovery of traceability links between software documentation and source code,” International Journal of Software Engineering and Knowledge Engineering, vol. 15, pp. 811–836, 2005.

[9] T. Tamai and M. I. Kamata, “Impact of requirements quality on project success or failure,” in Design Requirements Engineering: A Ten-Year Perspective, ser. Lecture Notes in Business Information Processing. Springer Berlin Heidelberg, 2009, vol. 14, pp. 258–275.3

[10] K. Spaerck Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of Documentation, vol. 28, pp. 11–21, 1972

[11] “Apache log4j,” http://logging.apache.org/log4j/, accessed: 07 May 2011.

[12] N. Wilde and M. C. Scully, “Software reconnaissance: Mapping program features to code,” Journal of Software Maintenance: Research and Practice, vol. 7, no. 1, pp. 49–62, 1995.

[13] “AspectJ,” http://www.eclipse.org/aspectj/, accessed: 07 May 2011.

[14] “Eclipse AST Parser,” http://help.eclipse.org/helios/index.jsp?topic =/org.eclipse.jdt.doc.isv/reference/api/org/eclipse/jdt/core/dom/ASTParser.html, accessed: 07 May 2011.

[15] T. Eisenbarth, R. Koschke, and D. Simon, “Locating features in source code,” IEEE Transactions on Software Engineering, vol. 29, no. 3, pp. 210–224, 2003.

[16] “Doors,” http://www-01.ibm.com/software/awdtools/doors/, accessed: 07 May 2011.

[17] D. Poshyvanyk, Y.-G. Gueheneuc, A. Marcus, G. Antoniol, and V. Rajlich, “Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval,” IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 420–432, 2007

[18] X. Zou, R. Settimi, and J. Cleland-Huang, “Improving automated requirements trace retrieval: a study of term-based enhancement methods,” Empirical Software Engineering, vol. 15, pp. 119–146, 2010.

[19] C. McMillan, D. Poshyvanyk, and M. Revelle, “Combining textual and structural analysis of software artifacts for traceability link recovery,” in Proc. ICSE Workshop Traceability in Emerging Forms of Software Engineering TEFSE ’09, 2009, pp. 41–48.

[20] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society of Information Science, vol. 41, no. 6, pp. 391–407, 1990.

[21] “Apache pool,” http://commons.apache.org/pool/, accessed: 07 May 2011.

[22] “Apache Lucene,” http://lucene.apache.org/java/docs/index.html, accessed: 07 May 2011.

[23] A. Hamou-Lhadj and T. Lethbridge, “Summarizing the content of large traces to facilitate the understanding of the behaviour of a software system,” in Proc. 14th IEEE Int. Conf. Program Comprehension ICPC 2006, 2006, pp. 181–190.

[24] “Junit,” http://www.junit.org/, accessed: 07 May 2011.

[25] S. Brinkkemper, “Requirements engineering research the industry is and is not waiting for,” in Proceedings of the 10th International Workshop on Requirements engineering: Foundation for Software Quality, 2004.

[26] L. Williams, E. M. Maximilien, and M. Vouk, “Test-driven development as a defect-reduction practice,” in Proc. 14th Int. Symp. Software Reliability Engineering ISSRE 2003, 2003, pp. 34–45.

475

[ieee 2011 26th ieee/acm international conference on automated software engineering (ase) -...

Documents