ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters
DESCRIPTION
Information Retrieval (IR) methods have been widely adopted to identify traceability links based on the textual similarity of software artifacts. However, noise due to word usage in software artifacts can negatively affect recovery accuracy. We propose the use of smoothing filters to reduce the effect of noise in software artifacts and improve the performance of traceability recovery methods. An empirical evaluation performed on two repositories indicates that a smoothing filter significantly improves the performance of the Vector Space Model and Latent Semantic Indexing. This result suggests that, besides traceability recovery, the proposed filter can be used to improve the performance of various other software engineering approaches based on textual analysis.
TRANSCRIPT
![Page 1: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/1.jpg)
Improving IR-based Traceability Recovery Using Smoothing Filters
Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella
![Page 2: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/2.jpg)
Software traceability
“The degree to which a relationship can be established between two products of a software development process” [IEEE Glossary for Software Terminology]
Important for: program comprehension, requirement tracing, impact analysis, software reuse, …
Up-to-date traceability links rarely exist → need to recover them
Use case, source code, test case
![Page 3: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/3.jpg)
IR-based traceability recovery
Antoniol et al., 2002 (VSM + probabilistic model); Marcus and Maletic, 2003 (LSI)
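The general idea behind VSM-based recovery, ranking candidate links by the textual similarity of two artifacts, can be sketched as follows. This is a minimal illustration and not the method of either cited work: it assumes raw term frequencies, a toy tokenizer, and no stemming or stop-word removal. The artifact names and texts are invented for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector of a document (toy tokenizer, no stemming)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidate_links(sources, targets):
    """Return (source, target, similarity) triples, most similar first."""
    src_vecs = {name: term_vector(text) for name, text in sources.items()}
    tgt_vecs = {name: term_vector(text) for name, text in targets.items()}
    links = [
        (s, t, cosine(sv, tv))
        for s, sv in src_vecs.items()
        for t, tv in tgt_vecs.items()
    ]
    return sorted(links, key=lambda link: link[2], reverse=True)


# Hypothetical artifacts, invented for this example.
use_cases = {"UC_modify_visit": "the operator modifies the date of a visit for a patient"}
classes = {
    "VisitManager": "visit date patient modify booking",
    "LoginForm": "user password login form",
}
ranked = rank_candidate_links(use_cases, classes)
```

A recovery tool would typically show the top of `ranked` to the engineer, who confirms or rejects each candidate link.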
![Page 4: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/4.jpg)
Traditional IR vs. IR applied to Software Engineering
Traditional IR
- Deals with heterogeneous documents with respect to linguistic choices, syntax, and semantics
- We just live with those differences
IR applied to SE
- We have sets of homogeneous documents with respect to syntax and linguistic choices
- Examples: use cases, test documents, and design documents follow a common template and contain recurrent words
![Page 5: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/5.jpg)
Test case Change the date for a visit:
C51 Version: 0 02 000
Use case Satisfies the request to modify a visit for a patient
UcModVis
Priority High
....
Test description
Input Select a visit:
26/09/2003 11:00 First visit
Change: 03/10/2003 11:00
Oracle Invalid sequence: The system does not allow to change a booking
Coverage Valid classes: CE1 CE8 CE14 CE19 CE21
Invalid classes: None
Problem
Different kinds of software artifacts require specific preprocessing
![Page 6: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/6.jpg)
Problem
Different kinds of software artifacts require specific preprocessing
Artifact-specific words do not bring useful information
![Page 7: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/7.jpg)
A similar problem: image processing
![Page 8: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/8.jpg)
Pixels with peaks of low color intensity
Noise
Noisy images
Pixels with peaks of high color intensity
![Page 9: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/9.jpg)
Mean filter
Reducing noise using smoothing filters
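$$
g(x, y) = \frac{1}{M} \sum_{f(n, m) \in S} f(n, m)
$$
where $S$ is the neighborhood of pixel $(x, y)$ and $M$ is the number of pixels in $S$.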
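The mean filter on this slide can be sketched in a few lines. This is a minimal illustration, assuming a grayscale image stored as a list of rows and a square neighborhood of a given `radius`. The border handling (using only the neighbors inside the image) is an assumption of this sketch, not something the slide specifies.

```python
def mean_filter(image, radius=1):
    """Replace each pixel by the mean of its (2*radius+1)^2 neighborhood S.

    Border pixels use only the neighbors that fall inside the image,
    so M (the number of pixels in S) is smaller there.
    """
    rows, cols = len(image), len(image[0])
    smoothed = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            neighborhood = [
                image[n][m]
                for n in range(max(0, x - radius), min(rows, x + radius + 1))
                for m in range(max(0, y - radius), min(cols, y + radius + 1))
            ]
            smoothed[x][y] = sum(neighborhood) / len(neighborhood)
    return smoothed


# A flat gray image with one pixel that has a peak of high intensity.
noisy = [
    [10, 10, 10],
    [10, 100, 10],
    [10, 10, 10],
]
smoothed = mean_filter(noisy)
```

In this toy image the peak at the center drops from 100 to 20, because it is averaged with its eight neighbors.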
![Page 10: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/10.jpg)
Image vs. traceability noise
Image noise:
- Pixels with high or low color intensity
- Pixels are position dependent
Traceability noise:
- Terms and linguistic patterns occurring in many artifacts of a given category (use cases, test cases, …)
- Artifacts (columns) are position independent (e.g., d1 d2 vs. d2 d1)
![Page 11: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/11.jpg)
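Representing the noise
$$
\begin{array}{c|cccc|cccc}
 & s_1 & s_2 & \cdots & s_k & t_1 & t_2 & \cdots & t_z \\
\hline
\mathrm{word}_1 & v_{1,1} & v_{1,2} & \cdots & v_{1,k} & v_{1,k+1} & v_{1,k+2} & \cdots & v_{1,k+z} \\
\mathrm{word}_2 & v_{2,1} & v_{2,2} & \cdots & v_{2,k} & v_{2,k+1} & v_{2,k+2} & \cdots & v_{2,k+z} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{word}_n & v_{n,1} & v_{n,2} & \cdots & v_{n,k} & v_{n,k+1} & v_{n,k+2} & \cdots & v_{n,k+z}
\end{array}
$$
Source documents: columns $s_1, \dots, s_k$. Target documents: columns $t_1, \dots, t_z$.
- Linguistic information strictly belonging to source documents
- Linguistic information strictly belonging to target documents
- Common information for source documents
- Common information for target documents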
![Page 12: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/12.jpg)
Representing the noise
Mean source vector and mean target vector, where $v_{i,j}$ is the entry for word $i$ in document $j$, columns $1, \dots, k$ are the source documents, and columns $k+1, \dots, m$ are the $z$ target documents:
$$
S = \begin{bmatrix}
\frac{1}{k} \sum_{j=1}^{k} v_{1,j} \\
\frac{1}{k} \sum_{j=1}^{k} v_{2,j} \\
\vdots \\
\frac{1}{k} \sum_{j=1}^{k} v_{n,j}
\end{bmatrix}
\qquad
T = \begin{bmatrix}
\frac{1}{z} \sum_{j=k+1}^{m} v_{1,j} \\
\frac{1}{z} \sum_{j=k+1}^{m} v_{2,j} \\
\vdots \\
\frac{1}{z} \sum_{j=k+1}^{m} v_{n,j}
\end{bmatrix}
$$
- S: common information for source documents
- T: common information for target documents
The mean vectors are like the continuous component of a signal…
![Page 13: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/13.jpg)
Representing the noise
Filtered source set = source documents − S (mean source vector)
Filtered target set = target documents − T (mean target vector)
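The filter on this slide subtracts the mean source vector S from every source document and the mean target vector T from every target document. A minimal sketch follows, assuming the term-by-document matrix is a list of rows (one per word) whose first `k` columns are the source documents. The function names and the toy matrix are hypothetical. The slides do not say how negative weights are handled, so this sketch leaves them unchanged.

```python
def mean_vector(matrix, columns):
    """Row-wise mean of the given columns of a term-by-document matrix."""
    return [sum(row[j] for j in columns) / len(columns) for row in matrix]


def smoothing_filter(matrix, k):
    """Subtract the mean source vector S from the first k columns and the
    mean target vector T from the remaining columns.

    Assumption: negative weights are left as-is; the slides do not say
    whether they should be clipped.
    """
    m = len(matrix[0])
    source_cols, target_cols = range(0, k), range(k, m)
    S = mean_vector(matrix, source_cols)
    T = mean_vector(matrix, target_cols)
    filtered = []
    for i, row in enumerate(matrix):
        filtered.append(
            [row[j] - S[i] for j in source_cols]
            + [row[j] - T[i] for j in target_cols]
        )
    return filtered, S, T


# Toy matrix: rows are words, columns are 2 use cases then 2 classes.
# "actor" is template noise shared by every use case; "class" by every class.
#    uc1  uc2  c1   c2
matrix = [
    [4.0, 4.0, 0.0, 0.0],  # actor
    [0.0, 0.0, 3.0, 3.0],  # class
    [2.0, 0.0, 1.0, 0.0],  # visit
]
filtered, S, T = smoothing_filter(matrix, k=2)
```

In the toy matrix, the rows for "actor" and "class" become all zeros after filtering, while "visit", which distinguishes documents within each set, keeps a non-zero weight.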
![Page 14: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/14.jpg)
Empirical Study
- Goal: analyze the effect of the smoothing filter
- Purpose: investigate how the filter affects traceability recovery
- Quality focus: traceability recovery performance (precision and recall)
- Perspective:
  - Researchers: evaluating the novel technique
  - Project managers: adopting a better traceability recovery technique
- Context: artifacts from two systems, EasyClinic and Pine
![Page 15: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/15.jpg)
Context

| | EasyClinic | Pine |
|---|---|---|
| Description | Medical doctor office management | Text-based email client |
| Language (code) | Java | C |
| Files/Classes | 37 | 31 |
| KLOC | 20 | 130 |
| Documents | 113 | 100 |
| Language (documents) | Italian | English |
| Artifacts | Use cases, interaction diagrams, source code, test cases | Requirements, use cases |
![Page 16: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/16.jpg)
Research Questions and Factors
RQ1: Does the smoothing filter improve the recovery performance of VSM-based traceability recovery?
RQ2: Does the smoothing filter improve the recovery performance of LSI-based traceability recovery?
RQ3: How does the performance vary for different types of artifacts?
Factors:
- Use of filter: YES, NO
- Technique: VSM, LSI
- Artifact: Req., UC, Int. Diagrams, Code, TC
- System: EasyClinic, Pine
![Page 17: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/17.jpg)
Analysis Method
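Performance evaluated by precision and recall:
$$
\text{recall} = \frac{|\text{correct} \cap \text{retrieved}|}{|\text{correct}|}
\qquad
\text{precision} = \frac{|\text{correct} \cap \text{retrieved}|}{|\text{retrieved}|}
$$
We statistically compare the number of false positives of different methods for each correct link identified:
- Wilcoxon Rank Sum test
- Cliff’s delta effect size
[Figure: example comparing the false positives of two methods, M1 and M2]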
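The two metrics and the per-link false-positive counts can be sketched as follows. The reading of "false positives for each correct link" used here (the number of false positives ranked above each correct link in the ranked list) is an assumption of this sketch. The link names are invented.

```python
def precision_recall(retrieved, correct):
    """Precision and recall of a set of retrieved links against the oracle."""
    hits = len(set(retrieved) & set(correct))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


def false_positives_per_correct_link(ranked, correct):
    """For each correct link in the ranked list, count the false positives
    ranked above it (assumed interpretation of the slide)."""
    counts, false_positives = [], 0
    for link in ranked:
        if link in correct:
            counts.append(false_positives)
        else:
            false_positives += 1
    return counts


# Hypothetical ranked list: "a", "b", "c" are correct links, the rest are not.
ranked_links = ["a", "x", "b", "y", "z", "c"]
oracle = {"a", "b", "c"}

precision, recall = precision_recall(ranked_links[:3], oracle)
fp_counts = false_positives_per_correct_link(ranked_links, oracle)
```

The per-link counts produced by two methods could then be compared with a rank-sum test, for example `scipy.stats.ranksums`, if SciPy is available.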
![Page 18: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/18.jpg)
Results
![Page 19: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/19.jpg)
EasyClinic: Use cases into source (VSM)
[Figure: precision vs. recall, filtered vs. not filtered]
![Page 20: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/20.jpg)
EasyClinic: Use cases into source (LSI)
[Figure: precision vs. recall, filtered vs. not filtered]
![Page 21: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/21.jpg)
EasyClinic: Test cases into source (LSI)
[Figure: precision vs. recall, filtered vs. not filtered]
Test cases are:
- Short documents
- Limited vocabulary
- Mostly consistent with source code
![Page 22: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/22.jpg)
Pine: Use cases into requirements (LSI)
[Figure: precision vs. recall, filtered vs. not filtered]
![Page 23: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/23.jpg)
Statistical Comparison

| Data set | Traced artifacts | VSM p-value | VSM effect size | LSI p-value | LSI effect size |
|---|---|---|---|---|---|
| EasyClinic | UC → Code | <0.01 | 0.50 (large) | <0.01 | 0.50 (large) |
| EasyClinic | Int. Diag. → Code | <0.01 | 0.52 (large) | <0.01 | 0.34 (medium) |
| EasyClinic | TC → Code | 1.00 | - (negligible) | 1.00 | - (negligible) |
| Pine | Req. → UC | <0.01 | 0.58 (large) | <0.01 | 0.58 (large) |
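The effect sizes above are Cliff's delta values. A minimal sketch of how such a value can be computed from two samples follows. The magnitude thresholds (0.147, 0.33, 0.474) are commonly cited in the literature and are an assumption here, since the slides do not state which thresholds were used. The sample data are invented.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Label |delta| using commonly cited thresholds (assumed, not from the slides)."""
    size = abs(delta)
    if size < 0.147:
        return "negligible"
    if size < 0.33:
        return "small"
    if size < 0.474:
        return "medium"
    return "large"


# Hypothetical false-positive counts per correct link for two methods.
not_filtered = [2, 2, 3, 5]
filtered = [0, 1, 2, 3]
delta = cliffs_delta(not_filtered, filtered)
label = magnitude(delta)
```

With these thresholds, the labels in the table are reproduced: 0.34 falls in the "medium" band and 0.50 in the "large" band.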
![Page 24: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/24.jpg)
Link precision improvement
Login Patient vs. Person: poor vocabulary overlap (10%)
![Page 25: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/25.jpg)
Threats to validity
Construct validity
- Mainly related to our oracle
- Provided by developers and, for EasyClinic, also peer-reviewed
Internal validity
- Improvements could be due to other reasons…
- However, we compared different techniques (VSM, LSI)
- The approach works well regardless of stop word removal/stemming and use of tf-idf
Conclusion validity
- Conclusions based on proper (non-parametric) statistics
External validity
- We considered systems with different characteristics and artifacts
- … but further studies are desirable
![Page 26: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/26.jpg)
Conclusions
- We proposed the use of a smoothing filter to improve the performance of IR-based traceability recovery
  - Idea inspired by digital signal processing
- The filter significantly improves IR-based traceability recovery based on VSM (RQ1) and LSI (RQ2)
- The filter is particularly suitable for artifacts with higher verbosity (RQ3), e.g., requirements and use cases
- Less useful for artifacts composed of short sentences and using a limited vocabulary, e.g., test cases
![Page 27: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/27.jpg)
Work-in-progress
- Study replication: different systems and artifacts
- Use of relevance feedback
- More sophisticated smoothing techniques: non-linear filters
- Use in other applications of IR to software engineering, e.g., impact analysis or feature location
![Page 28: ICPC 2011 - Improving IR-based Traceability Recovery Using Smoothing Filters](https://reader033.vdocuments.mx/reader033/viewer/2022060121/559448881a28ab150d8b46ec/html5/thumbnails/28.jpg)
That’s all folks!