RECOVERING TRACEABILITY LINKS IN REQUIREMENTS DOCUMENTS
Zeheng Li Mingrui Chen LiGuo Huang
Department of Computer Science & Engineering
Southern Methodist University
Dallas, TX 75275-0122
Vincent Ng
Human Language Technology Institute
University of Texas at Dallas
Richardson, TX 75083-0688
Presented By
Narendra Narisetti
Introduction
• Software system development begins with the evaluation and refinement of requirements.
• Documents that capture those requirements in natural language are called "requirements documents".
• The requirements are refined with additional design details and implementation information.
• Linking requirements such that one is a refinement of the other is called "requirements traceability".
Types of Requirements
• Specifically, requirements can be divided into two types:
1. High-level requirements (coarse-grained)
2. Low-level requirements (fine-grained)
• Requirements traceability links each high-level requirement with all the low-level requirements that refine it.
• The traceability mapping is many-to-many.
Example: Pine email system by Sultanov and Hayes
Figure 1: Sample of high- and low-level requirements
Drawbacks:
• Information irrelevant to the establishment of one link may be relevant to the establishment of another link involving the same requirement.
Example: the Description section in UC01 is irrelevant to HR02 but relevant for linking to HR01.
• A link can exist between a pair of requirements even if they share no similar or overlapping content words.
Requirements Traceability Approaches:
• Approaches fall into two types:
Manual approaches: requirements traceability links are recovered manually by developers.
Automated approaches: depend on information retrieval (IR) techniques to generate links automatically.
Automated approaches
• Traceability prediction is cast as a binary classification task.
• It measures the similarity between high- and low-level requirements.
• A positive classification means the high- and low-level requirements are linked.
• Information retrieval (IR) techniques are used for traceability link prediction.
Supervised Learning Methods
• Supervised methods are employed with two types of human-supplied knowledge:
i) Annotator rationales: the information deemed relevant to the establishment of a link by the human annotator. We use these rationales to create additional training instances for the learner.
ii) Hand-built ontology: defined by a domain expert and used to create additional training features for the learner. (see next slide)
Hand-built ontology of pine
Why are ontology-based features useful for traceability links?
1. Lexical features cover only the verbs and nouns that appear in the training data.
2. The verbs and nouns in the ontology are deemed relevant to link identification by a domain expert.
3. Clusters provide robust generalization over individual words/phrases.
Manual vs. Automated
Manual Approach
1. System analysts use requirements management tools to build the RTM.
2. Examples: Rational DOORS, Rational RequisitePro, CASE tools.
3. It is human-intensive and hence error-prone given a large set of requirements.
Automated Approach
1. Calculates textual similarity between requirements. Ex: cosine coefficient, Jaccard coefficient.
2. Tf-idf-based vector space model, Latent Dirichlet Allocation.
3. Depends on IR techniques.
11
Datasets
For our evaluation we use a second dataset, "WorldVistA", an electronic health information system developed by the US Veterans Administration, along with the Pine email system.
Table 1: Statistics on the Datasets
Manual ontology for WorldVistA
Baseline Systems
• The baseline systems employ different methods for traceability prediction:
Unsupervised baselines: tf-idf, LDA
Supervised baselines: word pairs, LDA-induced topic pairs
Unsupervised Baselines
a) The tf-idf baseline: if the cosine similarity between two documents exceeds a given threshold, the pair is classified as positive.
b) The LDA baseline: each document is represented by the probabilities of it belonging to each of the n topics, and cosine similarity is applied to these topic distributions as in the method above.
Note: here LDA is trained to produce n topics.
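The tf-idf baseline above can be sketched in a few lines; the tokenization, corpus handling, and threshold value here are illustrative assumptions, not the authors' exact setup:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors (as term -> weight dicts) for tokenized documents."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def tfidf_baseline(high, low, docs, threshold=0.1):
    """Predict a link iff cosine(tf-idf(high), tf-idf(low)) > threshold."""
    vecs = tfidf_vectors(docs)
    return cosine(vecs[docs.index(high)], vecs[docs.index(low)]) > threshold
```

The threshold would be tuned on a development set rather than fixed in advance.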
Supervised Baseline
• An instance is a pair of a high-level and a low-level requirement.
• An instance is positive if the two requirements are linked; otherwise it is negative.
• Instances can be represented using two types of features:
a) Word pairs: each feature is a pair of words, one taken from each requirement in the training instance.
b) LDA-induced topic pairs: each feature is a pair of topics, and it is active if both topics are among the most probable topics of the high- and low-level requirements.
Note: here LDA is trained, with an additional parameter C, to produce n topics.
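A minimal sketch of the word-pair representation, assuming binary features named by joining the two words (the naming scheme is illustrative):

```python
def word_pair_features(high_tokens, low_tokens):
    """One binary feature per (high-level word, low-level word) combination.

    A linear SVM can then learn a weight for each pair, e.g. rewarding
    ("save", "saves") if such pairs co-occur in linked requirements.
    """
    return {f"{h}|{l}" for h in set(high_tokens) for l in set(low_tokens)}

feats = word_pair_features(["save", "message"], ["user", "saves"])
# contains e.g. "save|user", "save|saves", "message|user", "message|saves"
```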
Exploiting Rationales
Extension:
• To generate extra training instances (pseudo instances), we extend the baseline systems.
• We employ a binary SVM classifier with a linear kernel on the training data, setting all parameters to their default values except the C parameter.
Evaluation:
• The dataset is split for five-fold cross-validation: three folds for training, one fold for the development set, and one fold for evaluation.
• The F-score on the dev set measures the performance of the classifier.
Rationale in Traceability Prediction
• According to Zaidan et al., a rationale is a human-annotated text fragment that motivated an annotator to assign a particular label to a training document.
• In traceability prediction, rationales are identified only for positive instances.
• In traceability prediction, negative instances arise from the absence of evidence that the two requirements should be linked, rather than from the presence of evidence that they should not be linked.
Creating Negative Pseudo Instances
• Steps for creating negative pseudo instances:
i) Select a pair of linked requirements.
ii) Remove the rationales from the requirements; only non-rationale text remains.
iii) The remaining text fragments form pseudo instances that are negative in nature.
iv) From each positive instance, three types of negative pseudo instances are possible:
a) removing all rationales from the high-level requirement;
b) removing all rationales from the low-level requirement;
c) removing all rationales from both requirements.
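The three variants can be sketched as follows, modeling a requirement as a token list and its rationale as a set of tokens (an illustrative simplification; the paper's rationales are text fragments):

```python
def strip_rationale(tokens, rationale):
    """Remove every rationale token from a requirement's token list."""
    return [t for t in tokens if t not in rationale]

def negative_pseudo_instances(high, low, high_rat, low_rat):
    """Return the three (high', low') negative pseudo instances:
    rationale removed from (a) high only, (b) low only, (c) both."""
    return [
        (strip_rationale(high, high_rat), low),
        (high, strip_rationale(low, low_rat)),
        (strip_rationale(high, high_rat), strip_rationale(low, low_rat)),
    ]
```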
Creating Positive Pseudo Instances
• Steps for creating positive pseudo instances:
i) Select a pair of linked requirements.
ii) Remove the text fragments that are not part of a rationale in either requirement.
iii) The remaining text forms a positive pseudo instance.
iv) Add a constraint to the SVM learner so that pseudo instances are classified with less confidence.
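A sketch of steps i)-iii), modeling a requirement as a token list and its rationale as a token set (an illustrative simplification; step iv is handled inside the SVM learner, not here):

```python
def keep_rationale(tokens, rationale):
    """Keep only the rationale tokens of a requirement, preserving order."""
    return [t for t in tokens if t in rationale]

def positive_pseudo_instance(high, low, high_rat, low_rat):
    """Reduce a linked pair to its rationale fragments only."""
    return (keep_rationale(high, high_rat), keep_rationale(low, low_rat))
```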
Soft-margin SVM formulation
i) Positive instances:
ii) Positive pseudo instances:
iii) Negative pseudo instances:
• x_i = training example
• C = error penalty
• v_i, u_ij = positive/negative pseudo instances created from x_i
• c_i ∈ {−1, +1} = class label
• ξ_i = slack variable
• μ = margin size
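The constraint equations on this slide did not survive extraction. One plausible reconstruction, modeled on the annotator-rationale SVM of Zaidan et al. and using the variables listed above (the exact margins and slack bookkeeping are assumptions, not necessarily the authors' formulation):

```latex
\min_{w,\,\xi,\,\xi',\,\xi''} \;\;
  \frac{1}{2}\lVert w \rVert^{2}
  + C\Big( \sum_{i} \xi_i + \sum_{i} \xi'_i + \sum_{i,j} \xi''_{ij} \Big)
\quad \text{subject to:}

\begin{aligned}
\text{i) positive instances:} \quad
  & c_i \,( w \cdot x_i ) \ge 1 - \xi_i \\
\text{ii) positive pseudo instances:} \quad
  & w \cdot v_i \ge \mu - \xi'_i \\
\text{iii) negative pseudo instances:} \quad
  & -\,( w \cdot u_{ij} ) \ge \mu - \xi''_{ij} \\
  & \xi_i,\ \xi'_i,\ \xi''_{ij} \ge 0
\end{aligned}
```

Setting μ < 1 encodes step iv) of the previous slide: pseudo instances must be classified correctly, but with a smaller margin, i.e. with less confidence than real training instances.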
Exploiting an Ontology
• To generate additional features, we supply the SVM learner with a hand-built ontology containing verb and noun clusters.
• Here, each training instance is represented with features derived:
i) from the high-level and low-level requirements;
ii) from the clusters in the ontology.
Ontology Based Features
Five feature types: verb pairs, noun pairs, verb group pairs, noun group pairs, dependency pairs.
• Verb pairs / noun pairs: focus on the verbs/nouns that are relevant to traceability prediction.
• Verb group pairs / noun group pairs: replace verbs/nouns with their cluster IDs and create binary features over the cluster IDs (best performance).
• Dependency pairs: combinations of a verb and a noun connected by a dependency relation, extracted with the Stanford dependency parser.
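The best-performing verb group pairs can be sketched as below; the tiny cluster map is a hypothetical toy example, not the actual Pine or WorldVistA ontology:

```python
def verb_group_pairs(high_verbs, low_verbs, verb_clusters):
    """Binary features over (high cluster id, low cluster id) pairs.

    verb_clusters maps verb -> cluster id; verbs outside the ontology
    are skipped, so the features abstract away individual word choice.
    """
    high_ids = {verb_clusters[v] for v in high_verbs if v in verb_clusters}
    low_ids = {verb_clusters[v] for v in low_verbs if v in verb_clusters}
    return {(h, l) for h in high_ids for l in low_ids}

# Hypothetical clusters for illustration only.
clusters = {"save": "STORE", "store": "STORE", "delete": "REMOVE"}
```

This is why the cluster features generalize: "save" in a high-level requirement and "store" in a low-level one produce the same ("STORE", "STORE") feature.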
Learning the Ontology
Is it possible to learn an ontology rather than hand-building it?
Yes; it involves a 3-step procedure:
Step 1: verb/noun selection
Select verbs, nouns, and noun phrases from the training data such that each candidate:
a) appears more than once;
b) contains at least three characters (excluding short words such as "be" and "is");
c) appears in the high-level requirements but not in the low-level ones, or vice versa.
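The three selection filters can be sketched as follows (how occurrences are counted across documents is an assumption):

```python
from collections import Counter

def select_candidates(high_docs, low_docs):
    """Apply the Step 1 filters to tokenized high-/low-level documents:
    keep tokens that occur more than once, have >= 3 characters, and
    appear on exactly one side (high-level XOR low-level)."""
    high_tokens = [t for d in high_docs for t in d]
    low_tokens = [t for d in low_docs for t in d]
    counts = Counter(high_tokens) + Counter(low_tokens)
    high_set, low_set = set(high_tokens), set(low_tokens)
    return {
        t for t, c in counts.items()
        if c > 1 and len(t) >= 3 and (t in high_set) != (t in low_set)
    }
```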
Learning the Ontology
• Step 2: Verb/noun representation
a) Represent each verb by the set of nouns/NPs it governs, using the Stanford dependency parser.
b) Similarly, represent each noun by the set of verbs collected in Step 1.
• Step 3: Clustering
a) Cluster the verbs and the nouns separately using the single-link algorithm.
b) The algorithm merges the two most similar clusters under a similarity measure and stops when it reaches the desired number of clusters.
This yields the induced clusters for the given datasets.
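A minimal sketch of the single-link procedure in Step 3, assuming items are represented as feature sets compared with Jaccard similarity (the similarity measure here is an assumption; the slide does not specify one):

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def single_link_cluster(items, n_clusters):
    """Agglomerative single-link clustering.

    items: dict mapping item name -> feature set (e.g. a verb -> the
    nouns it governs). Returns a list of sets of item names.
    """
    clusters = [{name} for name in items]
    while len(clusters) > n_clusters:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: similarity of the closest cross-cluster pair
                sim = max(jaccard(items[a], items[b])
                          for a in clusters[i] for b in clusters[j])
                if sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        clusters[i] |= clusters.pop(j)  # merge the most similar pair
    return clusters
```

The O(n³) loop is fine at slide scale; a real implementation would cache pairwise similarities.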
Evaluation
• In the evaluation, we compare the F-scores of the different methods, which depend on the combination of noun clustering, verb clustering, and the C value.
• The F-score depends on two terms:
i) Recall (R): the percentage of links in the gold standard that are recovered by our system.
ii) Precision (P): the percentage of links recovered by our system that are correct.
• The F-score is the harmonic mean of recall and precision.
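These metrics can be computed directly from the predicted and gold link sets:

```python
def precision_recall_f1(predicted, gold):
    """predicted, gold: sets of (high_id, low_id) traceability links."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f
```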
Result of Supervised Systems
Conclusion
• Traceability prediction is a crucial task that benefits from annotator rationales and an ontology.
• The supervised techniques reduce relative error by 11.1-19.7% compared to the baseline techniques.
• The F-scores obtained with manual clusters and with induced clusters are competitive with each other.
• The results might change depending on the datasets.