Towards Applying Text Mining Techniques on Software Quality Standards and Models

Zador Daniel Kelemen1, Rob Kusters2, Jos Trienekens2, Katalin Balla3,4

1ThyssenKrupp Presta Hungary, [email protected]

2Eindhoven University of Technology, [email protected], [email protected]

3Budapest University of Technology and Economics, [email protected]

4SQI – Hungarian Software Quality Consulting Institute, [email protected]

Abstract Many quality approaches are described in hundreds of pages of text (see CMMI, SPICE, Enterprise SPICE, ITIL among others). Manual processing of this information consumes plenty of resources. In this report we present a text mining approach applied to CMMI, a well-known and widely accepted quality approach. The text mining analysis can provide a quick overview of the scope of a quality approach. The result of the analysis could accelerate the understanding and the selection of quality approaches.

Keywords: quality approach, text mining, CMMI, word frequency, scope, model,

standard, improvement framework, software quality

1 Introduction

As we discussed in [1], there are several solutions for choosing quality approaches. Unfortunately, these are not often updated to include new quality approaches (or new versions of existing ones). Easily usable quantitative techniques could therefore fill an important gap in understanding the focus of quality approaches by providing a quick overview of the quality approaches to be used. The application of quantitative tools could serve as a good starting point, especially for quality approaches with lengthy descriptions such as CMMI, ITIL, SPICE or Enterprise SPICE.

In this report we present preliminary results of applying text mining to software quality models and standards.

Following [2], we use the term software quality approach for each software quality standard, method, technique, model and (improvement) framework intended to improve software processes.

In this report we apply only some of the basic text mining techniques, such as tokenization, stopword filtering, word stemming, truncation and canonization. The application of N-grams and other more advanced text mining techniques is not covered in this report.

Previously, we analysed the structure and elements of 7 different process-based software quality approaches in [1] and some other versions in [3], [4]. However, quantitative analyses of quality approaches were not provided there and are not common in the literature, even though they could contribute to better understanding and processing, and even to the comparison [5], of quality approaches [6], [7]. Preliminary complexity analyses were presented in [1], [8]–[13]. These investigations mainly focus on cross-references and interrelations inside quality approaches (among their elements or element instances). In this report we present preliminary results of applying text mining techniques to CMMI [11], [14].

Section 2 briefly presents the text mining approach, and Section 3 describes how the text mining was performed on CMMI. Section 4 describes preliminary results of the analysis. Limitations of this work are discussed in Section 5, and the conclusion is given in Section 6.

2 Approach

In this section we present our approach to applying a simple text mining technique to CMMI. This approach is general and can be applied to any document and thus to other quality approaches.

Figure 1 – CMMI versions and constellations and their elements. Source: [15]

The current version of CMMI, v1.3, defines three constellations: CMMI for Development [16], CMMI for Services [17] and CMMI for Acquisition [18]. Figure 1 presents a summary of CMMI versions 1.1-1.3, their constellations and elements, showing that each constellation has about 400-500 pages and that version 1.3 has 1440 pages in total. Standards such as SPICE, Enterprise SPICE and TMMi, among others, are of similar size. The amount of information in these standards makes them difficult to understand and apply; therefore, in this report we show preliminary results of analysing CMMI from a text mining perspective.

In order to get an overview of the most frequent words in CMMI, we defined and performed the following steps:

1. Selecting and filtering relevant document parts to be analysed

2. Analysing relevant documents

a) Removing useless document parts

b) Tokenization

c) Filtering stopwords

d) Transforming (canonization)

e) Truncation

3. Understanding results
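The outcome of the steps above is a word list that records, for each word, the number of documents containing it and its total count (the two columns reported in Table 1). The following is a minimal sketch of that aggregation only, assuming the documents have already been cleaned and tokenized; the function name `frequency_table` and the toy documents are hypothetical and are not taken from the tool used in this report.

```python
from collections import Counter


def frequency_table(tokenized_docs, top_n=30):
    """Return (word, documents_containing_word, total_count) rows,
    sorted by total count (descending), ties broken alphabetically."""
    total = Counter()
    doc_freq = Counter()
    for tokens in tokenized_docs.values():
        total.update(tokens)
        doc_freq.update(set(tokens))  # a document counts once per word
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, doc_freq[word], count) for word, count in ranked[:top_n]]


# Toy stand-ins for the three constellations (not real CMMI text).
docs = {
    "dev": ["process", "product", "process"],
    "svc": ["process", "service"],
    "acq": ["process", "supplier", "supplier"],
}
rows = frequency_table(docs, top_n=2)
for word, n_docs, count in rows:
    print(word, n_docs, count)
```

Keeping the document count next to the word count makes it visible whether a frequent word is spread over all analysed documents or concentrated in one of them.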

For the text mining analysis we used the free text mining tool RapidMiner [19]. Figure 2 shows a generic process for analysing documents with a text mining tool.

3 Performing text mining on CMMI

In this section we address steps defined in section 2. Figure 2 summarizes the

steps performed (screenshot from RapidMiner – the tool used for text mining).

Figure 2 – Steps of filtering most frequent words in a document

1. Selecting and filtering relevant document parts to be analysed

In order to completely analyse CMMI we selected all constellations (CMMI-DEV,

CMMI-SVC, and CMMI-ACQ) of version 1.3.

2. Analysing relevant documents

2a Removing useless document parts – useless parts of CMMI were removed (e.g.

figures, formatting)


2b Tokenization – Tokenization can be performed in two ways: (1) by single

words and (2) by N-grams (expressions including multiple words). In this quick

search we primarily focused on single word-count analysis.
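The difference between the two tokenization modes can be illustrated with a short sketch. It is an illustration under simple assumptions (tokens are runs of ASCII letters, N-grams are runs of consecutive tokens) and does not reproduce the tokenizer of the tool; the helper names `tokenize` and `ngrams` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into single-word tokens made of letters only."""
    return re.findall(r"[A-Za-z]+", text)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "Establish the work product and review the work product"
tokens = tokenize(sample)

word_counts = Counter(tokens)               # (1) single words
bigram_counts = Counter(ngrams(tokens, 2))  # (2) N-grams with n = 2

print(word_counts.most_common(3))
print(bigram_counts.most_common(2))
```

In single-word mode "work" and "product" are counted separately, whereas the bigram count keeps the expression "work product" together.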

2c Filtering stopwords

We found that several words are common to all documents and could distort the result of text mining. In text mining these are called stopwords. Examples include “page” (which appeared on each page of each document), “this” and “that”, among many others. For filtering stopwords we used two dictionaries:

a) A generic English dictionary included in the tool [19],

b) An additional self-defined dictionary for filtering further common but irrelevant words. We considered as irrelevant those common words which have no connection with the topic (e.g. this, that, etc.).
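Filtering with two dictionaries can be sketched as follows. The word sets below are tiny hypothetical stand-ins: the actual content of the tool's generic English dictionary and of our self-defined dictionary is not listed in this report, and placing "page" in the custom set is an assumption made for the example.

```python
# Hypothetical stand-in for the generic English stopword dictionary.
GENERIC_STOPWORDS = {"this", "that", "the", "of", "and", "a", "is", "in"}
# Hypothetical stand-in for the self-defined, document-specific dictionary.
CUSTOM_STOPWORDS = {"page"}


def filter_stopwords(tokens, *dictionaries):
    """Drop every token that occurs (case-insensitively) in any dictionary."""
    stop = set().union(*dictionaries)
    return [t for t in tokens if t.lower() not in stop]


tokens = ["This", "page", "describes", "the", "process", "area"]
kept = filter_stopwords(tokens, GENERIC_STOPWORDS, CUSTOM_STOPWORDS)
print(kept)
```

Keeping the document-specific words in a separate dictionary means the generic list can be reused unchanged when another quality approach is analysed.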

2d Transforming (canonization) – in order to avoid counting upper- and lower-case variants of the same word separately, we transformed all words to lower case.

2e Truncation – several words and expressions are present in documents in various forms (e.g. work, working); therefore, in order to achieve a clear view, a truncation of words can be performed. For truncation we used two well-known stemming algorithms, Porter and Snowball, which gave similar results.
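The effect of canonization followed by truncation on the counts can be sketched with a deliberately naive suffix stripper. This is not the Porter or the Snowball algorithm used in the analysis; it is a simplified stand-in with three hand-written rules, and the function name `naive_trunk` is hypothetical.

```python
from collections import Counter


def naive_trunk(word):
    """Lower-case a word and strip a few common English suffixes.

    A crude stand-in for a real stemmer: it only handles "-ing",
    "-sses" and a plural "-s", and will mis-handle many other words.
    """
    w = word.lower()                      # canonization: ignore case
    if w.endswith("ing") and len(w) > 5:
        return w[:-3]                     # working -> work
    if w.endswith("sses"):
        return w[:-2]                     # processes -> process
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                     # works -> work
    return w


tokens = ["Work", "working", "works", "process", "Processes"]
word_counts = Counter(tokens)
trunk_counts = Counter(naive_trunk(t) for t in tokens)

print(word_counts)
print(trunk_counts)
```

Five distinct surface forms collapse into two trunks here, which is the same kind of merging that makes the trunk counts in Table 1 larger than the corresponding word counts.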

4 Preliminary results

Table 1 in the appendix lists the 30 most frequently occurring words and trunks (the latter obtained by applying Snowball to the wordlist). The most frequent words (concepts) in CMMI are:

- Process (8946 words, 10853 trunks),

- Product (2338 words, 4370 trunks),

- Work (3706 words, 3751 trunks),

- Project (3170 words, 3556 trunks),

- Service (2934 words, 4219 trunks).

The results suggest that CMMI is not a narrowly specific quality approach but a more widely applicable one. It also seems that CMMI is a highly process-oriented quality approach (which is clearly stated in the model). Going through the first 30 words and trunks, it can be observed that the focus might also be on organization, management, performance, suppliers, training, risks, planning and measurement.

It is important to mention that the preliminary results presented here demonstrate the feasibility of such a quick text mining analysis rather than a thoroughly tested result; thus further investigation is needed.

The preliminary results suggest that text mining tools could be used in practice to understand the focus of quality approaches and to support their selection.


5 Limitations

Here we summarize the limitations of the approach presented in this report.

The three constellations of CMMI contain core process areas which appear in all three documents. These were therefore counted multiple times, because duplications were not filtered. A later sentence duplication analysis of CMMI showed hundreds of sentence duplications across CMMI constellations; these duplications should be filtered and counted only once. CMMI also contains expressions which include the words process, project and work (e.g. process area, work product, project planning, project monitoring and control). These expressions (or N-grams) influence the search results and must be taken into account.
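One possible way to count duplicated sentences only once is to filter them before tokenization. The sketch below is an assumption about how such a filter could look, not the procedure used in the later duplication analysis mentioned above: sentences are split at terminal punctuation and compared after lower-casing and whitespace normalization, and the example texts are invented.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def unique_sentences(documents):
    """Keep the first occurrence of each sentence across all documents."""
    seen = set()
    kept = []
    for text in documents:
        for sentence in split_sentences(text):
            key = " ".join(sentence.lower().split())  # normalized form
            if key not in seen:
                seen.add(key)
                kept.append(sentence)
    return kept


# Invented texts sharing one sentence, as core process areas would.
dev = "Establish and maintain the plan. Develop the product."
svc = "Establish and maintain  the plan. Deliver the service."
kept = unique_sentences([dev, svc])
print(kept)
```

Word counts computed on the filtered sentences would then no longer be inflated by text shared between constellations.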

The limitations of this initial quick search show that the text mining process should be chosen carefully and developed systematically. Both the results and the limitations motivate further research on applying text mining to quality approaches.

The approach presented in this report needs further validation, especially in order to strengthen external validity: the same text mining approach should be performed on further quality approaches and possibly combined with more advanced text mining techniques. Further validation and research are also needed on the meaning and usage of the most and least frequent words, stems and N-grams.


6 Conclusion

Only a limited body of literature deals with the quantitative analysis of quality approaches, although such analysis can be useful for understanding the scope of quality approaches. Data mining tools and basic algorithms are already available for text mining. In this report1 we presented preliminary results of applying basic text mining techniques with the goal of understanding the scope of CMMI. CMMI was chosen as a widely known and accepted quality approach; however, the same techniques, or refined versions of them, may be applicable to other quality approaches.

The meaning and usage of frequent and rare words, stems and expressions in quality approach element instances require further research and validation. This could be achieved through the analysis of further quality approaches and refinement of the techniques applied. Concepts presented in this report can also be used in building and refining automated software quality tools such as the Quality Organizer [23] or HProcessTool [24].

1 This report has been published as a part of a PhD thesis [1] and it is part of a

series of technical reports which include: [13], [20]– [22].


Appendices

A. Most frequent words in CMMI v1.3

Table 1 – list of 30 most frequent words and trunks in CMMI v1.3

    Tokenized wordlist                            Truncated (Snowball) wordlist
#   Word            No. of documents  Word count  Word        No. of documents  Word count
1   process         3                 8946        process     3                 10853
2   work            3                 3706        product     3                 4370
3   project         3                 3170        servic      3                 4219
4   service         3                 2934        work        3                 3751
5   cmmi            3                 2682        project     3                 3556
6   management      3                 2532        perform     3                 3501
7   performance     3                 2437        manag       3                 3459
8   requirements    3                 2406        requir      3                 3022
9   product         3                 2338        plan        3                 2988
10  organization    3                 2194        area        3                 2930
11  area            3                 2044        cmmi        3                 2682
12  products        3                 1903        organ       3                 2546
13  processes       3                 1879        includ      3                 2319
14  organizational  3                 1641        measur      3                 2124
15  information     3                 1589        risk        3                 2089
16  version         3                 1577        develop     3                 2017
17  objectives      3                 1545        establish   3                 1969
18  include         3                 1538        improv      3                 1924
19  analysis        3                 1366        exampl      3                 1863
20  supplier        3                 1359        object      3                 1798
21  data            3                 1298        inform      3                 1769
22  services        3                 1285        supplier    3                 1714
23  training        3                 1274        organiz     3                 1650
24  development     3                 1262        level       3                 1638
25  quality         3                 1261        identifi    3                 1636
26  risk            3                 1225        use         3                 1603
27  plan            3                 1215        version     3                 1594
28  activities      3                 1203        select      3                 1567
29  level           3                 1113        practic     3                 1549
30  system          3                 1110        model       3                 1446
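The relation between the two halves of Table 1 can be checked with simple arithmetic: a trunk count should be at least the sum of the counts of the surface forms that share that trunk. The sketch below uses only numbers copied from Table 1; the grouping of surface forms under each trunk is our reading of the table, and the helper name `unexplained` is hypothetical.

```python
# Word counts copied from the tokenized half of Table 1.
WORD_COUNTS = {
    "process": 8946, "processes": 1879,
    "product": 2338, "products": 1903,
    "service": 2934, "services": 1285,
}
# Trunk counts copied from the truncated (Snowball) half of Table 1.
TRUNK_COUNTS = {"process": 10853, "product": 4370, "servic": 4219}
# Assumed grouping of the listed surface forms under each trunk.
FORMS = {
    "process": ("process", "processes"),
    "product": ("product", "products"),
    "servic": ("service", "services"),
}


def unexplained(trunk):
    """Trunk count minus the counts of the surface forms listed in Table 1."""
    listed = sum(WORD_COUNTS[form] for form in FORMS[trunk])
    return TRUNK_COUNTS[trunk] - listed


for trunk in TRUNK_COUNTS:
    print(trunk, unexplained(trunk))
```

For "servic" the two listed forms account for the whole trunk count (2934 + 1285 = 4219). For "process" and "product" a small remainder is left (28 and 129), which presumably comes from further forms that are not among the 30 most frequent words.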


References

[1] Z. D. Kelemen, Process Based Unification for Multi-Model Software Process Improvement. Eindhoven: Technische Universiteit Eindhoven, 2013.

[2] Z. D. Kelemen, R. Kusters, and J. Trienekens, “Identifying criteria for multimodel software process improvement solutions - based on a review of

current problems and initiatives,” J. Softw. Evol. Process, vol. 24, no. 8,

pp. 895–909, Dec. 2012.

[3] Z. D. Kelemen, K. Balla, J. Trienekens, and R. Kusters, “Structure of Process-Based Quality Approaches - Elements of a research developing a

common meta-model for proces-based quality approaches and methods,” in Proceedings of EuroSPI 2008 Doctoral Symposium, Dublin, Ireland, 2008.

[4] Z. D. Kelemen, K. Balla, J. Trienekens, and R. Kusters, “Towards

supporting simultaneous use of process-based quality approaches,” in Proceedings of 9th international Carpathian Control Conference: ICCC 2008, Sinaia, Romania, 2008, pp. 291–295.

[5] Z. D. Kelemen and K. Balla, “A CMMI-DEV v1.2 és az ISO 9001:2000

kapcsolata,” Magy. Minőség, vol. 17, no. 2, pp. 27–40, 2008.

[6] S. Jeners, H. Lichter, and E. Pyatkova, “Automated Comparison of Process

Improvement Reference Models Based on Similarity Metrics,” 2012, pp.

743–748.

[7] S. Jeners and H. Lichter, “Smart Integration of Process Improvement

Reference Models Based on an Automated Comparison Approach,” in Systems, Software and Services Process Improvement, vol. 364, F.

McCaffery, R. O’Connor, and R. Messnarz, Eds. Springer Berlin Heidelberg,

2013, pp. 143–154.

[8] X. Chen, M. Staples, and P. Bannerman, “Analysis of Dependencies

between Specific Practices in CMMI Maturity Level 2,” in Software Process Improvement, vol. 16, R. V. O’Connor, N. Baddoo, K. Smolander, and R. Messnarz, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp.

94–105.

[9] A. L. Ferreira, R. J. Machado, and M. C. Paulk, “Size and Complexity

Attributes for Multimodel Improvement Framework Taxonomy,” in Software Engineering and Advanced Applications (SEAA), 2010 36th EUROMICRO Conference on, 2010, pp. 306 –309.

[10] A. L. Ferreira, R. J. Machado, and M. C. Paulk, “Quantitative Analysis of

Best Practices Models in the Software Domain,” presented at the 2010 Asia Pacific Software Engineering Conference, 2010.

[11] Z. D. Kelemen, “What is important in CMMI and what are the

interrelations among its elements?,” presented at the International ODF Symposium, Budapest, 28-Jun-2011.

[12] P. Monteiro, R. J. Machado, R. Kazman, and C. Henriques, “Dependency

Analysis between CMMI Process Areas,” in Product-Focused Software Process Improvement, vol. 6156, M. Ali Babar, M. Vierimaa, and M. Oivo,

Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 263–275.

[13] Z. D. Kelemen, R. Kusters, J. Trienekens, and K. Balla, “Towards

Complexity Analysis of Software Process Improvement Frameworks,” Budapest, Technical Report TR201301, Sep. 2013.

[14] K. Balla and Z. D. Kelemen, “Important Concepts In CMMI and What is

Difficult to Understand,” presented at the SEPG Europe 2011, Dublin, 07-Jun-2011.


[15] E. Forrester and G. Wemyss, “CMMI Version 1.3 and Beyond,” Dublin, Ireland, 2011.

[16] C. S. CMMI Product Team, “CMMI for Development, Version 1.3.” CMU SEI, Nov-2010.

[17] C. S. CMMI Product Team, “CMMI for Services, Version 1.3.” CMU SEI, Nov-2010.

[18] C. S. CMMI Product Team, “CMMI for Acquisition, Version 1.3.” CMU SEI, Nov-2010.

[19] Rapid - I, “RapidMiner,” 2012. [Online]. Available: http://rapid-i.com/content/view/181/190/. [Accessed: 29-May-2012].

[20] Z. D. Kelemen, R. Kusters, J. Trienekens, and K. Balla, “A Data Model for

Multimodel Process Improvement,” Budapest, Technical Report TR201303, Sep. 2013.

[21] Z. D. Kelemen, R. Kusters, J. Trienekens, and K. Balla, “Towards Applying

Text Mining Techniques on Software Quality Standards and Models,” Budapest, Technical Report TR201302, Sep. 2013.

[22] Z. D. Kelemen, R. Kusters, J. Trienekens, and K. Balla, “Selecting a Process Modeling Language for Process Based Unification of Multiple Standards

and Models,” Budapest, Technical Report TR201304, Sep. 2013.

[23] Z. D. Kelemen, K. Balla, and G. Boka, “Quality Organizer: a support tool

in using multiple quality approaches,” in Proceedings of 8th International Carpathian Control Conference (ICCC 2007), Strbske Pleso, Slovakia, 2007.

[24] C. Pardo, F. Pino, F. García, F. R. Romero, M. Piattini, and M. T.

Baldassarre, “HProcessTOOL: A Support Tool in the Harmonization of

Multiple Reference Models,” in Computational Science and Its Applications - ICCSA 2011, vol. 6786, B. Murgante, O. Gervasi, A. Iglesias, D. Taniar, and B. O. Apduhan, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg,

2011, pp. 370–382.
