
Department of Computing and Software
Faculty of Engineering — McMaster University

Comparing Psychometrics Software Development Between CRAN and Other Communities

by

S. Smith, Y. Sun and J. Carette

C.A.S. Report Series CAS-15-01-SS
Department of Computing and Software, January 2015
Information Technology Building
McMaster University
1280 Main Street West
Hamilton, Ontario, Canada L8S 4K1

Copyright © 2015

Comparing Psychometrics Software Development Between CRAN and Other Communities

Spencer Smith∗, Yue Sun and Jacques Carette

Technical Report CAS-15-01-SS
Department of Computing and Software

McMaster University

January 23, 2015

Abstract

Our study looked at the state of practice for developing psychometrics software by ranking 30 software tools with respect to their quality and adherence to software engineering best practices. Our goal was to identify the most successful development approaches and areas for improvement. We compared packages hosted by the Comprehensive R Archive Network (CRAN) to packages developed in other environments. Our analysis showed that R packages use good software engineering practices to improve maintainability, reusability, understandability and visibility; commercial software packages tend to provide the best usability, but do not do as well on verifiability; and software packages for research projects are not uniform in terms of their organization or consistency. The software development process, tools and policies used by CRAN provide an infrastructure that facilitates success, even for small development teams. However, R extensions, and code written in other languages, could improve verifiability and understandability if Literate Programming (LP) were used to maintain the source code and its documentation together in one source.

Keywords: Software quality, Analytic Hierarchy Process (AHP), R, Psycho-metrics

∗Computing and Software Department, McMaster University, [email protected]

1 Introduction

As observed by Wilson et al. (2013), scientists frequently develop their own software because domain-specific knowledge is critical for the success of their applications. However, many of these scientists are self-taught programmers and may be unaware of software development best practices. To help remedy this situation, Wilson et al. (2013) provide general advice for scientific software development. The current paper looks at how well this general advice is applied in the specific scientific domain of psychometrics.

A first look at the state of practice for software in a specific scientific community is provided by Gewaltig and Cannon (2012), for the computational neuroscience domain. (A newer version of their paper is available (Gewaltig and Cannon, 2014), but we reference the original version, since its simpler software classification system better matches our needs.) Gewaltig and Cannon (2012) provide a high level comparison of existing neuroscience software, but little data is given on the specific metrics for their comparison. We build on the idea of studying software built and used by a specific scientific community, while also incorporating detailed measures of software qualities. In this paper, we use the software engineering term software qualities to refer to properties of software such as installability, reliability, maintainability, portability and so on (see Section 2.2 for more details). When we speak of software qualities, we mean the union of these properties.

Here we target 30 packages in the psychometrics domain, developed by different communities using different models. The packages were selected from a combination of three external lists: Wikipedia (2014), the National Council on Measurement in Education NCME (2014), and the Comprehensive R Archive Network CRAN (2014b).

Combining our own ideas and suggestions from Wilson et al. (2013) and Gewaltig and Cannon (2012), we created a grading sheet to systematically measure each software package's qualities. After grading each software package, we used the Analytic Hierarchy Process (AHP) (Saaty, 1994) to quantify the ranking between packages via pair-wise comparisons.

Unlike Gewaltig and Cannon (2012), the authors of this study are not domain experts. Our aim is to analyze psychometrics software with respect to software engineering aspects only. Due to our lack of domain knowledge, algorithms and background theory will not be discussed, and software packages will not be judged according to the different functionalities they provide. We reach our conclusions through a systematic and objective grading process, which incorporates some experimentation with each software package.

Background information is provided in Section 2. The experimental results and basic comparisons between software packages are explained in Section 3, along with information on how the software developed by the CRAN community compares to psychometrics software developed by other communities. Conclusions and recommendations are found in Section 4.

2 Background

This section covers the process used for our study and the rationale behind it. We also introduce the terms and definitions that are used to construct the software quality grading sheet, and the AHP technique, which we used to make the comparisons.

The overall process of our project is as follows:

1. Choose the scientific computing related domain. Here we chose psychometrics as our domain. Psychometrics was chosen because there is an active software community, as evidenced by the list of open source software summarized by NCME (2014), and a large number of R packages hosted by CRAN (2014b). Moreover, we can benefit by having access to the code in many instances, since much of the software is developed non-commercially and is open source.

2. Pick 30 software packages from an authoritative list. For the psychometrics domain this is the combination of the NCME, CRAN and Wikipedia lists mentioned previously. Our idea is to pick 15 R packages hosted by CRAN, and another 15 developed by other communities or groups. Software packages that are common to at least two of the lists (15 out of the 30 packages) were selected first, with the remaining software packages selected randomly.

3. Build our software quality grading sheet. The software grading template is given in Appendix A and is available at: https://github.com/adamlazz/DomainX.

4. Grade each selected software package using the software quality grading sheet. Our main goal here is to remain objective. To ensure that our process is reproducible, we did an experiment by asking different people to grade the same software package. The result showed that different people's standards for grading vary, but the overall ranking of software remained the same, since the overall ranking is based on relative comparisons, and not absolute grades.

5. Apply AHP (Section 2.3) on the grading sheet. As was mentioned above, people may have different standards about grades. The purpose of AHP here is to reduce the impact of these differences.

6. The last step is to analyze the AHP results, using a series of different quality weightings, so that conclusions and recommendations can be made.

2.1 Types of software

We classified the selected software packages with respect to their development model. There are three kinds of software that we investigated:

1. Open source software: “Computer software with its source code made available and licensed so the copyright holder provides the rights to study, change and distribute the software to anyone for any purpose” (Laurent, 2004).

2. Freeware software: “Software that is available for use at no monetary cost, but with one or more restricted usage rights such as source code being withheld or redistribution prohibited” (LINFO, 2006).

3. Commercial software: “Computer software that is produced for sale or that serves commercial purposes” (Dictionary, 2014).

We captured the status of each software project using the following definitions:

• Alive: The software package, related documentation or website has been updated within the last 18 months.

• Dead: The last update of the above information was 18 months ago or longer.

• Unclear: No clear information related to the date is given.

2.2 Software qualities

We use the software engineering terms from Ghezzi et al. (2002) and best practices from Wilson et al. (2013) to inspire our software qualities and quality measures. The qualities concern both end users and developers. Since some terms are not always used with the same meaning in the software engineering literature, we provide our definitions below. The qualities are presented in the order of a typical set of experiments for measuring software quality. Where relevant, information on how the quality will be measured in the current study is given.

• Installability is a measure of the ease of software installation. The quality is largely determined by the amount and quality of information provided by developers. Good installability means detailed and well organized installation instructions, less work to be done by users and automation whenever possible.

• Correctness and Verifiability are related to how much a user can trust the software. Software that makes use of trustworthy libraries (those that have been used by other software packages and tested through time) can bring more confidence to users and developers than self-developed libraries. Carefully documented specifications should also be provided. Specifications allow users to understand the background theory for the software, and its required functionality. Well explained examples (with input, expected output and instructions) are helpful too, so that users can verify that the software can produce the same result on their own machine as expected by the developers.

• Reliability is based on the dependability of the software. Reliable software has a high probability of meeting its stated requirements under a given usage profile over a given span of time.

• Robustness is defined as whether a software package can handle unexpected input. A robust software package should recover well when faced with unexpected input.

• Performance is a measure of how quickly a solution can be found and the resources required for its calculation. Given the constraint that we are not domain experts, in the current context we are simply looking to see whether there is evidence that performance is considered. Potential evidence includes a makefile (or other source file) that shows evidence of a profiler.

• Usability is a measure of how easy the software is to use. This quality is related to the quality and accessibility of information provided by the software and its developers. Good documentation helps with usability. Some important documents include a user manual, a getting started tutorial, and standard examples (with input, expected output and instructions). The documentation should facilitate a user quickly familiarizing themselves with the software. The GUI should have a consistent look and feel for the platform it is on. Good visibility (Herr, 2002) can allow the user to find the functionality they are looking for more easily. A good user support model (e.g. a forum) is beneficial as well.

• Maintainability is a measure of the ease of correcting and updating the software. The benefits of maintainability are felt by future contributors (developers), as opposed to end users. Keeping track of version history and change logs facilitates developers planning for the future and diagnosing future problems. A developer's guide is necessary, since it facilitates new developers doing their job in an organized and consistent manner. Use of issue tracking tools and a concurrent versioning system is a good practice for developing and maintaining software (Wilson et al., 2013).

• Reusability is a measure of the ease with which software code can be used by other software packages. In our project, we consider a software package to have good reusability when part of the software is used by another package and when the API (Application Program Interface) is documented.

• Portability is the ability of software to run on different platforms. We examine a software package's portability through the developers' statements on their website or documents, and the success of running the software on different platforms.

• Understandability (of the code) measures the quality of information provided to help future developers with understanding the behaviour of the source code. We surface check the understandability by looking at whether the code uses consistent indentation and formatting style, if constants are not hard coded, if the code is modularized, etc. Providing a code standard, or design document, helps people become familiar with the code. The quality of the algorithms used in the code is not considered here.

• Interoperability is defined as whether a software package is designed to work with other software, or external systems. We checked whether that kind of software or system exists and if an external API document is provided.

• Visibility/Transparency is a measure of the ease of examining the status of the development of a software package. One thing to check is whether the development process is defined in any document. Another thing to check is the examiner's overall feeling about the ease of accessing information about the software package. Good visibility allows new developers to quickly make contributions to the project.

• Reproducibility is a measure of whether related information, or instructions, are given to help verify a product's results (Davison, 2012). Psychometrics is a scientific computing related domain; the software results need to be reproducible, since this brings confidence to the science. Documentation of the process of verification and validation is required, including details of the development and testing environment, operating system and version number. If possible, test data and automated tools for capturing the experimental context should be provided.

We built a grading sheet, as shown in Appendix A, using the qualities listed above. Each selected software package was marked using the sheet. A sample of the metrics used for the quality of installability is given in Table 1.

Question (Allowed responses)

Are there installation instructions? (Yes/No)
Are the installation instructions linear? (Yes/No)
Is there something in place to automate the installation? (Yes∗/No)
Is there a specified way to validate the installation, such as a test suite? (Yes∗/No)
How many steps were involved in the installation? (Number)
How many software packages need to be installed before or during installation? (Number)
Run uninstall, if available. Were any obvious problems caused? (Unavail/Yes∗/No)

Table 1: Installability metrics, where Unavail means uninstall is not available and the superscript ∗ means that this response should be accompanied by explanatory text.

2.3 Analytic Hierarchy Process (AHP)

“The AHP is a decision support tool which can be used to solve complex decision problems. It uses a multi-level hierarchical structure of objectives, criteria, subcriteria, and alternatives. The pertinent data are derived by using a set of pairwise comparisons. These comparisons are used to obtain the weights of importance of the decision criteria, and the relative performance measures of the alternatives in terms of each individual decision criterion.” (Triantaphyllou, 1995). By using AHP, we can compare between qualities and software packages without worrying about different scales, or units of measurement.

Generally, by using AHP, people can evaluate n options with respect to m criteria. The criteria can be prioritized, depending on the weight given to them. An m × m decision matrix is formed where each entry is the pair-wise weighting between criteria. Then, for each criterion, a pairwise analysis is performed on each of the options, in the form of an n × n matrix a (there is a matrix a for each criterion). This matrix is formed by using a pair-wise score between options as each entry. The entries of the upper triangle of each matrix are scaled from one to nine; the meaning of each score is defined as follows (Triantaphyllou, 1995):

• 1 means equal importance. Two activities contribute equally to the objective.

• 3 means weak importance of one over another. Experience and judgment slightly favor one activity over another.

• 5 means essential or strong importance. Experience and judgment strongly favor one activity over another.

• 7 means demonstrated importance. An activity is strongly favoured and its dominance demonstrated in practice.

• 9 means absolute importance. The evidence favouring one activity over another is of the highest possible order of affirmation.

• 2, 4, 6, 8 are defined as intermediate values between the two adjacent judgments.

In our project, we compare the 30 selected software packages (n = 30) with respect to the 13 software qualities (m = 13) mentioned in Section 2.2. With respect to comparing the overall quality, the importance of each software quality in a given project will depend on the specifics of the project. To approximate this, we experiment with different weights for each property, so that we can investigate how software packages compare under different priorities.

We capture a score, from one to ten, for each software package for each criterion through the grading process mentioned in Section 2.1. We turn these scores into pair-wise scores using the following algorithm:

1. For the given quality, start with 2 scores (one for each package): A and B.

2. If A = 10 and B = 1, or A = 1 and B = 10, the value that is 10 is reset to 9.

3. When A ≥ B, Pair-wise score = A−B + 1.

4. When A < B, Pair-wise score = 1/(B − A + 1).

For example, if installability is measured as an 8 for package A and a 2 for package B, then the entry in matrix a corresponding to A versus B is 7, while the entry corresponding to B versus A is 1/7. The implication is that package A is much preferred over package B for the quality of installability.
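
To make the conversion from grades to AHP input concrete, the following R sketch (not the authors' code) builds the pair-wise comparison matrix for one quality from hypothetical grades and derives a priority vector using the normalized geometric mean of the rows, which is one common AHP prioritization method; the report does not state which prioritization method was used.

    pairwise_score <- function(a, b) {
      # Cap the extreme 10-versus-1 case so entries stay on the 1-9 AHP scale
      if (a == 10 && b == 1) a <- 9
      if (a == 1 && b == 10) b <- 9
      if (a >= b) a - b + 1 else 1 / (b - a + 1)
    }

    ahp_matrix <- function(grades) {
      # grades: named numeric vector of 1-10 grades, one per package
      n <- length(grades)
      m <- matrix(1, n, n, dimnames = list(names(grades), names(grades)))
      for (i in seq_len(n))
        for (j in seq_len(n))
          m[i, j] <- pairwise_score(grades[i], grades[j])
      m
    }

    priority_vector <- function(m) {
      # Normalized geometric mean of the rows approximates the AHP weights
      gm <- apply(m, 1, function(row) exp(mean(log(row))))
      gm / sum(gm)
    }

    grades <- c(pkgA = 8, pkgB = 2, pkgC = 5)  # toy grades for three packages
    priority_vector(ahp_matrix(grades))

With the toy grades above, pkgA receives the largest weight, matching the worked example for installability.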

3 Experimental Results

In this section, the 30 selected software packages will be briefly introduced, the AHP results assuming equal weights between qualities will be presented, and the trends identified in each quality will be discussed. The detailed grading results are in Appendix B and available on-line at https://github.com/adamlazz/DomainX.

3.1 Summary of Software

The selected 30 psychometrics software packages are summarized in Tables 2–4. From these tables we observed the following:

• 19 packages are open source; 8 are freeware; and 3 are commercial (trial version only).

• 15 are developed using R (Table 2) and maintained by the CRAN community; 12 are from university projects, or developed by research groups (Table 3); and 3 are developed by companies for commercial use (Table 4).

• 3 projects use C++; 2 use Java; 2 use Fortran; and 1 uses BASIC. The programming languages of the remaining 7 software packages are not mentioned by the developers.

• All the software packages from CRAN (15/15) are alive, using the definition given in Section 2.1. Two of the commercial software products are alive and one is unclear. For the rest of the software packages (mostly from university projects or research groups), 6/12 are alive, 5/12 are dead and 1/12 is unclear.

3.2 AHP results

The AHP results with respect to each quality are shown and explained in the subsections below. In the charts used to display the results, blue bars (using slashes) stand for packages hosted by CRAN, green bars (using horizontal lines) stand for research group projects and red bars (using vertical lines) stand for commercial software.

3.2.1 Installability

We examined the installability of each software package by installing it on virtual machines. The advantage of virtual machines is that they provide a clean environment for the software installation, so there is no bias created by a pre-existing build environment.

We checked if there are install instructions and whether these instructions are organized linearly. Automated tools for installation and test cases for validation are preferred. The number of steps during installation and the number of external libraries are counted. At the end of the installation and testing, if an uninstaller is available, we ran it to see if any problems were caused. Figure 1 shows the AHP result for installability. Some key findings between the different development communities are as follows:

Name          Reference                 Release  Update  Status  Source     Lang
eRm           Mair et al. (2014)        2007     2014    Alive   Available  R
Psych         Revelle (2014)            2007     2014    Alive   Available  R
mixRasch      Willse (2014)             2009     2014    Alive   Available  R
irr           Gamer et al. (2014)       2005     2014    Alive   Available  R
nFactors      Raiche and Magis (2014)   2006     2014    Alive   Available  R
coda          Plummer et al. (2014)     1999     2014    Alive   Available  R
VGAM          Yee (2013)                2006     2013    Alive   Available  R
TAM           Kiefer et al. (2014)      2013     2014    Alive   Available  R
psychometric  Fletcher (2013)           2006     2013    Alive   Available  R
ltm           Rizopoulos (2014)         2005     2014    Alive   Available  R
anacor        de Leeuw and Mair (2014)  2007     2014    Alive   Available  R
FAiR          Goodrich (2014)           2008     2014    Alive   Available  R
lavaan        Rosseel et al. (2014)     2011     2014    Alive   Available  R
lme4          Bates et al. (2014)       2003     2014    Alive   Available  R
mokken        van der Ark (2013)        2007     2013    Alive   Available  R

Table 2: CRAN (R) software packages

Name          Reference            Released  Updated  Status   Source code    Language
ETIRM         Wothke (2008)        2000      2008     Dead     Available      C++
SCPPNT        Wothke (2007)        2001      2007     Dead     Available      C++
jMetrik       jMetrik (2014)       1999      2014     Alive    Not Available  Java
ConstructMap  Berkeley (2012)      2005      2012     Dead     Not Available  Java
TAP           TAP (2014)           Unclear   Unclear  Unclear  Not Available  Unclear
DIF-Pack      DIF (2012)           Unclear   2012     Alive    Available      Fortran
DIM-Pack      DIM (2012)           Unclear   2012     Alive    Available      Fortran
ResidPlots-2  Liang et al. (2008)  Unclear   2008     Dead     Not Available  Unclear
WinGen3       Han (2013)           Unclear   2013     Alive    Not Available  Unclear
IRTEQ         Han (2011)           Unclear   2011     Dead     Not Available  Unclear
PARAM         Rudner (2012)        Unclear   2012     Alive    Not Available  Basic
IATA          IATA (2014)          Unclear   2014     Alive    Not Available  Unclear

Table 3: Other research group projects

Name      Reference        Released  Updated  Status   Source code    Language
MINISTEP  MINISTEP (2014)  1977      2014     Alive    Not Available  Unclear
MINIFAC   MINIFAC (2014)   1987      2014     Alive    Not Available  Unclear
flexMIRT  flexMIRT (2014)  Unclear   Unclear  Unclear  Not Available  C++

Table 4: Commercial software packages

Figure 1: AHP result for installability

• R packages hosted by CRAN: The overall results of the installability of R are positive. The CRAN community provides general installation instructions (CRAN, 2014a) for all the packages it maintains. However, in some cases links to the general instructions are not given on the software package page, which may cause confusion for new users. The following packages addressed this problem by providing detailed installation information on their own website: TAM, anacor, FAiR, lavaan, lme4 and mokken. All the packages installed easily and automatically. Most of the packages have dependencies on other R packages, but the required packages can be found and installed automatically. The uninstallation process is as easy as the installation process. A drawback of R packages for installability is that none of the packages provide a standard suite of test cases specifically for the verification of installation. lavaan provides a simple test example, without output data, which does show some consideration toward verification of the installation process.

• Commercial software: The overall results of the installability of commercial software are as good as for the R packages. They have well organized installation instructions, automated installers and only a few tasks needed to be done manually. However, commercial software tends to have the same problem as the R packages – in no instance is a standard suite provided specifically for the verification of the installation.

• Research group projects: The results of installability in the research group projects are uneven. The good examples shown in Figure 1 are of similar quality to packages hosted by CRAN and the commercial products. Software that is ranked lower either gives no installation instructions (5 out of 12), or gives instructions that are not linearly organized (1 out of 12). The developers may think that the installation process is simple enough that there is no need for installation instructions. However, even for a simple installation, it is good practice to have instructions to prevent trouble for new users. Another common problem is the lack of a standard suite of test cases; only ConstructMap and PARAM showed consideration of this issue.

3.2.2 Correctness and Verifiability

We checked the correctness and verifiability of each software package by examining the relevant documents and running the software on virtual machines. We checked for evidence of trustworthy libraries and a requirements specification. With respect to the requirements specification, we did not require the document to be written in a strict software engineering style. The specification was considered adequate if it described the relevant mathematics. We tried the given examples, if any, to see if the result produced on our machine matched the expected output. The results for correctness and verifiability of the psychometrics domain are shown in Figure 2. The following observations were made:

Figure 2: AHP result for correctness and verifiability

• R packages hosted by CRAN: The overall results of the correctness and verifiability of R are positive. CRAN provides information on software package dependencies and reverse dependencies. Reusing packages lifts some of the burden from developers, and reused packages are generally more trustworthy than newly developed libraries. With respect to requirements specification, 10 out of the 15 R packages published technical reports or papers about the mathematics used and how they relate to the software. All of the CRAN products have complete and consistent documentation because R extensions must satisfy the CRAN Repository policy, which includes standardized interface documentation through Rd (R documentation) files. Rd is a markup language that can be processed to create documentation in LaTeX, HTML and text formats (R Core Team, 2014); a minimal Rd skeleton is sketched after this list. Although not required by the CRAN policy, some of the R extensions also include vignettes, which provide additional information in a freer format than the Rd documentation allows. Vignettes can act as user manuals, tutorials, and extended examples. Many vignettes are written with Sweave (Leisch, 2002), which is a Literate Programming (LP) (Knuth, 1984) tool for R. All the R packages have examples about how to use the software, but 9 out of 15 did not provide expected output data, and there was a small precision difference that occurred when running ltm and comparing to the expected results.

• Commercial software: Compared to R, commercial software provided less information for building confidence in correctness and verifiability. None of the selected packages mentioned whether they used existing popular libraries during development. Neither did they provide requirement specification documents, or reference manuals. This may be a result of the nature of commercial software; therefore, we should not conclude that these software packages are not correct or verifiable. The advantage of commercial software is that they all provide standard examples with relevant output, and all the calculated results produced on our machine matched the expected output.

• Research group projects: The results for research group projects are similar to those for commercial software. Only 2 of the research groups mentioned that standard libraries were used, and none of the packages provides requirement specification documents or reference manuals. Examples are given in most cases (11 out of 12), but 9 out of 11 did not provide input or output data.
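
To illustrate the standardized Rd interface documentation mentioned in the first bullet above, here is a minimal, hypothetical Rd file skeleton; the function, arguments and text are invented for this example and are not taken from any of the surveyed packages.

    % File man/estimate_difficulty.Rd (hypothetical)
    \name{estimate_difficulty}
    \alias{estimate_difficulty}
    \title{Estimate Item Difficulty}
    \description{
      Estimates a difficulty value for each item in a matrix of scored responses.
    }
    \usage{
    estimate_difficulty(responses)
    }
    \arguments{
      \item{responses}{A 0/1 matrix with persons in rows and items in columns.}
    }
    \value{
      A numeric vector with one difficulty estimate per item.
    }
    \examples{
    responses <- matrix(rbinom(30, 1, 0.5), nrow = 10)
    estimate_difficulty(responses)
    }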

Little evidence of verification via testing was found among the three classes of software. The notable exception to this is the diagnostic checks done by CRAN on each of the R extensions. To verify that the extensions will work, the R package checker, R CMD check, is used. The tests done by R CMD check include R syntax checks, Rd syntax and completeness checks and, if available, example checks, to verify that the examples produce the expected output (R Core Team, 2014). These checks are valuable, but they focus more on syntax than on semantics; the R community has not seemed to fully embrace automated testing. The common development practices of R programmers do not usually include automatically re-running test cases (Wickham, 2011).
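
For reference, the checker described above is normally run from a shell on a built source package; the package name below is a placeholder.

    # From a shell (package tarball name is hypothetical):
    #   R CMD build mypkg
    #   R CMD check --as-cran mypkg_0.1.0.tar.gz
    # or, equivalently, from within an R session:
    system2("R", c("CMD", "check", "--as-cran", "mypkg_0.1.0.tar.gz"))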

Of the three classes of software, the R software provides the most complete and consistent documentation, but there also seems to be a missed opportunity here. If there is a drawback to the R documentation, it is that the documents only assist with the use of the tools, not their verification. The LP examples include calls to the R extension to illustrate their use, but they do not generally use LP to show the R implementation along with its explanation, so that it can be inspected and verified. LP can be used to improve verifiability by maintaining the documentation and code together in one source (Nedialkov, 2010), but the CRAN community does not appear to follow this approach.
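
As an illustration of the style of literate source meant here, the following is a minimal, hypothetical Sweave (.Rnw) fragment in which the explanatory prose and the implementation are kept in one file; the function name and text are invented for the example.

    % analysis.Rnw (hypothetical Sweave source: LaTeX text plus R code chunks)
    \section{Total score}
    Each response is scored dichotomously, so the total score for a person is
    simply the row sum; the rationale for this choice is documented here,
    immediately beside the code that implements it.

    <<total-score, echo=TRUE>>=
    total_score <- function(responses) {
      # responses: a 0/1 matrix with persons in rows and items in columns
      rowSums(responses)
    }
    @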

3.2.3 Reliability (Surface)

We checked the reliability by installing and running the software on our virtual machine using the instructions given with the software. In this way, we can find whether there is any problem when people behave as the developers expect.

Unlike in Section 3.2.1, here we are not looking at the steps for the installation, but rather the results of the installation; that is, we are looking at the reliability of the installation by judging whether the installation breaks or not. We also examined whether there is any problem during initial tutorial testing. Our checks can only be considered as a surface measure of reliability, since we did not check reliability with respect to the code for anything other than a small set of test cases.

The results for reliability of the psychometrics domain are shown in Figure 3. The comparison between the different development communities is as follows:

Figure 3: AHP result for reliability

• R packages hosted by CRAN: For 11 out of 15 R packages hosted by CRAN, no problems happened during installation and initial tutorial testing. There were small installation problems for eRm, Psych, anacor and FAiR because the install instructions were not up to date with the software. The problems included dead URLs and a lack of required packages. Our suggestion is that developers should maintain their install instructions, or point users to the general instructions given by CRAN.

• Commercial software: Commercial software did extremely well on this quality. There were no problems during installation and initial tutorial testing when using the given instructions.

• Research group projects: Half of the research group projects had no problems. The other half suffered from problems like makefile errors (SCPPNT), outdated instructions (WinGen3), or an absence of instructions.

3.2.4 Robustness (Surface)

We checked the robustness of the software by providing it with “bad” input. We wanted to know how well the software deals with situations that the developers may not have anticipated. For instance, we checked whether the software reasonably handles garbage input (a reasonable response may be an appropriate error message) and whether the software gracefully handles text input files where the end of line character follows a different convention than expected. Like reliability, this is only a surface check of this quality. With reference to Figure 4, the comparison between the different development communities for the quality of robustness is as follows:

Figure 4: AHP result for robustness

• R packages hosted by CRAN: 10 out of 15 R packages hosted by CRAN handle unexpected input reasonably; they provide information or warnings when the user enters invalid data. R packages do not use text files as input files; therefore, a change of format in a text file is not an issue.

• Commercial software: 2 out of 3 commercial software packages do well on robustness, but there is one case (MINISTEP) where there was no warning when the user tried an input with an invalid format. When this happened, the software crashed.

• Research group projects: 7 out of 12 research group projects had positive results for robustness. The rest had problems such as an inability to handle cases where the input file is not present, or where it does not have the designated name.

3.2.5 Performance (Surface)

No useful evidence or trends were found for this quality; therefore, no comparison can be made.

3.2.6 Usability (Surface)

We checked usability by looking for documentation that helps users with the software. Better usability means the users get more help from the developers. We checked for a getting started tutorial, for a fully worked example and for a user manual. Also, we looked for a reasonably well designed GUI, a clear declaration of the expected user characteristics, and for information about the user support model. The results for usability of the psychometrics domain are shown in Figure 5. With respect to a comparison between development communities, we observed the following:

Figure 5: AHP result for usability

• R packages hosted by CRAN: The overall results for R packages are good. Thanks to the CRAN repository policy, Rd files and Sweave vignettes, the R extensions provide complete and consistent documentation to the users. With respect to usability, two notably good examples are mokken and lavaan. These two R packages give detailed explanations for their examples and provide well-organized user manuals. However, R packages have a common drawback – only a few (4 out of 15) provide getting started tutorials. The common user support model is an email address, with 2 out of the 15 (eRm and anacor) providing a discussion forum as well.

• Commercial software: Commercial software did better than the R packages on this quality. They all have getting started tutorials, explanations for their examples and good user manuals. The drawback for the commercial software is that their GUIs do not have the usual look and feel for the platforms they are on. For the user support model, 2 out of 3 (MINISTEP and MINIFAC) provide user forums and a feedback section, in addition to email support.

• Research group projects: Research groups did not perform consistently with respect to this quality. As shown in Figure 5, software packages like IATA, ConstructMap and TAP (Test Analysis Program) provided the best examples for others to follow.

3.2.7 Maintainability

We checked maintainability by looking for evidence that the software has actually been maintained and that consideration was given to assisting developers with maintaining the software. The evidence we looked for was a history of multiple versions of the software, documentation on how to contribute or review code, and a change log. In cases where there was a change log, we looked for the presence of the most common types of maintenance (corrective, adaptive or perfective).

We were also concerned about the tools that the developers used. For instance, what issue tracking tool was employed and does it show when major bugs are fixed? We also looked to see which concurrent versioning system is in use, as this is important to both industry and open source projects (Wilson et al., 2013). Effort toward clear, nonrepetitive code is also considered as evidence for maintainability. The results for maintainability are shown in Figure 6. The conclusions drawn from the data are as follows:

• R packages hosted by CRAN: The overall results for the R packages are positive. All of them provide a history of multiple versions of the software and give information about how the packages were checked. A few of them, like lavaan and lme4, give information about how to contribute. 9 out of 15 provide change logs; 4 out of 15 indicate use of an issue tracking tool (Tracker and GitHub) and concurrent versioning systems (SVN and GitHub). All of them consider maintainability in the code, with no obvious evidence of code clones.

• Commercial software: Because of the nature of commercial software, they did not show much evidence of maintainability. 2 out of 3 provide a history of multiple versions of the software and change logs, but other information is usually not provided by the vendors. In this case, the measure we give may not be an accurate reflection of the maintainability of the commercial software.

Figure 6: AHP result for maintainability

• Research group projects: Research group projects did not show much evidence that they pay attention to maintainability. 5 out of 12 provide the version history of their software; 2 out of 15 give information about how to contribute and 3 out of 15 provided change logs. None of them showed evidence of using issue tracking tools or a concurrent versioning system in their project.

3.2.8 Reusability

We checked reusability by checking whether any portion of the software is used by another package, or whether there is any evidence that reusability was considered in the design. The results for reusability of the psychometrics domain are as follows (see Figure 7 for the AHP results):

• R packages hosted by CRAN: R packages have great reusability. There is clear evidence that 12 out of 15 selected software packages have been reused by other software packages hosted by CRAN. Also, the Rd reference manual required by CRAN can serve as an API document (not strictly), which helps others to reuse the package.

• Commercial software: Due to the nature of commercial software, no accurate result can be presented here.

• Research group projects: For research projects, no evidence was found that their software packages have been reused. Two of them (ETIRM and SCPPNT) provide API documentation.

Figure 7: AHP result for reusability

3.2.9 Portability

We checked portability by checking which platforms the software is advertised to work on, how people are expected to handle portability, and whether evidence in the documentation shows portability has been achieved. A figure summarizing the AHP results is not provided in this case, since there is little deviation between the grades for the products within the same category. In summary, the results for portability of the psychometrics domain, based on the AHP result, are as follows:

• R packages hosted by CRAN: R packages handle portability consistently well. They are the top performers among the 30 selected software packages. They all keep different versions of the software package for different platforms. Also, software checking results from CRAN showed that the software can work well on different platforms.

• Commercial software: The selected commercial software packages are all suggested to work on Windows, with little evidence that portability has been achieved. However, this is not really a negative result, since it does not appear the vendors are interested in targeting other environments.

• Research group projects: The results here are similar to those for commercial software.

3.2.10 Understandability (Surface)

We checked how easy it is for people to understand the code. We focused on code formatting style, use of coding standards and comments, identifier and parameter style, use of constants, meaningful file names, whether code is modularized and whether there is a design document. The results for understandability of the psychometrics domain are summarized in Figure 8. The differences between the communities are as follows:

Figure 8: AHP result for understandability

• R packages hosted by CRAN: R packages do well on this quality, with several very good examples (FAiR, lavaan, lme4). The common problem in R packages is that no coding standard is provided and there are few comments in the code. If LP were used to maintain the code and its documentation together, understandability could be greatly improved.

• Commercial software: Due to the nature of commercial software, no accurate result can be presented here.

• Research group projects: 8 out of 12 projects are closed source. For the eight closed source projects, no measure of understandability is given. The common problem for the remaining open source projects is not specifying any coding standard.

3.2.11 Interoperability

We checked interoperability by looking for evidence of any external systems the software communicates or interoperates with, for a workflow that uses other software, and for external interactions (API) being clearly defined. The results for interoperability of the psychometrics domain follow (there is no interoperability figure here because the three groups have similar results between all members):

• R packages hosted by CRAN: The results of the R packages are much better than the other 2 types of software. R packages are positive in this quality because of the nature of R and the documentation requirements from CRAN. There is no obvious evidence that these software packages communicate with external systems, but it is very clear that existing workflows use other R packages.

• No sign of interoperability is found in commercial software and research group projects.

3.2.12 Visibility/Transparency

We checked visibility by looking for a description of the development process. We also give a subjective score for the ease of external examination relative to the average of the other products being considered. This score is intended to reflect how easy it is for other users to get useful information from the software website or documents. The results for visibility (transparency) of the psychometrics domain are as follows (see Figure 9 for the AHP results):

Figure 9: AHP result for visibility/transparency

• R packages hosted by CRAN: No software packages provide information about the development process, but packages like lavaan provide websites outside of CRAN with detailed information about the software, including examples and instructions. The benefit of the external websites is that they provide more information to users; the downside is that maintaining information in two different places can be difficult for developers. If an external website is used, developers need to make sure that all the websites are up to date, and that they each provide links to the other.

• Commercial software and research group projects: No software packages provide information about their development process for either of these categories, but information on the software is easily obtainable through websites and documentation.

3.2.13 Reproducibility

We checked reproducibility by looking for a record of the environment used for development and testing, test data for verification and automated tools used to capture experimental context. The results for reproducibility of the psychometrics domain are as follows: Only R packages provide a record of the environment for testing (15 out of 15); the other products do not explicitly address reproducibility. R packages also benefit from the use of Sweave for the vignettes (if present), since LP can be used to document reproducible results by maintaining the code and its documentation together.

3.2.14 Overall Ranking With Equal Weight Between Qualities

In our first overall analysis we use the same weight for each quality in our AHP table, as shown in Figure 10. In this case, closed source software packages ranked lower because open source software packages provide more information to developers for qualities like maintainability, understandability, etc. In the case where all qualities have equal weight, R packages are dramatically superior to the products in the other two categories.

Figure 10: Ranking with equal weight between all qualities

3.2.15 Overall Ranking With Different Quality Weightings

To minimize the influence of the natural difference between open source and closed source software, we gave correctness & verifiability, maintainability, reusability, understandability and reproducibility low weights (they are assigned a value of 1, while all other qualities are assigned 9). The best example is still an R package, but the overall results do not favor R as much as in the previous analysis. Figure 11 provides the AHP result in this case.

Figure 11: Ranking with minimal weight for correctness & verifiability, maintainability, reusability, understandability and reproducibility

4 Conclusion and Recommendations

For the surveyed software, R packages have advantages over the other languages for qualities related to development, such as maintainability, reusability, understandability and visibility. Commercial software, on the other hand, provides better usability, but does not show as much evidence of verification. In terms of quality measurements, research projects usually ranked lower and showed a larger variation in the quality measures between products.

The other two communities, especially the research project community, can learn from the success of CRAN. The quality of R extensions greatly benefits from using Rd, Sweave, R CMD check and the CRAN Repository Policy. The policy, together with the support tools, means that a single developer project can have the quality of something developed by a much larger team. A small research project cannot usually set up an extensive development infrastructure and process, but by being a part of CRAN, even small teams have something solid to build upon.

As strong as the R community is, there is still room for improvement. In particular, the documentation of an R extension seems to put too much trust in the development team. The documentation is aimed at teaching the use of the code to a new user, not at convincing other developers that the implementation is correct. This attitude is evidenced by the way LP is used by the R extensions we surveyed. Rather than maintaining the code and documentation together, in the spirit of Knuth's original LP, the LP tool Sweave is used to present examples of the usage of the new extension. Another area for improvement for the R community would be an increased usage of regression testing supported by automated testing tools. Regression testing can be used to ensure that updates have not broken something that previously worked.
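
As a concrete illustration of the kind of automated regression test meant here, the snippet below uses the testthat package, which is one option in the R ecosystem (an assumption, not a practice observed in the surveyed packages); the function under test is invented.

    library(testthat)

    test_that("difficulty estimates are unchanged by the update", {
      responses <- matrix(c(1, 0, 1, 1, 0, 1), nrow = 3)  # toy 0/1 data
      est <- estimate_difficulty(responses)                # hypothetical function
      expect_equal(length(est), ncol(responses))
      expect_true(all(est >= 0 & est <= 1))
    })

Tests like this can be re-run automatically after every change, so a regression is caught as soon as it is introduced.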

Further recommendations for improving the development of psychometric software (for all communities) are given below. The recommendations are given for each of the qualities considered in this study.

• Installability: For those that do not already do so, projects should provide detailed installation instructions (especially for research group projects) and a standard suite of test cases for verification of the installation.

• Correctness and Verifiability: The other communities should follow the example provided by CRAN, in terms of the organization of the reference manual, requirement specification, and information about the libraries used. With respect to user examples, commercial software usually does a good job here. For instance, flexMIRT shows a wide variety of examples of its usage. Although CRAN facilitates inclusion of examples, they are not generally required by repository policy. CRAN may consider increasing the number of required examples, possibly through the mechanism of vignettes, to match what is being done by commercial software.

• Reliability: The documented user instructions, including installation instructions, need to be kept in sync with the software as it is updated. This was an issue for some software packages, especially the non-commercial ones.

• Robustness: To enhance robustness, those programs that scored low on this quality should include additional testing to uncover the neglected cases. In many situations the graceful way to handle unexpected input is with a message to the user and a return to the previous program state. Exception handling, which is available using tryCatch in R, is a good programming technique to improve robustness (a sketch is given after this list).

• Usability: CRAN should require a detailed getting started tutorial in addition to, or as part of, the user manual. Commercial software should put more effort into designing a platform friendly GUI.

• Maintainability: For open source projects, a concurrent versioning system and issue tracking tool are strongly suggested. Information like a change log, or how to contribute, should also be presented.

• Reusability: For programs to be reusable, a well-documented API should be provided. If the generation of the documentation can be automated, there is a better chance that this will be done, and that it will be in sync with the code (see the documentation sketch after this list).

• Understandability: Where not currently given, coding standards and design documents should be provided. Developers should consider using LP for code development, so that the code can be written with the goal of human understanding, as opposed to computer understanding. The R community has the tools, such as Sweave, to facilitate this style of programming. Open source projects and commercial projects can look to tools like cweb, noweb, etc.

• Visibility and Transparency: Projects that anticipate the involvement of future developers should provide details about the development process adopted.

• Reproducibility: Software developers should explicitly track the environment used for development and testing, to make the software results more reproducible. Through the use of the R environment and Sweave, CRAN provides good examples of how to do this. However, the benefit of Sweave (or similar tools) would be improved if all CRAN developers were required to write vignettes for the package documentation (a small sketch of recording the environment is given below).
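
The following three short R sketches illustrate the robustness, reusability and reproducibility recommendations above. They are illustrative only: the file names, function names and the use of roxygen2-style comments are assumptions, not practices taken from the surveyed software.

    # Robustness: recover gracefully from bad input instead of crashing.
    result <- tryCatch(
      read.csv("scores.csv"),            # input file name is hypothetical
      error = function(e) {
        message("Could not read the input file: ", conditionMessage(e))
        NULL                             # return to a well-defined state
      }
    )

    # Reusability: API documentation generated from comments kept next to the
    # code (here using roxygen2-style comments, one possible tool).
    #' Estimate item difficulty.
    #'
    #' @param responses A 0/1 matrix with persons in rows and items in columns.
    #' @return A numeric vector with one difficulty estimate per item.
    #' @export
    estimate_difficulty <- function(responses) {
      1 - colMeans(responses)
    }

    # Reproducibility: record the R version, platform and package versions
    # alongside the reported results.
    writeLines(capture.output(sessionInfo()), "session-info.txt")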

References

Douglas Bates, Martin Maechler, Ben Bolker, Steven Walker, Rune Haubo, Bajesen Chris-tensen, Henrik Singmann, and Bin Dai. lme4: Linear mixed-effects models using Eigenand S4, 2014. URL http://cran.r-project.org/web/packages/lme4/index.html.R package version 1.1-7.

Berkeley. ConstructMap, 2012. URL http://bearcenter.berkeley.edu/software/

constructmap.

CRAN. The comprehensive r archive network, 2014a. URL http://cran.r-project.

org/.

CRAN. Cran task view: Psychometric models and methods, 2014b. URL http://cran.

at.r-project.org/web/views/Psychometrics.html.

Andrew P. Davison. Automated capture of experiment context for easier reproducibility incomputational research. Computing in Science and Engineering, 14(4):48–56, July-Aug2012.

Jan de Leeuw and Patrick Mair. anacor: Simple and Canonical Correspondence Analysis,2014. URL http://cran.r-project.org/web/packages/anacor/index.html. R pack-age version 1.0-4.

Page 26: Department of Computing and Software - McMaster … · Department of Computing and Software January 2015 ... 2015 Abstract Our study ... ternal lists: Wikipedia (2014), the National

CAS-15-01-SS 24

Dictionary. The free on-line dictionary of computing, Jun 2014. URL http://dictionary.reference.com/browse/commercialsoftware.

DIF. DIF-Pack, 2012. URL http://psychometrictools.measuredprogress.org/dif1.

DIM. DIM-Pack, 2012. URL http://psychometrictools.measuredprogress.org/dif2.

Thomas D. Fletcher. psychometric: Applied Psychometric Theory, 2013. URL http://cran.r-project.org/web/packages/psychometric/index.html. R package version 2.2.

flexMIRT. flexMIRT, 2014. URL https://flexmirt.vpgcentral.com/.

Matthias Gamer, Jim Lemon, Ian Fellows, and Puspendra Singh. irr: Various Coefficients of Interrater Reliability and Agreement, 2014. URL http://cran.r-project.org/web/packages/irr/index.html. R package version 0.84.

Marc-Oliver Gewaltig and Robert Cannon. Quality and sustainability of software tools in neuroscience. arXiv preprint arXiv:1205.3025, 2012.

Marc-Oliver Gewaltig and Robert Cannon. Current practice in software development for computational neuroscience and how to improve it. PLoS Computational Biology, 10(1):e1003376, 2014.

Carlo Ghezzi, Mehdi Jazayeri, and Dino Mandrioli. Fundamentals of Software Engineering. Prentice Hall, Upper Saddle River, NJ, USA, 2nd edition, 2002.

Ben Goodrich. FAiR: Factor Analysis in R, 2014. URL http://cran.r-project.org/web/packages/FAiR/index.html.

Kyung T. Han. IRTEQ: Windows Application that Implements IRT Scaling and Equating, Version 1.1, 2011. URL http://www.hantest.net/irteq.

Kyung T. Han. WinGen3, 2013. URL http://www.hantest.net/wingen.

Norman Herr. Summary of Don Norman's design principles, 2002. URL http://www.csun.edu/science/courses/671/bibliography/preece.html.

IATA. IATA, 2014. URL http://www.polymetrika.org/IATA.

jMetrik. jMetrik, 2014. URL http://www.itemanalysis.com/index.php.

Thomas Kiefer, Alexander Robitzsch, and Margaret Wu. TAM: Test Analysis Modules, 2014. URL http://cran.r-project.org/web/packages/TAM/index.html. R package version 1.0-3.

D. E. Knuth. Literate programming. The Computer Journal, 27(2):97–111, 1984. doi:10.1093/comjnl/27.2.97. URL http://comjnl.oxfordjournals.org/content/27/2/97.abstract.


Andrew M. St. Laurent. Understanding Open Source and Free Software Licensing. O'Reilly Media, Inc., 2004. ISBN 0596005814.

Friedrich Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In Wolfgang Härdle and Bernd Rönz, editors, Compstat 2002 — Proceedings in Computational Statistics, pages 575–580. Physica Verlag, Heidelberg, 2002. URL http://www.stat.uni-muenchen.de/~leisch/Sweave. ISBN 3-7908-1517-9.

Tie Liang, Kyung T. Han, and Ronald K. Hambleton. ResidPlots-2, 2008. URL http://www.umass.edu/remp/software/residplots/.

LINFO. Freeware definition by the Linux Information Project (LINFO), 2006. URL http://www.linfo.org/freeware.html.

Patrick Mair, Reinhold Hatzinger, and Marco J. Maier. eRm: Extended Rasch Modeling, 2014. URL http://cran.r-project.org/web/packages/eRm/index.html. R package version 0.15-4.

MINIFAC. MINIFAC, 2014. URL http://www.winsteps.com/minifac.htm.

MINISTEP. MINISTEP, 2014. URL http://www.winsteps.com/ministep.htm.

NCME. Psychometric software database — National Council on Measurement in Education, 2014. URL http://ncme.org/resource-center/psychometric-software-database/.

Nedialko S. Nedialkov. Implementing a Rigorous ODE Solver through Literate Programming. Technical Report CAS-10-02-NN, Department of Computing and Software, McMaster University, 2010.

Martyn Plummer, Nicky Best, Kate Cowles, Karen Vines, Deepayan Sarkar, and Russell Almond. coda: Output analysis and diagnostics for MCMC, 2014. URL http://cran.r-project.org/web/packages/coda/index.html. R package version 0.16-1.

R Core Team. Writing R extensions, 2014. URL http://cran.r-project.org/doc/manuals/r-release/R-exts.html.

Gilles Raiche and David Magis. nFactors: Parallel Analysis and Non Graphical Solutions to the Cattell Scree Test, 2014. URL http://cran.r-project.org/web/packages/nFactors/index.html. R package version 2.3.3.

William Revelle. psych: Procedures for Psychological, Psychometric, and Personality Research, 2014. URL http://cran.r-project.org/web/packages/psych/index.html. R package version 1.4.5.

Dimitris Rizopoulos. ltm: Latent Trait Models under IRT, 2014. URL http://cran.r-project.org/web/packages/ltm/index.html. R package version 1.0-0.


Yves Rosseel, Daniel Oberski, Jarret Byrnes, Leonard Vanbrabant, Victoria Savalei, Ed Merkle, Michael Hallquist, Mijke Rhemtulla, Myrsini Katsikatsou, and Mariska Barendse. lavaan: Latent Variable Analysis, 2014. URL http://cran.r-project.org/web/packages/lavaan/index.html. R package version 0.5-16.

Lawrence M. Rudner. PARAM, 2012. URL http://echo.edres.org:8080/irt/param/.

T.L. Saaty. How to make a decision: The analytic hierarchy process. Interfaces, 24(6):19–43, 1994.

TAP. TAP: Test Analysis Program, 2014. URL http://www.ohio.edu/people/brooksg/software.htm#TAP.

Evangelos Triantaphyllou and Stuart H. Mann. Using the analytic hierarchy process for decision making in engineering applications. International Journal of Industrial Engineering: Applications and Practice, 2(1):35–44, 1995.

L. Andries van der Ark. mokken: Mokken Scale Analysis in R, 2013. URL http://cran.r-project.org/web/packages/mokken/index.html. R package version 2.7.5.

Hadley Wickham. testthat: Get started with testing. The R Journal, 3:5–10, 2011. URL http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

Wikipedia. Psychometric software — Wikipedia, the free encyclopedia, 2014. URL http://en.wikipedia.org/w/index.php?title=Psychometric_software&oldid=607324773.

John T. Willse. mixRasch: Mixture Rasch Models with JMLE, 2014. URL http://cran.r-project.org/web/packages/mixRasch/. R package version 1.1.

Greg Wilson, D.A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H.D. Haddock, Kathryn D. Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, and Paul Wilson. Best practices for scientific computing. CoRR, abs/1210.0530, 2013.

Werner Wothke. Simple C++ Numerical Toolkit, 2007. URL http://www.smallwaters.com/software/cpp/scppnt.html.

Werner Wothke. ETIRM: Estimation Toolkit for Item Response Models, 2008. URL http://www.smallwaters.com/software/cpp/etirm.html.

Thomas W. Yee. VGAM: Vector Generalized Linear and Additive Models, 2013. URL http://cran.r-project.org/web/packages/VGAM/index.html. R package version 0.9-4.


Appendices

A Full grading template

The table below lists the full set of measures that are assessed for each software product. The measures are grouped under headings for each quality, plus one heading for summary information. Following each measure, the type for a valid result is given in brackets. Many of the types are given as enumerated sets. For instance, the response on many of the questions is one of “yes,” “no,” or “unclear.” The type “number” means natural number, a positive integer. The types for date and url are not explicitly defined, but they are what one would expect from their names. In some cases the response for a given question is not necessarily limited to one answer, such as the question on what platforms are supported by the software product. Cases like this are indicated by “set of” preceding the type of an individual answer. The type in these cases is then the power set of the individual response type. In some cases a superscript ∗ is used to indicate that a response of this type should be accompanied by explanatory text. For instance, if problems were caused by uninstall, the reviewer should note what problems were caused.
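As a purely illustrative aside (not part of the grading template itself), these response types can be thought of as follows in R: an enumerated type corresponds to a factor with fixed levels, and a “set of” type to a subset of the allowed values.

# Illustrative encoding of two template measures.
status <- factor("alive", levels = c("alive", "dead", "unclear"))   # enumerated type
allowed_platforms <- c("Windows", "Linux", "OS X", "Android", "Other OS")
platforms <- c("Windows", "Linux")          # one element of the power set
stopifnot(all(platforms %in% allowed_platforms))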

Table 5: Grading Template

Summary Information

Software name? (string)
URL? (url)
Educational institution (string)
Software purpose (string)
Number of developers (number)
How is the project funded (string)
Number of downloads for current version (number)
Release date (date)
Last updated (date)
Status ({alive, dead, unclear})
License ({GNU GPL, BSD, MIT, terms of use, trial, none, unclear})
Platforms (set of {Windows, Linux, OS X, Android, Other OS})
Category ({concept, public, private})
Development model ({open source, freeware, commercial})
Publications using the software (set of url)
Publications about the software (set of url)
Is source code available? ({yes, no})
Programming language(s) (set of {FORTRAN, Matlab, C, C++, Java, R, Ruby, Python, Cython, BASIC, Pascal, IDL, unclear})

Installability (Measured via installation on a virtual machine.)


Are there installation instructions? ({yes, no})
Are the installation instructions linear? ({yes, no, n/a})
Is there something in place to automate the installation? ({yes∗, no})
Is there a specified way to validate the installation, such as a test suite? ({yes∗, no})
How many steps were involved in the installation? (number)
How many software packages need to be installed before or during installation? (number)
Run uninstall, if available. Were any obvious problems caused? ({unavail, yes∗, no})
Overall impression? ({1 .. 10})

Correctness and Verifiability

Are external libraries used? ({yes∗, no, unclear})
Does the community have confidence in this library? ({yes, no, unclear})
Any reference to the requirements specifications of the program? ({yes∗, no, unclear})
What tools or techniques are used to build confidence of correctness? (string)
If there is a getting started tutorial, is the output as expected? ({yes, no∗, n/a})
Overall impression? ({1 .. 10})

Surface Reliability

Did the software “break” during installation? ({yes∗, no})
Did the software “break” during the initial tutorial testing? ({yes∗, no, n/a})
Overall impression? ({1 .. 10})

Surface Robustness

Does the software handle garbage input reasonably? ({yes, no∗})
For any plain text input files, if all new lines are replaced with new lines and carriage returns, will the software handle this gracefully? ({yes, no∗, n/a})
Overall impression? ({1 .. 10})

Surface Performance

Is there evidence that performance was considered? ({yes∗, no})
Overall impression? ({1 .. 10})

Surface Usability

Is there a getting started tutorial? ({yes, no})
Is there a standard example that is explained? ({yes, no})
Is there a user manual? ({yes, no})
Does the application have the usual “look and feel” for the platform it is on? ({yes, no∗})
Are there any features that show a lack of visibility? ({yes, no∗})
Are expected user characteristics documented? ({yes, no})
What is the user support model? (string)
Overall impression? ({1 .. 10})

Maintainability

Is there a history of multiple versions of the software? ({yes, no, unclear})
Is there any information on how code is reviewed, or how to contribute? ({yes∗, no})
Is there a changelog? ({yes, no})
What is the maintenance type? (set of {corrective, adaptive, perfective, unclear})
What issue tracking tool is employed? (set of {Trac, JIRA, Redmine, e-mail, discussion board, sourceforge, google code, git, none, unclear})
Are the majority of identified bugs fixed? ({yes, no∗, unclear})
Which version control system is in use? ({svn, cvs, git, github, unclear})
Is there evidence that maintainability was considered in the design? ({yes∗, no})
Are there code clones? ({yes∗, no, unclear})
Overall impression? ({1 .. 10})

Reusability

Are any portions of the software used by another package? ({yes∗, no})
Is there evidence that reusability was considered in the design? (API documented, web service, command line tools, ...) ({yes∗, no, unclear})
Overall impression? ({1 .. 10})

Portability

What platforms is the software advertised to work on? (set of {Windows, Linux, OS X, Android, Other OS})
Are special steps taken in the source code to handle portability? ({yes∗, no, n/a})
Is portability explicitly identified as NOT being important? ({yes, no})
Convincing evidence that portability has been achieved? ({yes∗, no})
Overall impression? ({1 .. 10})

Surface Understandability (Based on 10 random source files)

Consistent indentation and formatting style? ({yes, no, n/a})
Explicit identification of a coding standard? ({yes∗, no, n/a})
Are the code identifiers consistent, distinctive, and meaningful? ({yes, no∗, n/a})
Are constants (other than 0 and 1) hard coded into the program? ({yes, no∗, n/a})
Comments are clear, indicate what is being done, not how? ({yes, no∗, n/a})
Is the name/URL of any algorithms used mentioned? ({yes, no∗, n/a})
Parameters are in the same order for all functions? ({yes, no∗, n/a})
Is code modularized? ({yes, no∗, n/a})
Descriptive names of source code files? ({yes, no∗, n/a})
Is a design document provided? ({yes∗, no, n/a})
Overall impression? ({1 .. 10})

Interoperability

Does the software interoperate with external systems? ({yes∗, no})
Is there a workflow that uses other software? ({yes∗, no})
If there are external interactions, is the API clearly defined? ({yes∗, no, n/a})
Overall impression? ({1 .. 10})


Visibility/Transparency

Is the development process defined? If yes, what process is used. ({yes∗, no, n/a})
Ease of external examination relative to other products considered? ({1 .. 10})
Overall impression? ({1 .. 10})

Reproducibility

Is there a record of the environment used for their development and testing? ({yes∗, no})
Is test data available for verification? ({yes, no})
Are automated tools used to capture experimental context? ({yes∗, no})
Overall impression? ({1 .. 10})


B Summary of Measurements

B.1 Installability

Name II II linear AI Inst Valid # of Steps # S/w lib uninst

eRm yes yes yes no 2 1 no
Psych yes yes yes no 4 4 no
mixRasch general yes yes no 2 1 no
irr general yes yes no 2 2 no
nFactors general yes yes no 2 4 no
coda general yes yes no 2 1 no
VGAM general yes yes no 2 4 no
TAM yes yes yes no 2 7 no
psychometric general yes yes no 2 3 no
ltm general yes yes no 2 3 no
anacor yes yes yes no 2 5 no
FAiR yes yes yes no 3 7 no
lavaan yes yes yes yes 2 13 no
lme4 yes yes yes no 2 16 no
mokken yes yes yes no 2 1 no
ETIRM yes no yes no 4 2 n/a
SCPPNT no n/a yes no 1 0 n/a
jMetrik yes yes yes no 1 0 no
ConstructMap yes yes yes no 1 1 no
TAP yes yes yes no 1 1 no
DIF-Pack yes yes yes no 5 1 no
DIM-Pack yes yes yes no 5 1 no
ResidPlots-2 no n/a yes no 1 1 no
WinGen3 yes yes yes no 2 2 no
IRTEQ no n/a yes no 2 2 no
PARAM no n/a yes yes 1 1 no
IATA no n/a yes no 2 2 no
MINISTEP yes yes yes no 1 1 no
MINIFAC yes yes yes no 1 1 no
flexMIRT yes yes yes no 1 1 no

Table 6: Installability – II: Installation instructions available, II linear: Linear installation instructions, AI: Automated Installation, Inst Valid: Tests for installation validation, # S/W lib: Number of software/libraries required for installation, uninst: Any uninstallation problem?


B.2 Correctness and Verifiability

Name Library SRS Evidence? Example

eRm yes (Cran) yes manual yes
Psych yes (Cran) yes manual yes
mixRasch yes (Cran) no manual yes (no expected output)
irr yes (Cran) no manual yes (no expected output)
nFactors yes (Cran) no manual yes (no expected output)
coda yes (Cran) yes manual yes (no expected output)
VGAM yes (Cran) yes manual yes (no expected output)
TAM yes (Cran) no manual yes
psychometric yes (Cran) no manual yes (no expected output)
ltm yes (Cran) yes manual small difference in answers
anacor yes (Cran) yes manual yes (no expected output)
FAiR yes (Cran) yes manual yes (no expected output)
lavaan yes (Cran) yes manual yes
lme4 yes (Cran) yes manual yes
mokken yes (Cran) yes manual yes
ETIRM yes (SCPPNT) no no yes (no expected output)
SCPPNT no no no yes
jMetrik yes (psychometric) no no yes (not from latest version)
ConstructMap unclear no test case yes (output not clear)
TAP unclear no no yes (no expected output)
DIF-Pack unclear no no yes (no expected output)
DIM-Pack unclear no no no
ResidPlots-2 unclear no no Example data is not given
WinGen3 unclear no no yes (no expected output)
IRTEQ unclear no no yes
PARAM unclear no no yes (no expected output)
IATA unclear no no yes (no expected output)
MINISTEP unclear no no yes
MINIFAC unclear no no yes
flexMIRT unclear no no yes

Table 7: Correctness and Verifiability – Library: Use of standard libraries, SRS: Software Requirements Specification, Evidence?: Evidence to build confidence?, Example: Standard Example explained.


B.3 Reliability, Robustness and Performance

Name install test Wrong I/P handling Format Perf

eRm yes no yes n/a unclear
Psych yes no yes n/a unclear
mixRasch no no yes n/a unclear
irr no no unclear n/a unclear
nFactors no no unclear n/a unclear
coda no no unclear n/a unclear
VGAM no no unclear n/a unclear
TAM no no yes n/a unclear
psychometric no no yes n/a unclear
ltm no no yes n/a unclear
anacor yes no unclear n/a unclear
FAiR yes yes yes n/a unclear
lavaan no no yes n/a unclear
lme4 no no yes n/a unclear
mokken no no yes n/a unclear
ETIRM no unclear unclear n/a unclear
SCPPNT yes no yes n/a unclear
jMetrik yes no yes yes unclear
ConstructMap no no no n/a unclear
TAP no no yes n/a unclear
DIF-Pack no no yes n/a unclear
DIM-Pack no unclear yes n/a unclear
ResidPlots-2 no unclear unclear n/a unclear
WinGen3 no no yes n/a unclear
IRTEQ no no no n/a unclear
PARAM no no no n/a unclear
IATA no no yes n/a unclear
MINISTEP no no no no unclear
MINIFAC no no yes yes unclear
flexMIRT no no yes n/a unclear

Table 8: Reliability, Robustness, and Performance – install: Software breaks during installation, test: Software breaks during initial tutorial testing, Wrong I/P handling: Handling of wrong input by software, Format: Can the software gracefully handle a change of the format of text input files where the end of line follows a different convention?, Perf: Evidence that performance was considered?


B.4 Usability

Name Tut Ex UM Look/feel Vis User Char User support

eRm no yes yes yes yes yes Forum/Email
Psych yes yes yes yes yes no Email
mixRasch no yes yes yes yes no Email
irr no yes yes yes yes no Email
nFactors no yes yes yes yes no Email
coda no yes yes yes yes no Email
VGAM no yes yes yes yes no Email
TAM yes yes yes yes yes no Email
psychometric no yes yes yes yes no Email
ltm no yes yes yes yes no Email/FAQ
anacor no yes yes yes yes no Email/Forum
FAiR no yes yes yes yes no Email
lavaan yes yes yes yes yes no Email/Discussion group
lme4 no yes yes yes yes no Email
mokken yes yes yes yes yes yes Email
ETIRM no yes no no yes no Email
SCPPNT no no no no yes no Email
jMetrik yes yes no yes no yes Email/FAQ/Tech support
ConstructMap yes yes yes yes no no Email/FAQ
TAP yes yes yes yes no yes Email
DIF-Pack yes yes yes no no no Email/Discussion group
DIM-Pack no no no no no no Email/Discussion group
ResidPlots-2 no yes yes no no no Email
WinGen3 yes yes yes yes no no Email/Survey
IRTEQ yes yes yes yes no no Email
PARAM no no yes no no no no
IATA yes yes yes no no yes Email
MINISTEP yes yes yes no no no Forum/Feedback/Email
MINIFAC yes yes yes no no no Forum/Feedback/Email
flexMIRT yes yes yes yes no no Email

Table 9: Usability – Tut: Getting started tutorial, Ex: Standard Example, UM: User Manual, Look/feel: Usual look and feel of software, Vis: Lack of visibility (Norman's Principle), User Char: User characteristics documented.


B.5 Maintainability

Name VH RC log MT Issue Bugs VS Evid Clone

eRm yes yes yes C Tracker yes svn yes
Psych yes check yes C/P unclear yes unclear yes no
mixRasch yes check no unclear unclear unclear unclear yes no
irr yes check no unclear unclear unclear unclear yes no
nFactors yes check no unclear unclear unclear unclear yes no
coda yes check yes C/P unclear yes unclear yes no
VGAM yes check yes C/P unclear yes unclear yes no
TAM yes check yes C/P unclear yes unclear yes no
psychometric yes check no unclear unclear unclear unclear yes no
ltm yes check yes C/P unclear yes unclear yes no
anacor yes yes no unclear Tracker yes svn yes no
FAiR yes check yes C/P unclear yes unclear yes no
lavaan yes yes yes C/P git yes git yes no
lme4 yes yes yes C/P git yes git yes no
mokken yes check no unclear no unclear unclear yes no
ETIRM yes no no unclear no unclear unclear yes no
SCPPNT yes no yes P unclear yes unclear yes no
jMetrik yes no yes C/P unclear unclear unclear unclear unclear
ConstructMap yes no yes C/P unclear unclear unclear unclear unclear
TAP yes no no unclear unclear unclear unclear unclear unclear
DIF-Pack no contri no unclear unclear unclear unclear no no
DIM-Pack no contri no unclear unclear unclear unclear no no
ResidPlots-2 no no no unclear unclear unclear unclear unclear unclear
WinGen3 no no no unclear unclear unclear unclear unclear unclear
IRTEQ no no no unclear unclear unclear unclear unclear unclear
PARAM no no no unclear unclear unclear unclear unclear unclear
IATA no no no unclear unclear unclear unclear unclear unclear
MINISTEP yes no yes C/P unclear unclear unclear unclear unclear
MINIFAC yes no yes C/P unclear unclear unclear unclear unclear
flexMIRT no no no unclear unclear unclear unclear unclear unclear

Table 10: Maintainability – VH: Versions history available, RC: Information on reviewing and contributing, log: Change log available, MT: Maintenance type, Issue: Issue tracking tool, Bugs: Majority of bugs fixed, VS: Versioning system used, Evid: Any evidence that maintainability was considered in design, C: Corrective, P: Perfective, A: Adaptive, Clone: Are there code clones?


B.6 Reusability

Name S/W reused by any package Any evidence of reusability

eRm yes yes (API)
Psych yes yes (API)
mixRasch unclear yes (API)
irr yes yes (API)
nFactors yes yes (API)
coda yes yes (API)
VGAM yes yes (API)
TAM yes yes (API)
psychometric yes yes (API)
ltm yes yes (API)
anacor no yes (API)
FAiR no yes (API)
lavaan yes yes (API)
lme4 yes yes (API)
mokken yes yes (API)
ETIRM no yes
SCPPNT yes yes
jMetrik no no
ConstructMap no no
TAP no no
DIF-Pack no no
DIM-Pack no no
ResidPlots-2 no no
WinGen3 no no
IRTEQ no no
PARAM no no
IATA no no
MINISTEP no no
MINIFAC no no
flexMIRT no no

Table 11: Reusability


B.7 Portability

Name Platforms Port in code Port not imp Evid in doc

eRm Win/Mac/Linux Branch in repository no yes
Psych Win/Mac/Linux Branch in repository no yes
mixRasch Win/Mac/Linux Branch in repository no yes
irr Win/Mac/Linux Branch in repository no yes
nFactors Win/Mac/Linux Branch in repository no yes
coda Win/Mac/Linux Branch in repository no yes
VGAM Win/Mac/Linux Branch in repository no yes
TAM Win/Mac/Linux Branch in repository no yes
psychometric Win/Mac/Linux Branch in repository no yes
ltm Win/Mac/Linux Branch in repository no yes
anacor Win/Mac/Linux Branch in repository no yes
FAiR Win/Mac/Linux Branch in repository no yes
lavaan Win/Mac/Linux Branch in repository no yes
lme4 Win/Mac/Linux Branch in repository no yes
mokken Win/Mac/Linux Branch in repository no yes
ETIRM Win/Mac/Linux no no no
SCPPNT Win/Mac/Linux no no no
jMetrik Win/Mac/Linux n/a no unclear
ConstructMap Win/Mac/Linux n/a no unclear
TAP Win n/a no no
DIF-Pack Win unclear no no
DIM-Pack Win unclear no no
ResidPlots-2 Win n/a no no
WinGen3 Win n/a no no
IRTEQ Win n/a no no
PARAM Win n/a no no
IATA Win n/a no no
MINISTEP Win n/a no yes
MINIFAC Win n/a no yes
flexMIRT Win n/a no no

Table 12: Portability – Platforms: Platforms specified for the software to work on, Port in code: How portability is handled (if source code given), Port not imp: Portability explicitly identified as not important, Evid in doc: Convincing evidence present in documentation for portability?


B.8 Understandability (of code)

Name Frmt Std Id Const PC Algo Mod Para SC DD

eRm yes no yes no no no yes yes yes yes
Psych yes no yes no no no yes yes yes yes
mixRasch yes no yes no no no yes yes yes yes
irr yes no yes no no no yes yes yes yes
nFactors yes no yes no no no yes yes yes yes
coda yes no yes no no no yes yes yes yes
VGAM yes no yes no no no yes yes yes yes
TAM yes no yes no yes no yes yes yes yes
psychometric yes no yes no no no yes yes yes yes
ltm yes no yes no no no yes yes yes yes
anacor no no yes no yes yes yes yes yes yes
FAiR yes no yes no yes yes yes yes yes yes
lavaan yes no yes no yes yes yes yes yes yes
lme4 yes no yes no yes yes yes yes yes yes
mokken yes no yes no yes no yes yes yes yes
ETIRM yes no yes no yes no yes yes yes yes
SCPPNT yes no yes no yes no yes yes yes yes
jMetrik n/a n/a n/a n/a n/a n/a n/a n/a n/a no
ConstructMap n/a n/a n/a n/a n/a n/a n/a n/a n/a no
TAP n/a n/a n/a n/a n/a n/a n/a n/a n/a no
DIF-Pack yes no yes yes yes no yes no yes no
DIM-Pack yes no yes yes yes no yes no yes no
ResidPlots-2 n/a n/a n/a n/a n/a n/a n/a n/a n/a no
WinGen3 n/a n/a n/a n/a n/a n/a n/a n/a n/a no
IRTEQ n/a n/a n/a n/a n/a n/a n/a n/a n/a no
PARAM n/a n/a n/a n/a n/a n/a n/a n/a n/a no
IATA n/a n/a n/a n/a n/a n/a n/a n/a n/a no
MINISTEP n/a n/a n/a n/a n/a n/a n/a n/a n/a no
MINIFAC n/a n/a n/a n/a n/a n/a n/a n/a n/a no
flexMIRT n/a n/a n/a n/a n/a n/a n/a n/a n/a no

Table 13: Understandability (of code) – Frmt: Consistent indentation and formatting, Std: Explicit coding standard, Id: Distinctive, meaningful identifier names, Const: Constants (other than 0 or 1) hard coded, PC: Proper comments, Algo: Reference to algorithm used, Mod: Code is modularised, Para: Parameters are in same order, SC: Descriptive names of source code files, DD: Design document present.


B.9 Interoperability

Name External package Workflow uses other s/w API

eRm no yes yes
Psych no yes yes
mixRasch no yes yes
irr no yes yes
nFactors no yes yes
coda no yes yes
VGAM no yes yes
TAM no yes yes
psychometric no yes yes
ltm no yes yes
anacor no yes yes
FAiR no yes yes
lavaan no yes yes
lme4 no yes yes
mokken no yes yes
ETIRM no no no
SCPPNT no no no
jMetrik no no no
ConstructMap no no no
TAP no no no
DIF-Pack no no no
DIM-Pack no no no
ResidPlots-2 no no no
WinGen3 no no no
IRTEQ no no no
PARAM no no no
IATA no no no
MINISTEP no no no
MINIFAC no no no
flexMIRT no no no

Table 14: Interoperability – External package: Software communicates with external package, Workflow uses other s/w: Workflow uses other software, API: External interactions (API) defined?


B.10 Visibility/Transparency

Name Dev process defined? Ease of ext exam

eRm no 9
Psych no 8
mixRasch no 9
irr no 9
nFactors no 9
coda no 10
VGAM no 10
TAM no 10
psychometric no 9
ltm no 9
anacor no 8
FAiR no 7
lavaan no 10
lme4 no 9
mokken no 9
ETIRM no 6
SCPPNT no 6
jMetrik no 7
ConstructMap no 7
TAP no 6
DIF-Pack no 6
DIM-Pack no 6
ResidPlots-2 no 5
WinGen3 no 7
IRTEQ no 7
PARAM no 6
IATA no 7
MINISTEP no 8
MINIFAC no 8
flexMIRT no 8

Table 15: Visibility and Transparency – Dev process defined: Development process defined, Ease of ext exam: Ease of examination relative to other software (out of 10).


B.11 Reproducibility

Name Dev env record Test data for verification Auto tools

eRm only for testing no no
Psych only for testing no no
mixRasch only for testing no no
irr only for testing no no
nFactors only for testing no no
coda only for testing no no
VGAM Snap only for testing no no
TAM only for testing no no
psychometric only for testing no no
ltm only for testing no no
anacor only for testing no no
FAiR only for testing no no
lavaan only for testing no no
lme4 only for testing no no
mokken only for testing no no
ETIRM no no no
SCPPNT no no no
jMetrik no no no
ConstructMap no no no
TAP no no no
DIF-Pack no no no
DIM-Pack no no no
ResidPlots-2 no no no
WinGen3 no no no
IRTEQ no no no
PARAM no no no
IATA no no no
MINISTEP no no no
MINIFAC no no no
flexMIRT no no no

Table 16: Reproducibility – Dev env record: Record of development environment, Test data for verification: Availability of test data for verification, Auto tools: Automated tools (like MADAGASCAR) used to capture experimental data.