pragmatic software reuse in bioinformatics: how can social ...niunn/courses/icsr16/... · keywords:...

Pragmatic Software Reuse in Bioinformatics:How Can Social Network Information Help?

Xiaoyu Jin∗, Charu Khatwani∗, Nan Niu∗, and Juha Savolainen†

∗Department of EECS, University of Cincinnati, Cincinnati, OH, USA 45221†Head of Software Architecture, Roche Diagnostics, Rotkreuz, Switzerland, CH-6343{jinxu,khatwacu}@mail.uc.edu, [email protected], [email protected]

Abstract. Little is known about the specific kinds of questions thatbioinformatics programmers ask during pragmatic software reuse tasksand how well development online social networks help answer those ques-tions. To fill the gap, we report an empirical study involving 20 biomed-ical software developers performing reuse tasks. A key contribution ofour study is the discovery of 31 questions needed to be addressed, whichwe further classify into 5 categories along a software-architecture-centricand problem-domain-centric spectrum. Our study further provides evi-dence for the positive effect of social network information on pragmaticreuse tasks. Our work can lead to enhanced tool support so as to improvebiomedical software reuse in practice.

Keywords: Pragmatic software reuse, architecture-centric reuse, biomed-ical software, information needs, social network information.

1 Introduction

Software is a critical enabler to advance our understandings and make innova-tive discoveries in biomedicine. In fact, the software engineering challenges havegrown so immense that, in the United States, for example, the primary biomedi-cal and health-related funding agency — the National Institutes of Health (NIH)— began investigating ways to better discover software, namely to greatly facil-itate the biomedical research community to locate and reuse software [5].

An essential challenge here is pragmatic software reuse that uses the soft-ware artifacts which were not necessarily developed with reuse in mind [33].In contrast to pre-planned software reuse such as product line engineering [14],pragmatic reuse recognizes the opportunistic and exploratory nature of reuse de-cisions manifested in practices like copy-paste-modify code. A unique aspect inthis domain is that the programmers are often researchers whose principal train-ing area is not software engineering but biomedical related fields. In the dailywork of these bioinformatics researchers, pre-planned software reuse may not beinstrumented or enforced, leaving pragmatic reuse the only feasible option.

Current approaches to pragmatic reuse attempt to support the developer’sexplicit recording of a reuse plan and automatically enact certain steps of theplan [26], define metrics to indicate the effects of reuse on project performance [20,

29], and assist in the reusable component extraction by iteratively analyzing thestructural complexity of code [15]. In addition, various kinds of code searchand recommendation mechanisms are proposed [19, 25, 27, 28, 32], focusing onwhite-box reuse where existing code needs internal modifications so as to fit thetarget system. Despite the contemporary support, pragmatic software reuse re-mains a difficult endeavor. Among the salient challenges are the dependenciessurrounding the reusable code and the breakdowns experienced when the codeis integrated into the target system [33].

Such issues can be regarded as instances of architectural mismatch [23] rep-resenting a persistent difficulty in software reuse [24]. Architectural mismatchstems from the incompatible assumptions that each part of reuse had madeabout its operating environment. Although pragmatic reuse is often labeled adhoc [26], we believe it should not be performed without explicit architectural con-siderations. Existing pragmatic reuse approaches, however, have not thoroughlyexamined the role of software architecture in pragmatic reuse.

To fill the gap, we investigate in this paper the kinds of architectural knowl-edge involved in pragmatic reuse. We conduct a controlled experiment to elicitthe information needs from the bioinformatics programmers carrying out soft-ware reuse tasks. Our results drawn from 20 participants suggest 31 specificquestions, which we group along a software-architecture-centric and problem-domain-centric spectrum. Furthermore, we test the extent to which developmentonline social networks support the needs of the reuse tasks.

The contributions of our work lie in the discovery and codification of theinformation needs in pragmatic software reuse, as well as the positive evidencegained on the use of social network information to satisfy those needs. In whatfollows, we review related work in Section 2. Section 3 presents our study de-sign, Section 4 analyzes the results, and finally, Section 5 discusses our work’simplications and draws some concluding remarks.

2 Background and Related Work

2.1 Biomedical Software Discovery

Aimed to advance the understanding of human health and disease through har-vesting the wealth of information in biomedical data, the NIH launched the BigData to Knowledge (BD2K) initiative in 2012 [2]. While progresses in areas likedata accessibility are made, challenges persist for the biomedical community todiscover software in an effective and efficient manner. To tackle the problem, inMay of 2014, about 40 people from government, academia, and industry partic-ipated in a workshop on biomedical software discovery and released a report [5].

In this report, software discoverability is defined as the ability to locate, cite,and reuse software. Although, in theory, every biomedical software is knownand therefore discoverable, at least by its original developer(s), the improvedsoftware discoverability is much needed in practice by four specific stakeholders:(1) developers who face challenges measuring their software’s adoption, (2) userswho have difficulty in software reuse, (3) publishers who lack a consistent way

to handle software citations, and (4) funders who struggle to make informeddecisions about which software development projects to support.

The report then focuses mainly on proposing a global indexing solution. Thesolution is delineated as an automated, broadly accessible system allowing forcomprehensive identification of biomedical software. Central to the solution is as-signing each software tool with a unique identifier [5]. While such a proposal canfacilitate how biomedical software is disseminated and cited, serious drawbacksare pointed out by the commenters shortly after the release of the report [4]. Acommon criticism is about scalability and argues against the development of abrand new indexing capability to cover a broad biomedical software spectrum.

Another major problem with the unique-identifier indexing proposal is theconfusion between a software paper and the software itself [4]. In fact, somepublishers began archiving papers describing research software. Elsevier, forinstance, launched a new journal called SoftwareX in 2015 [6], aiming to dis-seminate software applications, tools, and libraries in various domains includingmedical and biological sciences. However, the unique identifier/index (e.g., a DOIlinking to a SoftwareX article) captures rather the static metadata about the en-tire software (e.g., weblink to code repository, legal license, support email, etc.)than the dynamic usage information about specific part(s) of the software. In an-other word, what a global indexer [5] supports is black-box software use insteadof white-box software reuse. Our objective is to tackle directly the pragmaticsoftware reuse challenges faced by the biomedical community.

2.2 Pragmatic Software Reuse

Reuse attempts to improve software quality and developer productivity by lever-aging existing artifacts and knowledge [30]. Two approaches can be distinguishedin terms of how the reusable artifacts are created and used. Pre-planned ap-proaches, such as object-oriented inheritance and product line engineering, ex-plicitly build artifacts for reuse so that subsequent software product/systemdevelopment can be carried out with reuse. In contrast, pragmatic approaches,such as code scavenging [30] and opportunistic programming [12], facilitate thereuse of software artifacts that were not necessarily designed for reuse [33]. Whilethe distinction is not always clearcut, a key difference is that pre-planned ap-proaches assume that a reusable part exists that either fits perfectly or thatthe target system can be adapted to make it fit whereas a pragmatic approachassumes that the reusable itself is a legitimate target for modification [26].

Maras et al. [33] identified 3 steps involved in pragmatic software reuse basedon their experience of Web application development: locating the source code ofan individual feature, analyzing and modifying the code, and integrating codeinto the target system. These steps are in line with the process model describedby Holmes and Walker [26]. In [26], a tool named Gilligan was introduced tosupport Java developer’s recording of a pragmatic reuse plan. Moreover, Gilliganhelped automate simple cycles of the plan (e.g., copy a manually found element,paste it in a manually determined location, flag syntactic warnings, etc.). Theexperiments with 16 participants (2 undergraduates, 7 graduate students, and

7 industrial developers) using Gilligan showed that, compared to the locationof reusable code, much difficulty occurred in analyzing the code, especially inresolving dangling dependencies to Java libraries, types, methods, and fields [26].

The difficult-to-resolve dependencies reflect incompatibilities at not only thesource code level, but also the software architecture level. Architecture-centricreuse approaches date back at least to the work of Garlan et al. [23], who ar-gued that a main reason why software architecture is important is because itlets designers exploit recurring architectural styles to reuse routine solutions forcertain classes of problems. Drawing from their experience of failing to build asystem from reusable parts, Garlan et al. [23] recognized a root cause being theconflicting assumptions among the parts and termed this phenomenon “archi-tectural mismatch”. Generally speaking, four categories of assumptions can leadto architectural mismatch: nature of the components, nature of the connectors,global architectural structure, and construction process.

Researchers have advanced architecture-centric reuse by trying to avoid ortolerate mismatch [18, 22], to sustain evolutionary stability [16, 38], and to cata-log specialized solutions specific to a particular domain in a way that restricts therange of permissible components and their interactions [21, 34]. Beyer et al. [10]reported a success story of introducing a product-line architecture to a smallsoftware team of 2 developers and 1 tester. During 4 iterations, an organization-specific software architecture was established, static architectural compliancechecks were performed, and the reduced development effort was observed. In [10],the main benefit of the product-line architecture was to help communicate andnegotiate the competing stakeholder concerns within the same organization. Inour work, the focus is on examining the role of general software architecturalstyles in pragmatic code reuse without any organizational boundary.

2.3 Development Social Networks

Social interactions of software engineers have been studied from various techni-cal and organizational aspects. For example, given a particular organization orproject, people can “be friends” with the work items that they share [9], latentsub-communities can be identified based on email exchanges [11], and the socialcoordination structures can be inferred according to the technical dependenciesextracted from the software artifacts [13].

What have recently emerged to shape software engineering practices are theonline social networks where developers collaborate and exchange ideas and ex-pertise. These technologies include community portals, Q&A sites, wikis, forums,and microblogs [8]. Reviewing feeds, watching projects, and following others arethe most used social networking functionalities among today’s software develop-ers [17]. Surprisingly, little is known about how software reuse can utilize andeven strengthen the online social networks. This lack of knowledge is especiallyprominent in pragmatic, white-box reuse tasks.

Exploiting online information to support software reuse, even before the so-cial networking era, is not without problems. Hummel and Atkinson [27] pio-neered the systematic investigation of the Web as a reuse repository; however,

the web services that they deployed for white-box reuse experienced discontin-uation and returned disappointing results. Happel et al. [25] found that mostsource code search engines focused on retrieving lines of code and often lackedthe capability to help the re-user explore in-depth connected information. Zouand colleagues [39] proposed an automated approach to searching software reusecandidates and using the developer comments extracted from online social net-works to perform sentiment analysis of the candidates. In sum, the support sofar has been extensive on code search but not on the reuse per se. Understand-ing how social network information can help carry out the actual reuse task isprecisely the focus of our research.

3 Study Design

Our work aims to answer two main research questions: what information isneeded in pragmatic software reuse and how social networks can help meet thoseneeds. Thus our inquiry is composed of two parts: first eliciting the informationneeds and then quantifying the effect of social networks. To address these re-search questions, we performed a lab experiment. The rest of this section detailsour experimental design and execution.

3.1 Participants

The population that our study intends to impact is the community of pro-grammers who develop biomedical software; notably, most are bioinformaticsresearchers [5]. Twenty participants took part in our experiment (12 male and8 female; 18 graduate students and 2 staff researchers). These participants wererecruited from the Cincinnati local community via email invitations. To be eli-gible to participate in our experiment, each individual had to consider writingsoftware as an essential (as opposed to accidental) part of their work, and con-sider that pragmatic code reuse (as opposed to pre-planned reuse) is common intheir practice. We did not impose any criteria regarding research area, softwaredevelopment experience, or programming language, as we attempted to select asample representative of developers across the broad biomedical domains. Ourparticipants had a varied background: 13 had no professional software devel-opment experience, 1 had less than a year professional experience, 3 had 1–5years, and 3 had more than 5 years. Table 1 overviews the demographics of theparticipants in our study. The data were collected through a pre-experimental,self-reported survey. Note that we performed two pilot trials before the actualexperimentation to test instrumentation and solicit feedback. The results fromthese two pilots are excluded for the rest of the paper.

3.2 Tasks

The participants were asked to perform pragmatic software reuse tasks that havedirect biomedical relevance. We explicitly considered software architecture whendesigning the tasks. Two architectural styles were chosen: plug-in architecture

Table 1. Overview of participants and their self-reported pre-experimental surveydata: “#” represents the number of participants, “Area” shows the participant’s mainresearch area, “PL” denotes one or more programming languages that the participantis familiar with, “SE Freq” classifies the software engineering (coding, debugging, etc.)frequency, and “Reuse Freq” indicates the frequency of pragmatic software reuse.

Area # PL # SE Freq # Reuse Freq #

Genomics 5 Python 16 Daily 12 Frequently 11

Gene regula-4 C/C++ 7 Weekly 4 Sometimes 7

tory networks

Biostatistics4 Matlab 5 Monthly 1 Rarely 0

survival analysis

Neuroimaging3 R 5

Others

3

Others

2analysis (as needed, (in-house

Others (molecular4

Others (C#,14

project- mainte-

biology, proteomics, etc.) Java, etc.) driven, etc.) nance, etc.)

and event-driven architecture. For each architecture, an open-source softwareacted as the target system where the actual reuse was expected to take place.Next is a description of these two systems and their reuse tasks.

• ImageJ [1] is a Java image processing program whose author, Wayne Ras-band, works at the NIH’s Research Services Branch. We downloaded thelatest version of ImageJ (v1.49) and ran it as a standalone application ona Windows lab machine. ImageJ provides extensibility via Java plug-ins.Some plug-in examples are automatically installed and can be accessed asshown in Figure 1a. It is believed that user-written ImageJ plug-ins makeit possible to solve almost any image processing or analysis problem [1].The reuse task that we defined for our participants was inspired by proteinquantification with ImageJ [3]. In particular, we pre-processed an image con-taining a variety of different proteins being separated on a gel. We storedthe pre-processing results in 4 textual files, which were provided as inputsto the reuse task: Protein.txt defining the values on the x-axis and eachof Result1.txt, Result2.txt, and Result3.txt giving rise to a protein sample.For each sample, the participant was asked to reuse code so as to drawthe gel plot and perform linear regression of that plot. Figure 1b illustratesone sample curve and its linear regression result: R2=0.917147. Biologicallyspeaking, the best protein fit is the sample with the greatest R2 value.

• StochKit [7] is a C++ biochemical reaction simulation program [35]. Weinstalled its latest version (StochKit v2.0.10) on the same lab machine asImageJ. StochKit utilizes event-driven architecture to achieve fine-grainedcontrol of the reaction process and to simulate real-time response. Eventtriggers are discrete changes in the system state or parameter value typicallyused to mimic biological processes or to recreate experimental conditions [35].Figure 2 simulates the reaction: Blue + Red → Green. The reuse task herewas motivated by a mathematical model of an open monosubstrate enzyme

(a) (b)

Fig. 1. ImageJ reuse task: (a) example plug-ins after installation, (b) sample output.original curve, right is the curve after performing above task.

Hin

0 1 2 3 4 5 6 7 8 9 10

time

0

10

20

30

40

50

60

70

80trajectory plot

Column 1

Column 2Column 3

0 1 2 3 4 5 6 7 8 9 10

time

0

10

20

30

40

50

60

70

80trajectory plot

Column 1

Column 2

Column 3

(a) (b)

Fig. 2. StochKit task: (a) before reuse, (b) after reuse.

reaction [36]. Specifically, we asked the participant to reuse code so that theenzyme reaction could be better controlled, namely, to follow [36] to increaseBlue’s volume under two conditions: (i) when its value drops below 5, and(ii) at time units 2 and 8. If the two conditions interact, (i) takes precedenceover (ii). The StochKit task illustrates that human intervention is essential,especially when the amount of reactants needs to be strictly regulated toachieve a stable biochemical reaction environment.

To circumvent the unfamiliarity with the target systems, we provided a fewseed elements [26] to assist in participant’s investigation. For the ImageJ task, theparticipants were pointed to the “Example Plot implements PlugIn” class and the“PlotWindow(..)” constructor. For the participants tasked with StochKit, 3 files’names were given: \src\model parser\Input events.ipp, \src\solvers\SSA Direct.ipp,and \src\solvers\SSA Direct Events.ipp. These seeds were provided to the partic-ipants on the printed hard copy of the task descriptions and were provided onlyas structural elements without any hint of runtime behavior [26].

Table 2. Social network information (SNI) provided for the ImageJ reuse task.ImageJ Bookmarks:

# Title Link

1 Data Analysis‐Linear Regression http://introcs.cs.princeton.edu/java/97data/

2 Development ‐ ImageJ http://imagej.net/Develop

3 Gel electrophoresis ‐ Wikipedia https://en.wikipedia.org/wiki/Gel_electrophoresis

4 gel quantification analysis [ImageJ Documentation Wiki]

http://imagejdocu.tudor.lu/doku.php?id=video:analysis:gel_quantification_analysis

5 image ‐ Live vertical profile plot in ImageJ ‐ Stack Overflow

http://stackoverflow.com/questions/19016991/live‐vertical‐profile‐plot‐in‐imagej

6 Java read file and store text in an array ‐ Stack Overflow

http://stackoverflow.com/questions/19844649/java‐read‐file‐and‐store‐text‐in‐an‐array

7 Java Read Files With BufferedReader, FileReader

http://www.dotnetperls.com/bufferedreader

8 Linear Regression http://stattrek.com/regression/linear‐regression.aspx

9 Linear regression ‐ Wikipedia https://en.wikipedia.org/wiki/Linear_regression

10 Plot issues in Jython script for ImageJ

http://stackoverflow.com/questions/26400563/plot‐issues‐in‐jython‐script‐for‐imagej‐reference‐sources‐welcome

11 Plugins (ImageJ) http://rsb.info.nih.gov/ij/plugins/index.html

12 Protein Electrophoresis | Applications & Technologies |

http://www.bio‐rad.com/en‐us/applications‐technologies/introduction‐protein‐electrophoresis

13 Read Text file in string array Java http://www.technical‐recipes.com/2011/reading‐text‐files‐into‐string‐arrays‐in‐java/

StochKit Bookmarks:

# Title Link

1 abs ‐ C++ Reference http://www.cplusplus.com/reference/cmath/abs/

2 C Program: Solving Simultaneous Equations in Two Variables

http://www.thelearningpoint.net/computer‐science/c‐program‐solving‐simultaneous‐equations‐in‐two‐variables

3 C++ ‐ Difference between .ipp extension and .cpp extension files

http://stackoverflow.com/questions/19147208/difference‐between‐using‐ipp‐extension‐and‐cpp‐extension‐files

4 Equations for 2 variable Linear Regression ‐ Stack Overflow

http://stackoverflow.com/questions/459480/equations‐for‐2‐variable‐linear‐regression

5 Event Driven Programming? ‐ Programmers Stack Exchange

http://programmers.stackexchange.com/questions/230180/event‐driven‐programming

6 Global Variables ‐ C++ Forum http://www.cplusplus.com/forum/windows/115425/

7 How do you make C++ solve equations? ‐ C++ Forum

http://www.cplusplus.com/forum/beginner/34039/

8 java ‐ Creating a simple event driven architecture

http://stackoverflow.com/questions/13483048/creating‐a‐simple‐event‐driven‐architecture

9 Solving a system of 2 Linear Equations using C++

http://stackoverflow.com/questions/14594240/solving‐a‐system‐of‐2‐linear‐equations‐using‐c

10 visual studio ‐ How to declare a global variable in C++

http://stackoverflow.com/questions/9702053/how‐to‐declare‐a‐global‐variable‐in‐c

Table 3. Social network information (SNI) provided for the StochKit reuse task.

ImageJ Bookmarks:

# Title Link

1 Data Analysis‐Linear Regression http://introcs.cs.princeton.edu/java/97data/

2 Development ‐ ImageJ http://imagej.net/Develop

3 Gel electrophoresis ‐ Wikipedia https://en.wikipedia.org/wiki/Gel_electrophoresis

4 gel quantification analysis [ImageJ Documentation Wiki]

http://imagejdocu.tudor.lu/doku.php?id=video:analysis:gel_quantification_analysis

5 image ‐ Live vertical profile plot in ImageJ ‐ Stack Overflow

http://stackoverflow.com/questions/19016991/live‐vertical‐profile‐plot‐in‐imagej

6 Java read file and store text in an array ‐ Stack Overflow

http://stackoverflow.com/questions/19844649/java‐read‐file‐and‐store‐text‐in‐an‐array

7 Java Read Files With BufferedReader, FileReader

http://www.dotnetperls.com/bufferedreader

8 Linear Regression http://stattrek.com/regression/linear‐regression.aspx

9 Linear regression ‐ Wikipedia https://en.wikipedia.org/wiki/Linear_regression

10 Plot issues in Jython script for ImageJ

http://stackoverflow.com/questions/26400563/plot‐issues‐in‐jython‐script‐for‐imagej‐reference‐sources‐welcome

11 Plugins (ImageJ) http://rsb.info.nih.gov/ij/plugins/index.html

12 Protein Electrophoresis | Applications & Technologies |

http://www.bio‐rad.com/en‐us/applications‐technologies/introduction‐protein‐electrophoresis

13 Read Text file in string array Java http://www.technical‐recipes.com/2011/reading‐text‐files‐into‐string‐arrays‐in‐java/

StochKit Bookmarks:

# Title Link

1 abs ‐ C++ Reference http://www.cplusplus.com/reference/cmath/abs/

2 C Program: Solving Simultaneous Equations in Two Variables

http://www.thelearningpoint.net/computer‐science/c‐program‐solving‐simultaneous‐equations‐in‐two‐variables

3 C++ ‐ Difference between .ipp extension and .cpp extension files

http://stackoverflow.com/questions/19147208/difference‐between‐using‐ipp‐extension‐and‐cpp‐extension‐files

4 Equations for 2 variable Linear Regression ‐ Stack Overflow

http://stackoverflow.com/questions/459480/equations‐for‐2‐variable‐linear‐regression

5 Event Driven Programming? ‐ Programmers Stack Exchange

http://programmers.stackexchange.com/questions/230180/event‐driven‐programming

6 Global Variables ‐ C++ Forum http://www.cplusplus.com/forum/windows/115425/

7 How do you make C++ solve equations? ‐ C++ Forum

http://www.cplusplus.com/forum/beginner/34039/

8 java ‐ Creating a simple event driven architecture

http://stackoverflow.com/questions/13483048/creating‐a‐simple‐event‐driven‐architecture

9 Solving a system of 2 Linear Equations using C++

http://stackoverflow.com/questions/14594240/solving‐a‐system‐of‐2‐linear‐equations‐using‐c

10 visual studio ‐ How to declare a global variable in C++

http://stackoverflow.com/questions/9702053/how‐to‐declare‐a‐global‐variable‐in‐c

While the seed elements were available to every participant, our indepen-dent variable was the social network information that we wanted to test in acontrolled manner. To instrument such a treatment, two researchers manuallysearched for useful online resources and jointly finalized a set of links for eachtask. Tables 2 and 3 list these links pointing to portals (e.g., ImageJ #11),wikis (e.g., ImageJ #3), forums (e.g., StochKit #7), Q&A sites (e.g., StochKit#5), etc. These links are by no means complete. Our intention is to raise theparticipant’s awareness of online social network information and offer a set ofspecific links to encourage them to take advantage of the information duringtheir reuse tasks. In this sense, the links in Tables 2 and 3 should be treated ashints that provide shortcuts to potentially useful information for carrying out

Table 4. Experimental block assignments.

ID (Block Name) First Task Second Task

A ImageJ-without-SNI StochKit-with-SNIB ImageJ-with-SNI StochKit-without-SNIC StochKit-without-SNI ImageJ-with-SNID StochKit-with-SNI ImageJ-without-SNI

the pragmatic reuse tasks. We grouped these “shortcuts” in the experimentalcomputer’s web browser’s bookmark — one bookmark folder per reuse task. Toavoid participant’s unintentional inference about the resources’ importance, weordered the links inside each bookmark folder alphabetically, as shown in Ta-bles 2 and 3. For the remaining of the paper, we use “pre-selected SNI” to referto the social network information presented in Tables 2 and 3.

3.3 Procedure

The participants worked individually in a lab and began by signing the con-sent form and completing a background survey (cf. Table 1). Each participantreceived a randomly assigned experimental ID and followed the correspondingblock assignment to perform the two reuse tasks. Table 4 shows our block de-sign, in which both SNI-treatment order and task order are counterbalanced.Thus, each participant performed one task with SNI and the other without thepre-instrumented SNI support. Similar to [26], our design is best understood aswithin-(participants plus SNI treatment) and between-(participants plus order).

A researcher explained the first reuse task to the participant. The task de-scription was printed on a hard copy which was presented throughout the taskperiod for easy reference. The researcher then introduced the target system, aswell as the seed elements by emphasizing their structural aspects. If the first taskwas with the SNI treatment, then the participant was made aware of the task-specific bookmark folder that the researcher pre-ported to the lab computer. Forthe instrumentation to be uniform, the researcher configured all the 3 browsers’bookmarkings of the computer in the same way: Internet Explorer, Mozilla Fire-fox, and Google Chrome. If the first task was in the control group receiving noSNI treatment, then the researcher would make sure no task-related bookmarksexisted in the browsers. The participant was then asked to perform the firstreuse task and was encouraged to “think aloud” to verbalize their rationales,decision, strategies, and tactics being employed. Note that the participant wasallowed to access the entire internet for completing the reuse task, independentof whether pre-selected SNI was present. Informed by our pilot trials, we setthe expected task completion time to be 20 minutes and communicated such anexpectation to the participant prior to the task. The participant was remindedaround 20 minutes into the task but was not forced to terminate until a naturalstop point was signaled by the participant himself or herself. The researcher thenconducted an informal interview with the participant to collect feedback, and ifthe first task was treated with SNI then the usefulness of the pre-selected SNIwas also surveyed verbally. The participant was given a break if desired, andthen continued with the second reuse task in the same manner.

Software architecture

centric

Problem domain

centric

C1: Reuse

infrastructure

C2: Components &

connectors

C3: Reuse

implementation

C4: Problem

solving

C5: Problem

understanding

Fig. 3. Categories of information needs in pragmatic software reuse.

4 Results and Analysis

4.1 Information Needs in Pragmatic Software Reuse

Understanding the needs of software developers is a prerequisite for researchersand tool builders to better answer those needs. For software evolution tasks,Sillito et al. [37] identified 44 specific questions programmers ask and furtherclassified those questions into 4 groups: (1) finding focus points, (2) expand-ing focus points, (3) understanding a subgraph, and (4) understanding groups ofsubgraphs. Ko et al. [31] abstracted from 17 Microsoft developers’ daily practicesinto 21 types of information needs, emphasizing the communication and coordi-nation demands in collocated software teams. The needs in pragmatic softwarereuse tasks, to the best of our knowledge, have not been thoroughly explored.

The participants in our study asked a variety of questions which we groupinto 5 categories. Figure 3 positions the categories along a software-architecture-centric and problem-domain-centric spectrum. The specific questions are pre-sented below, annotated with ‘I’ (relevant to ImageJ), ‘S’ (relevant to StochKit),or ‘B’ (relevant to both).

Reuse infrastructure (C1) touches upon the critical issues of the architecturalstyle underpinning the target system, and if not addressed properly, will likelycause serious architectural mismatch [23, 24] thereby hampering pragmatic reuse.

1. Where is the starting point that kind of likes a main function? [B]2. How does this software know that I am writing a plugin class? [I]3. What is the control structure/flow in an event-driven architecture? [S]4. Where is an event triggered and/or captured? [S]5. How to reuse the seeds to realize simple functions like ‘Hello World’? [B]

Components & connectors (C2) are at the heart of software architecture.Understanding the computation, the interface, the decomposition, and the in-terdependency is key to arrive at a successful reuse implementation.

6. How to name the plug-in class and what must be imported? [I]7. Where should I save the plugin class in the file system? [I]8. How do different events relate to each other? [S]9. Where to specify precedence of multiple events? [S]

10. How can I customize the data to fit into the function being reused? [B]11. How to initialize suitable variables to be applicable for a function? [B]12. What is the linkage between the computation units (methods, procedures,

etc.) and/or between the encapsulates (classes, templates, etc.)? [B]

Reuse implementation (C3) is where the two ends of Figure 3 meet. Solvingthe needs in this category will facilitate the completion of the pragmatic reusetask in an architecturally compatible way.

13. Where can I see the compilation information? [B]14. Is this software capable of printing things out to help me debug? [B]15. Where are the input files located and how to change the values related to

the reuse task in the (input) files? [B]16. How to resolve the dangling references of a reused code fragment? [B]17. How to output string with a numeric value together? [I]18. How is the visualization for trajectories done? [S]

Problem solving (C4) shifts the information needs toward the functionalitiesthat the reuse task dictates. Here the programmers search for reuse candidateswritten in same programming language as the target system.

19. Is there existing Java implementation that I can import (reuse) to calculatelinear regression as well as curve plotting? [I]

20. Can I find existing code to read data from text files in Java? [I]21. Is there available C++ code online to solve the linear equations? [S]22. What might be the existing implementation for a specific function (variable

type conversion [I], absolute value calculation [S], etc.)? [B]23. How to initialize member variables and vectors in Java or C++? [B]24. Are there unit tests to be reused together with code? [B]

Problem understanding (C5) helps the developers to clarify the conceptualquestions about the reuse task. Formulating an appropriate task context is im-portant to search and evaluate reuse candidates.

25. Why use linear regression in this task? [I]26. What is the original gel sample: human or other species? [I]27. How is the protein separated from the gel and what is the expected error

rate of the given input files as this can affect the kinds of (linear) regressionthat I do? [I]

28. Do I have to plot three curves in a single figure or in 3 separate figures? [I]29. What does species mean in a biochemical reaction? [S]30. Are those values (time units 2 and 8, volume below 5) arbitrary or do they

follow certain properties? [S]31. What is the biomedical significance of the task? [B]

0

1

2

3

4

5

6

7

ImageJ ImageJ ImageJ StochKit StochKit

C1

Chart Title

0

1

2

3

4

5

6

7

C1-ImageJ C1-StochKit

Chart Title

pre-selected used additional

0

2

4

6

8

10

12

14

16

C1-ImageJ C1-StochKit C2-ImageJ C2-StochKit C3-ImageJ C3-StochKit C4-ImageJ

# o

f Li

nks

Category Answered by Links

pre-selected used additional

0

2

4

6

8

10

12

ImageJ StochKit ImageJ StochKit ImageJ StochKit ImageJ StochKit ImageJ StochKit

C1: Reuseinfrastructure

C2: Components &connectors

C3: Reuseimplementation

C4: Problemsolving

C5: Problemunderstanding

# o

f u

niq

ue

SN

I lin

ks

pre-selected pre-selected & followed additionally followed

Fig. 4. SNI usage by tasks and information-need categories.

4.2 Supporting the Needs with Social Network Information

Having elicited the specific questions and characterized them, we now examinehow the SNI supports pragmatic reuse needs. The support is analyzed both qual-itatively and quantitatively. Figure 4 presents our qualitative analysis result, inwhich the mappings between the SNI links and the information-need categoriesare established. For each category, we present three statistics per task: the sup-port from pre-selected SNI links (cf. Tables 2 and 3), the pre-selected linksfollowed by the participants, and the additional online SNI that the participantsaccessed during pragmatic reuse.

A couple of observations can be made from Figure 4. First, most pre-selectedSNI links were actually followed. This indicates that the developers perceivedthe SNI as helpful hints which they were willing to spend time investigating.Second, for a category whose pre-selected SNI links were actively followed, moreadditional links were sought. This implies that the developers, once made awareof SNI support, were motivated to pursue more links, which in turn increasedthe likelihood of devising a reuse solution, as opposed to starting from scratch.

The usage data of Figure 4 increased our confidence in the magnitude ofthe impact that SNI had on pragmatic reuse. In another word, if little SNIwere followed, the impact would be trivial. To quantify such an impact, weassessed 2 variables: time to task completion and success of reuse solution. Thecomparisons were made between the control groups (participants who did notreceive pre-selected SNI links) and the treatment groups (those who did).

Figure 5 compares the time required for completing the reuse task in differ-ent settings. Generally speaking, developers spent less time on the ImageJ taskthan the StochKit task. This indicates that ImageJ’s architecture, namely theplug-in architecture, is more extensible. By conforming to basic architecturalconstraints and the construction process (e.g., importing the necessary libraries,storing the new class file in the plug-in folder, etc.), the developers were able toquickly extend the functionality of ImageJ. When the median completion timeis compared, pre-selected SNI links facilitated both tasks to be finished faster.

ImageJ−without−SNI ImageJ−with−SNI StochKit−without−SNI StochKit−with−SNI

1618

2022

24

Tim

e to

Tas

k C

ompl

etio

n

Fig. 5. Comparison of task completion time.

20%10%

30%

10%

50%

40%

50%

70%

30%

50%

20% 20%

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

ImageJ-without-SNI ImageJ-with-SNI StochKit-without-SNI StochKit-with-SNI

Task Completion

Unsuccessful Partially successful Successful

Fig. 6. Comparison of reuse solution success.

However, the effect is statistically significant only on the ImageJ task (Wilcoxonsigned rank test: p=0.0098, α=0.05) but not on the StochKit task (Wilcoxontest: p=0.2357, α=0.05). We speculate this may be caused by the more effortin understanding StochKit’s event-driven architecture as well as in locating thefeature where pragmatic reuse would interact with [33]. Testing this hypothesisrequires further research.

In general, a (pragmatic) reuse task can have multiple, equally valid solutions.Thus we assessed the participants’ reuse solutions on an individual basis with-out a pre-determined ‘gold standard’ answer. Two researchers jointly judged allthe solutions and qualified them into 3 categories: successful (fulfilled function-ality with the reuse done conforming to the underlying software architecture),unsuccessful (unfulfilled functionality or solution developed from scratch), andpartially successful (things in between). Figure 6 shows the distributions. For the

ImageJ task, when provided with SNI, the developers could better complete thereuse in that the successful rate increased from 30% to 50% and the unsuccessfulrate decreased from 20% to 10% in Figure 6. For StochKit, SNI’s help seemedrather limited by shifting some unsuccessful reuse solutions to partially, but notcompletely, successful ones.

It is evident based on our observation and our interviews with the partici-pants that, for the two tasks, developers preferred reuse over devising a solutionfrom scratch. The preference was predominant. In addition, the support that thedevelopers received from the SNI — either pre-selected, additionally followed, orboth — is indisputable. In fact, all the participants in our study showed signifi-cant reliance on SNI that some went on using Google to confirm and even refinethe pre-selected links prepared by us. In sum, our results suggest the positiveimpact of SNI on answering developers’ needs and completing the pragmaticreuse tasks with speed and quality.

4.3 Threats to Validity

Our results about bioinformatics software developers’ information needs in prag-matic reuse are clearly influenced by the types of architectural styles and partic-ular systems adopted in our experiment. While the 31 specific questions may notbe generalizable to other settings, we believe the external validity is relativelystronger for our result categories (cf. Figure 3). Another external validity threatrelates to the representativeness of our study participants. While we tried tobe inclusive, the participants were recruited from a local community and wereprimarily affiliated with a university’s medical campus.

For our analysis about SNI’s effect on pragmatic reuse, an important con-struct validity threat is the pre-selection of the SNI links for both tasks. Weattempted to avoid repetitiveness in our SNI preparation; however, different re-searchers might select different online resources. Nevertheless, the usage datareported in Figure 4 show that any SNI preparation will be unavoidably incom-plete. As a result, we argue that the main purpose of the SNI is not to presentto the developers all the links, but to encourage them to engage in an activeinformation seeking and knowledge acquisition process.

5 Conclusions

For the biomedical community, reusing software in a pre-planned manner isoften infeasible. Pragmatic software reuse, therefore, is key to bioinformaticsprogrammer’s success in practice. This paper makes two major contributions:(1) a classification of 31 information needs elicited during pragmatic reuse tasks,and (2) a controlled experiment revealing the positive impact that developmentonline social networks have on meeting the information needs and on completingthe reuse tasks.

Our study’s results have several direct implications for tool building. First ofall, architecture decisions should be central to the identification and evaluation

of reuse candidates. Moreover, when the code is reused and integrated into thetarget system [33], architectural conformance should be checked and violationsshould be managed. Last, but certainly not the least, social network informationshould be seamlessly incorporated into the entire process of pragmatic softwarereuse: ranging from understanding the reuse infrastructure to implementing asuccessful reuse solution. It is hoped that our work illuminates a systematic wayto tackle architectural mismatch [24] in biomedical software reuse.

Acknowledgments. This research is partially supported by the U.S. NationalScience Foundation (Award CCF-1350487) and the National Natural ScienceFoundation of China (Fund No. 61375053).

References

1. ImageJ: Image Processing and Analysis in Java. http://imagej.nih.gov/ij/ Last ac-cessed: December 2015.

2. NIH Big Data to Knowledge. http://bd2k.nih.gov Last accessed: December 2015.3. Protein Quantification Using ImageJ. http://openwetware.org/wiki/Protein Quan

tification Using ImageJ Last accessed: December 2015.4. Software Discovery Index: Request for Comments. https://nciphub.org/resources/

888/download/Software Discovery Index Meeting Report with comments.pdf Lastaccessed: December 2015.

5. Software Discovery Meeting Report. https://nciphub.org/resources/885/supportingdocsLast accessed: December 2015.

6. SoftwareX: An Elsevier Journal. http://www.journals.elsevier.com/softwarex/ Lastaccessed: December 2015.

7. StochKit: Stochastic Simulation Kit. http://www.engineering.ucsb.edu/˜cse/StochKit/StochKit whatis.html Last accessed: December 2015.

8. A. Begel, J. Bosch, and M.-A. Storey. Bridging software communities through socialnetworking. IEEE Software, 30(1): 26–28, 2013.

9. A. Begel, Y. Khoo, and T. Zimmermann. Codebook: discovering and exploitingrelationships in software repositories. In ICSE, pages 125–134, 2010.

10. H.-J. Beyer, D. Hein, C. Schitter, J. Knodel, D. Muthig, and M. Naab. Introducingarchitecture-centric reuse into a small development organization. In ICSR, pages 1–13, 2008.

11. C. Bird, D. Pattison, R. D’Souza, V. Filkov, and P. Devanbu. Latent social struc-ture in open source projects. In FSE, pages 24–35, 2008.

12. J. Brandt et al. Opportunistic programming: writing code to prototype, ideate,and discover. IEEE Software, 26(5): 18–24, 2009.

13. M. Cataldo, J. Herbsleb, and K. Carley. Socio-technical congruence: a frameworkfor assessing the impact of technical and work dependencies on software developmentproductivity. In ESEM, pages 2–11, 2008.

14. P. Clements and L. Northrop. Software product lines: practices and patterns.Addison-Wesley, 2001.

15. E. Constantinou, A. Naskos, G. Kakarontzas, and I. Stamelos. Extracting reusablecomponents: A semi-automated approach for complex structures. Information Pro-cessing Letters, 115(3): 414–417, 2015.

16. E. Constantinou and I. Stamelos. Architectural stability and evolution measure-ment for software reuse. In SAC, pages 1580–1585, 2015.

17. L. Dabbish, H. Stuart, J. Tsay, and J. Herbsleb. Leveraging transparency. IEEESoftware, 30(1): 37–43, 2013.

18. R. de Lemos, C. Gacek, and A. Romanovsky. Architectural mismatch tolerance.In WADS, pages 175–194, 2002.

19. R. P. de Souza, M. N. Costa, R. M. M. Braga, M. Mattoso, and C. M. L. Werner.Software components reuse through Web search and retrieval. In WIIW, pages 12–18, 2001.

20. M. Dinsoreanu and I. Ignat. A pragmatic analysis model for software reuse. InSERA, pages 217–227, 2009.

21. P. V. Elizondo and K.-K. Lau. A catalogue of component connectors to supportdevelopment with reuse. Journal of Systems and Software, 83(7): 1165–1178, 2010.

22. C. Gacek and C. Gamble. Mismatch avoidance in web services software architec-tures. Journal of Universal Computer Science, 14(8): 1285–1313, 2008.

23. D. Garlan, R. Allen, and J. Ockerbloom. Architectural mismatch: why reuse is sohard. IEEE Software, 12(6): 17–26, 1995.

24. D. Garlan, R. Allen, and J. Ockerbloom. Architectural mismatch: why reuse is stillso hard. IEEE Software, 26(4): 66–69, 2009.

25. H.-J. Happel, T. Schuster, and P. Szulman. Leveraging source code search for reuse.In ICSR, pages 360–371, 2008.

26. R. Holmes and R. Walker. Systematizing pragmatic software reuse. ACM Trans-actions on Software Engineering and Methodology, 21(4): Article 20, 2012.

27. O. Hummel and C. Atkinson. Using the Web as a reuse repository. In ICSR, pages298–311, 2006.

28. W. Janjic, O. Hummel, and C. Atkinson. Reuse-oriented code recommendationsystems. In Recommendation Systems in Software Engineering M. P. Robillard etal. (eds.), Springer 2014, pages 359–386.

29. G. Kakarontzas, E. Constantinou, A. Ampatzoglou, and I. Stamelos. Layer assess-ment of object-oriented software: A metric facilitating white-box reuse. Journal ofSystems and Software, 86(2): 349–366, 2013.

30. C. Krueger. Software reuse. ACM Computing Surveys, 24(2): 131–183, 1992.31. A. Ko, R. DeLine, and G. Venolia. Information needs in collocated software devel-

opment teams. In ICSE, pages 344–353, 2007.32. O. A. L. Lemos, S. K. Bajracharya, J. Ossher, P. C. Masiero, and C. V. Lopes.

A test-driven approach to code search and its application to the reuse of auxiliaryfunctionality. Information & Software Technology, 53(4): 294–306, 2011.

33. J. Maras, M. Stula, and I. Crnkovic. Towards specifying pragmatic software reuse.In ECSAW, Article No. 54, 2015.

34. L. Rodriguez, A. Ampatzoglou, P. Avgeriou, and E. Nakagawa. A comparativeanalysis of reference architectures for healthcare in the ambient assisted living do-main. In CBMS, pages 270–275, 2015.

35. K. Sanft et al. StochKit2: software for discrete stochastic simulation of biochemicalsystems with events. Bioinformatics, 27(17): 2457–2458, 2011.

36. E. Sel’Kov. Self-oscillations in glycolysis. European Journal of Biochemistry, 4(1):79–86, 1968.

37. J. Sillito et al. Asking and answering questions during a programming change task.IEEE Transactions on Software Engineering, 34(4): 434–451, 2008.

38. D. Tofan, M. Galster, P. Avgeriou, and W. Schuitema. Past and future of soft-ware architectural decisions – A systematic mapping study. Information & SoftwareTechnology, 56(8): 850–872, 2014.

39. Y. Zou, C. Liu, Y. Jin, and B. Xie. Assessing software quality through web commentsearch and analysis. In ICSR, pages 208–223, 2013.

pragmatic software reuse in bioinformatics: how can social ...niunn/courses/icsr16/... · keywords:...

Documents