engineering bioinformatics: building reliability, performance and productivity into bioinformatics...

This article was downloaded by: [194.88.4.145]On: 15 July 2015, At: 08:31Publisher: Taylor & FrancisInforma Ltd Registered in England and Wales Registered Number: 1072954 Registered office: 5 Howick Place,London, SW1P 1WG

Click for updates

BioengineeredPublication details, including instructions for authors and subscription information:http://www.tandfonline.com/loi/kbie20

Engineering bioinformatics: building reliability,performance and productivity into bioinformaticssoftwareBrendan Lawlora & Paul Walshb

a Cork Institute of Technology; Cork, Irelandb NSilico Life Science; Cork, IrelandAccepted author version posted online: 21 May 2015.Published online: 21 May 2015.

To cite this article: Brendan Lawlor & Paul Walsh (2015): Engineering bioinformatics: building reliability, performance andproductivity into bioinformatics software , Bioengineered, DOI: 10.1080/21655979.2015.1050162

To link to this article: http://dx.doi.org/10.1080/21655979.2015.1050162

PLEASE SCROLL DOWN FOR ARTICLE

Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained inthe publications on our platform. Taylor & Francis, our agents, and our licensors make no representations orwarranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Versionsof published Taylor & Francis and Routledge Open articles and Taylor & Francis and Routledge Open Selectarticles posted to institutional or subject repositories or any other third-party website are without warrantyfrom Taylor & Francis of any kind, either expressed or implied, including, but not limited to, warranties ofmerchantability, fitness for a particular purpose, or non-infringement. Any opinions and views expressed in thisarticle are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. Theaccuracy of the Content should not be relied upon and should be independently verified with primary sourcesof information. Taylor & Francis shall not be liable for any losses, actions, claims, proceedings, demands,costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly inconnection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Terms & Conditions of access anduse can be found at http://www.tandfonline.com/page/terms-and-conditions It is essential that you check the license status of any given Open and Open Select article to confirmconditions of access and use.

http://crossmark.crossref.org/dialog/?doi=10.1080/21655979.2015.1050162&domain=pdf&date_stamp=2015-05-21

http://www.tandfonline.com/loi/kbie20

http://www.tandfonline.com/action/showCitFormats?doi=10.1080/21655979.2015.1050162

http://dx.doi.org/10.1080/21655979.2015.1050162

http://www.tandfonline.com/page/terms-and-conditions

Engineering bioinformatics: building reliability, performanceand productivity into bioinformatics software

Brendan Lawlor1,* and Paul Walsh2

1Cork Institute of Technology; Cork, Ireland; 2NSilico Life Science; Cork, Ireland

There is a lack of software engineer-ing skills in bioinformatic contexts.

We discuss the consequences of this lack,examine existing explanations and reme-dies to the problem, point out theirshortcomings, and propose alternatives.Previous analyses of the problem havetended to treat the use of software in sci-entific contexts as categorically differentfrom the general application of softwareengineering in commercial settings. Incontrast, we describe bioinformatic soft-ware engineering as a specialization ofgeneral software engineering, and exam-ine how it should be practiced. Specifi-cally, we highlight the difference betweenprogramming and software engineering,list elements of the latter and present theresults of a survey of bioinformatic prac-titioners which quantifies the extent towhich those elements are employed inbioinformatics. We propose that the idealway to bring engineering values intoresearch projects is to bring engineersthemselves. We identify the role of Bioin-formatic Engineer and describe how sucha role would work within bioinformaticresearch teams. We conclude by recom-mending an educational emphasis oncross-training software engineers into lifesciences, and propose research onDomain Specific Languages to facilitatecollaboration between engineers andbioinformaticians.

Problem Description

This paper identifies a significant lackof software engineering practices in bioin-formatics when compared to commercialsoftware development, which prevents thebioinformatic community from benefiting

from decades of engineering efficiencies,rigour and quality. The problem is presentin computational science in general, butfor the purposes of this discussion, we willconcentrate on bioinformatics. Softwareengineering skills are lacking, as is evidentin the way in which software is developedin bioinformatic contexts. Although biolo-gists and especially bioinformaticians pos-sess programming skills, and use thoseskills as part of their day to day work, theydo so in a way that is unstructured andnot in line with modern standards of soft-ware engineering.1,2 The problem has seri-ous consequences for the field ofbioinformatics and demands that we findeffective solutions.

We will examine these consequencesunder a number of headings, but in allcases they boil down to 2 overarchingproblems: the bioinformatic communityarrives at findings more slowly than it oth-erwise might, and those findings, whenarrived at, are less reliable than they mightotherwise be. By focusing on solutions tothe lack of software engineering skills inbioinformatics, we can address both ofthese effects. A first step is to better under-stand their nature by identifying moreprecisely how and where they arise.

Inability to reproduce findings

A lack of software engineering infra-structure and techniques means that manypublications which process data informati-cally cannot make that software or dataavailable in a reproducible way for peerreview. As a consequence, a significantpercentage of findings is likely to bereversed or withdrawn from publication.The use of infrastructure such as source

Keywords: bioinformatics, microbial bio-technology, process, survey, software, soft-ware engineering

© Brendan Lawlor and Paul Walsh*Correspondence to: Brendan Sean Lawlor; Email:[email protected]

Submitted: 02/04/2015

Revised: 05/05/2015

Accepted: 05/06/2015

http://dx.doi.org/10.1080/21655979.2015.1050162

This is an Open Access article distributed under theterms of the Creative Commons Attribution-Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestrictednon-commercial use, distribution, and reproductionin any medium, provided the original work is prop-erly cited. The moral rights of the named author(s)have been asserted.

www.tandfonline.com 193Bioengineered

Bioengineered 6:4, 193--203; May 1, 2015; Published with license by Taylor & Francis Group, LLCCOMMENTARY

Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

http://creativecommons.org/licenses/by-nc/3.0/






code control systems and command-linebuild tools would improve the situation,by giving researchers the ability to easilypublish and share the software that wasused as part of their work. But these toolsare either unknown or simply consideredunnecessary for small teams of bioinfor-matic researchers.

Unreliability of findings

All surveys on scientific software devel-opment that we have reviewed cite a lackof software testing as being a constanttheme of scientific development. Segalpoints out that the “lack of any disciplinedtesting procedure” is a characteristic of anydevelopment practice where the end useris also the developer.3 According to areview by Morris “unit testsA often do notexist."4 Because of the fundamentallyimportant role of such tests in separatingproblems in the code from problems inthe hypotheses, findings based on insuffi-ciently tested software must be consideredin turn insufficiently tested themselves.Compare this to the use of defective oruncalibrated lab equipment in order tofully appreciate the nature of the problem.

Limitations in data sample size

Many scientists run their software onmulti-core desktops but do so in a single-threaded way which creates performancebottlenecks.5 This is most likely due to alack of familiarity with the kind of parallelcomputing techniques available to softwareengineers. The constraints that this practiceinevitably imposes on sample size or sophis-tication of data analysis are clear: In orderto execute programs to completion ondesktops, even if in a time frame of hoursand days, researchers will naturally reducethe number of sample points used, or elimi-nate steps which might increase statisticalpower but which have exponential or facto-rial performance profiles.5 A parameter

study, for example, can benefit from pair-wise comparisons of its features; but the

number of such pairs isn

2

� �, where n is

the number of features, so even modestvalues of n require concurrent program-ming and resource management to run tocompletion on desktop computers. Wheremulti-threaded implementations are usedin scientific programming, they typicallyinvolve using OpenMPB (for multi-core)or MPIC (for multi-server). These solu-tions use low-level primitives and as suchare painstaking to develop and can resultin error-prone code which is difficult tochange, especially in large systems.6 Soft-ware engineering research has morerecently concentrated on using higherabstractions which result in more intuitiveways to achieve concurrency, for examplethrough the use of the Actor architecture.7

There are examples of the successful use ofsuch engineering to the bioinformaticscommunity.8

Slowing the discovery cycle

Bioinformatic research is an iterativeprocess in which the computational ele-ment takes up a significant percentage. Ifthe researcher has to wait days to seecomputational results which will decidethe next direction that the research is totake, momentum is lost and the entireprocess of research itself is slowed down.Software engineers can bring skills likeperformance optimisation and concurrentprogramming to bear on this problem,significantly reducing waiting times.

Reinventing the wheel

According to Prabhu et al. “a consider-able portion of [scientists’] time is spentin many tedious [software development]activities” such as converting data formats

or retro-fitting inherited software to workfor new conditions.5 This is a direct conse-quence of insufficient software engineer-ing infrastructure and practices aroundthe research team. Researchers are obligedto repeatedly cobble together solutions forevery new direction they take. Naturallythe nature of these improvised solutionsdoes not facilitate their reuse - they typi-cally don’t exhibit high levels of maintain-ability or build-reproducibility - and sothe problem perpetuates itself.

In all of the above cases, we can discerna parallel to the argument made by Ioan-nidis with respect to inexpert use of statis-tics in studies.9 The danger to progress inbioinformatics is that much research maylater be found to be invalid due to inex-pert or non-transparent development ofsoftware. As Verma et al. point out, “theend goal of creating accurate and reliablescientific software is no less critical [thanwith commercial software] since incorrectresults would greatly compromise thevalidity of the discovery."1 This is anunsettling prospect indeed.

As in silico experiments become anincreasingly important form of researchand development, problems of reproduc-ibility and reliability will become moreobvious and more urgent. Moreover, soft-ware engineering techniques will be keynot just in addressing those problems, butin the initial conception and design ofsuch experiments.

Solutions from the Literature

These are some of the things that cango wrong in bioinformatic research whenwe fail to address the problem of its soft-ware engineering deficit. But why doesthis deficit arise in the first place? Andwhat can be done to improve matters?

A number of the authors we havereviewed offer explanations and remediesfor the problems described above. Hannayet al. identify a general lack of formal edu-cation and training and a reliance insteadon informal learning from peers.10 Segal& Morris among others emphasize thedifferences between scientific and com-mercial software development.11 Simi-larly, Verma et al. cite a lack ofrequirements engineering in bioinformatic

AUnit testing is a software development practice inwhich individual units of source code - for examplea class in object oriented programming, a proce-dure in imperative programming or a function infunctional programming - are tested in isolation todetermine whether they behave correctly.

BOpenMP - Open Multi-Processing - is an API speci-fication for shared memory parallel programming.See http://openmp.org/wp/.CMPI - Message Passing Interface - is a standardizedand portable message-passing system designed bya group of researchers from academia and industryto function on a wide variety of parallel computers.See https://www.open-mpi.org/.

194 Volume 6 Issue 4Bioengineered

Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

projects, as well as other factors that createthe “unique situation for the field of soft-ware engineering” represented by bioin-formatics.1 Umarji et al. focus exclusivelyon the gaps in the education of bioinfor-matic software developers in softwareengineering principles.12

It is important to correctly identify all ofthe significant causes of the problem. If westart with a false or incomplete diagnosisthe treatment is unlikely to be effective. Wewill look in detail at the root causes pro-posed by previous studies, but from the pre-vious paragraph we can see that there aresome elements in common in the way pre-vious authors have understood the prob-lem, and so in the solutions that they haveproposed. Here we will categorize them aseducation, methodology and special pleading,and they can be described as follows:

Education

Some authors have found that bioin-formaticians lack the necessary training inSoftware Engineering skills. Umarji et al.have surveyed bioinformatics curricula inthe United States and found that “out of atotal of 79 program offerings, there wereonly 2 instances where a software engi-neering related course was a required partof the curriculum” and that “there was nomention of the role and importance ofsoftware engineering in the curricula."12

Methodology

The wrong processes - or no processes atall - are being applied to the practice of bio-informatic research. Verma et al. reportthat “little emphasis is paid on the organi-zation and requirement gathering processin the early stages of the software."1

Special Pleading

According to some authors, the field ofscientific software development is so farremoved from the commercial settings inwhich modern Software Engineering hasemerged, that the rules from the lattersimply do not apply. Authors have sug-gested that the 2 contexts are“fundamentally different” for reasons of

subject domain complexity, requirementsvolatility and budgetary constraints. Thesedifferences make it problematic to“impose software engineering techniqueson scientists."11 So much space has beengiven to the differences between scientificand commercial development, that it isuseful to break it down further as follows.

� Subject Domain Complexity. Segal &Morris assert that in the case of scien-tific software development the subjectmatter is simply too complex for the“average developer." In a similar vein,Hannay suggests that “developers aremuch less likely to need to be domainexperts” in “regular” software develop-ment compared to scientific.11

� Requirements Volatility. According toSegal & Morris, “full up-front require-ment specifications are impossible”where scientists are concerned, and thatrequirements rather “emerge” on anongoing basis. The suggestion is thatthis is a distinctive feature of scientificprogramming, which makes the appli-cation of software engineering techni-ques more difficult.

� Budget and Resources. Verma et al.and Umarji et al. cite tighter budgetand timetable constraints as a differenti-ating factor of bioinformatic softwaredevelopment, and therefore as one pos-sible cause of a lack of software engi-neering best practices in that field.

End User (Scientist) as Developer

A number of authors point out culturaldifferences between scientists and softwareengineers as an important issue. Segal &Morris suggest that due to the subjectdomain complexity already mentioned,developers are likely to be the end-user sci-entists. But as Verma et al. point out,biologist stakeholders - who are the pri-mary stakeholders in these settings - “maybe more inclined to sacrifice programstructure to get something that works."

Naturally enough, the solutions pro-posed by these studies flow from the diag-noses of the problem. Those who concludethat the problem lies in education proposeimprovements to curricula. Those thatimplicate incorrect methodologies suggest

alternatives that are more suitable to bioin-formatics. Papers which emphasize the dis-connect (real or perceived) betweenscientific and software engineering worldsdon’t offer suggestions about how to bringsoftware engineering values into the scien-tific community, which again is natural,given their premise.

Software Engineering vsComputer Programming

Before we examine the existingexplanations and remedies for the soft-ware engineering deficit in bioinformat-ics, we make a brief but importantdigression: We outline the differencesbetween computer programming andsoftware engineering in order to preparefor later arguments that lean on thesedifferences.

The skills required to program arenot the same as those required to engi-neer a software solution. Programmingis a subset of the discipline of softwareengineering in much the same way thatdraftsmanship is a subset of the skillsrequired for architecture. This uncon-troversial fact is under-appreciated inscientific settings, for reasons aboutwhich we might only speculate. It takesa great deal longer to make a softwareengineer than it does simply to make aprogrammer. This should come as nosurprise, given the fact that SoftwareEngineering is a distinct academiccourse of studies and a distinct profes-sional discipline. Practicing softwareengineers draw from a large body ofacademic knowledge and a long andvital component of workplace experi-ence. There is a long-standing recogni-tion, going back to thought-leaders likeEW Dijkstra, that software engineeringis as much a craft as a science.13 Assuch, its skills are acquired as muchthrough a kind of apprenticeship asthrough the academic studies that pre-cede it. This has been sufficiently appre-ciated by educators that some havesought to incorporate elements of thatapprenticeship model into academiccoursework.14 The elements of softwareengineering practice that are oftenabsent from bioinformatic teams


Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

correspond to those elements which aretypically learned by the software engi-neering apprentice (source control,build systems, unit testing etc). This ishardly surprising: Scientists learn moresoftware development skills informallyfrom other scientists, and through self-study, than through formal education.10

In one study 84% of scientists whowere surveyed indicated that they hadrelied mostly on self-learning for theirsoftware skills.1 In either case, neithermode of learning can be compared tothe prolonged exposure to best practicesthat software engineers typically enjoy.

Elements of SoftwareEngineering Practice

In this section, we give an overview ofsome of the primary tools, techniques andskills of software engineering, and presentthe results of a survey which seeks to quan-tify the prevalence of these software engi-neering elements in bioinformaticsettings. Our choice of which tools andtechniques to emphasize are based onexperience as practitioners, and we find

ourselves in full agreement with otherauthors such as Wilson et al. with respectto those choices.15

Figure 1 shows the essential elementsof software engineering practice, and illus-trates the dependencies between them.We categorize Software Engineering ele-ments into the separate layers of infrastruc-ture, processes and practices, each layerbuilding on the one below.

The basis of good practice lies in thecorrect choice of the Tools and Infra-structure indicated in Figure 1. Ofcourse a software engineer chooses thetools based on the practices that shewishes to encourage, but their presencein a development environment is like agenetic marker that accompanies goodengineering standards. The layers repre-senting automated Processes and experi-ence-based Practices contain their own‘markers’ which depend on those in thelayers below: Even the most skilled andexperienced engineer will be thwarted byan inadequate development environ-ment. With this in mind, we designed asurvey to measure the prevalence ofthese layered ‘markers’ in bioinformaticresearch teams.

A Survey of BioinformaticSoftware Engineering Practice

We conducted 2 parallel surveys, onedistributed to life scientists, and the otherto developers of business software. In bothcases we asked questions to identify atti-tudes toward certain key ‘markers’ of soft-ware engineering as described in theprevious section. We reached 81 life scien-tists, 45 of whom developed their ownsoftware, and 36 business software devel-opers. We used the Likert system of ques-tionnaire design in which respondents ratetheir attitudes to statements from stronglydisagree to strongly agree with a total of 5degrees to choose from. We present theresults below in a form that compares thedifferences between the 2 groups. Thepurpose of the business software data is toact as a control for attitudes toward thesoftware engineering ‘markers’. From thefirst set of results (Fig. 2) it’s clear thatbusiness software developers and life scien-tists have distinctly different attitudestoward the standard elements of softwareengineering infrastructure.

Commercial developers almost unani-mously strongly agree with the statementsthat build systems, source control, IDEsand Continuous Integration engines areused in their place of work. Life scientistsshow no such consensus. The closest theycome to each other is in their attitude tothe statement on source control where onaverage they agree with it, but where a sig-nificant minority have no opinion or dis-agree. Source control systems are ofcentral importance in software engineer-ing practice, on a par with disinfectant inan operating theater. Complete adherenceto their use should be considered thenorm, as is borne out by the business soft-ware respondents. The other 3 elementsshould be considered similarly vital togood software engineering practice.

When it comes to processes (auto-mated or automatable) applied usingthe elements of infrastructure, the dis-tinction between life scientists and thecontrol group of business softwaredevelopers is still clear even if less pro-nounced (Fig. 3). This difference ismostly a function of a reduced consen-sus among software engineers ratherthan a positive change in attitudes fromFigure 1. Key Components of Software Engineering.


Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

the life scientists. A particular point tonotice is that although there is a rela-tively good showing for the use of

source control in the previous set ofresults, life scientists generally neitheragree nor disagree with the use of

branching, despite the fact branching isone of the main advantages of usingsource control.

Figure 3. Responses to questions on processes.

Figure 2. Responses to questions on infrastructure.


Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

As we look at the results for practicesand skills (Fig. 4), a pattern begins toemerge. The further up the pyramid wego, the ‘softer’ the consensus among soft-ware engineers, while the attitudes of thelife scientists remain more or less static.The overall picture of a clean, albeitsmaller, separation remains.

The results dealing with goals andambitions (Fig. 5) present a break withthe previous pattern. Rather than thesoftware engineers falling back to the

neutral position of the life scientists,the latter group shows a stronger andclearer consensus in favor of the state-ments presented to them. In fact thereis no discernible difference in attitudesbetween the 2 camps. It is interestingthat in this section we have posed ourquestions in a slightly different way.Rather than asking about actual use, wehave asked about importance. The goalsand aspirations of the life scientists withregard to software architecture are no

different to those of commercial soft-ware engineers. What they lack how-ever, as indicated by the previousresults, are the instruments and techni-ques necessary to achieve those goals.

Alternative Solutions

The results of our survey confirmthe deficit in bioinformatic software engi-neering skills, while at the same time

Figure 4. Responses to questions on practices.


Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

indicating an ambition among bioinfor-maticians to bridge the gap. We now lookat what the causes and remedies of thisdeficit might be and revisit the reviewedliterature. We believe that in order toaddress this deficit effectively, we musttake into account the difference betweencomputer programming and softwareengineering as discussed above. We assertthat it is impractical, if not impossible, tointroduce the missing software engineer-ing expertise into bioinformatics by treat-ing that expertise as a sub-component ofbioinformatics. Software Engineeringencompasses too large a body of knowl-edge, which is acquired by too different aform of education to simply be bolted onto existing bioinformatic curricula. Putanother way, we believe that the mosteffective way of introducing software engi-neering values into bioinformatic researchis to introduce software engineers them-selves, by recognizing the separate role ofthe Bioinformatic Engineer in bioinfor-matic research projects, and identifyingthe interface between the engineer and thescientist. Before we discuss how this might

be done, we look again at the solutionsfrom the existing literature, in the light ofour assertions and findings above.

Education

While improvements in bioinformaticcurricula, as suggested by Umarji et al.would be a positive step that could lead toimproved communication between bioin-formaticians and software engineers, suchimprovements would not be sufficient tobridge the current gap.12 We believe thatin addition, educators should target soft-ware engineering curricula and create spe-cialized Masters and PhD programs inBioinformatic Engineering, creating spe-cialized software engineers who can dialogwith biologists and bioinformaticians ascustomers based on a shared understandingof the research environment and the biol-ogy domain. Early introduction of soft-ware engineering graduates intobioinformatic research programmes wouldhave a positive influence on the softwaredeveloped as part of such research.

Methodology

Selecting appropriate methodologies isanother necessary but insufficient step.Investigations into software engineeringmethodologies that suit bioinformaticsprojects are worthy, but who would steerthe use of such techniques in the absenceof a skilled and experienced software engi-neer? As Kane et al. have found, “[the]agile development approach . . . provides amodel for collaboration between softwareengineers and researchers."16 In otherwords, a good methodology works best inthe context of existing software engineer-ing skills, rather than as a replacement forthem. Given such a context, it is worthpointing out the advantages of applyingagile methodologies to scientific softwaredevelopment in general, and bioinformat-ics in particular. One benefit of agile pro-cesses is a rigour in defining requirementswhile at the same time embracing changein a way that permits discovery throughprototyping. Agility can be seen as anexample of modern software engineeringserving the needs of bioinformatics.

Figure 5. Responses to questions on goals.


Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

Special Pleading

What about those arguments touchedon above which suggest that scientific soft-ware development is too complex, toofluid in its requirements, and too badlyfunded to use software engineering techni-ques? Such arguments are based on specialpleading and are problematic in a numberof ways. Firstly, they don’t point towardsolutions. And secondly, such claims ofbeing a special case can be arrived at tooeasily by specialist groups such as biolo-gists, and fit too well with assumptionsand professional biases - asserted andaccepted without ever being truly exam-ined. We examine those assumptions nowin the order outlined above: complexity ofdomain, volatility of requirements and lim-ited resources.

Complexity of Domain

There is something inherently contra-dictory in the claim that systems biology istoo complex for software engineering orsoftware engineers, thus making the biolo-gist-as-developer a necessary feature of thebioinformatic landscape. If biological sys-tems are complex then it follows that thesoftware systems which model them will becomplex. The complexity of the softwarehowever is twofold: Firstly there is theProblem Domain complexity inheriteddirectly from the biology. Secondly there isthe Solution Domain complexity that isinherent in any software abstraction. Thislatter, software-specific complexity is equalto the complexity of the modeled biology,but may add extra complexity of its own,depending on how sensibly the software isdesigned and realized. It takes a skilledsoftware engineer, using modern softwareengineering techniques, to minimize thissoftware complexity factor. It is clear thatthe “average developer” will not acquirebiological expertise to same extent as thebiologist, and will understand the biologyonly to that extent required to capture thenecessary abstractions for the problem inhand, in collaboration with the biologist.It should be equally clear that complex sys-tems modeled in software exclusively bybiologists with limited software engineer-ing experience will suffer from the

limitations outlined at the beginning ofthis article. In the increasingly parallel, dis-tributed and data-saturated context ofmodern bioinformatics, the exclusive roleof scientist-as-developer advanced by Segal& Morris should be considered a bugrather than a feature.

Volatility of Requirements

The observation that scientific require-ments are simply too fluid will bring arum smile to the face of any experiencedsoftware engineer. The day-to-day realityof commercial projects is very different tothe clean lines described in methodologyliterature. Perceived business needs alwayscome first, often to the detriment of bestpractice. Part of the engineer’s job is toincorporate unexpected and even capri-cious requirements into the project whileminimizing the damage done.

In one sense, the life sciences enjoy animportant advantage over business: Theproblem domain is much more stable overtime and across projects. Certainly itgrows to incorporate discoveries and occa-sional upheavals. But amino acids and celldivision don’t go in and out of fashionlike financial instruments or business pro-cesses. Biologists uncover and even invent,but the underlying biology itself limitsnovelty. This allows engineers to build upand usefully retain expertise in the problemdomain. (This cannot be said about com-mercial domains, where the only underly-ing biology that limits change is the neo-cortex of the customer.) One feature ofmodern software development which cantake advantage of this relatively stabledomain and facilitate communicationbetween engineer and scientist is theDomain Specific Language (DSL).17 Asan alternative to a general purpose pro-gramming language, a DSL can provide afluent interface between the problemdomain of the biologist and the solutiondomain of the engineer. As such a DSL“offers substantial gains in productivityand even enables end-user program-ming."18 As pointed out by Swertz et al,“[t]he working systems biologist wants toapply software tools to increase the under-standing of biological function withouthaving to ‘tinker under the hood’."19

DSLs bring some potential disadvantagesas well, for example the risk of creating‘islands’ of code so specialized as tobecome impenetrable to the non-expertuser. Notwithstanding such risks, andindeed by way of addressing them, weconsider the application of DSLs to bioin-formatic software development as a wor-thy subject of further research.

Limited Resources

Budgets on commercial software proj-ects are tight, as are the deadlines, and anyexperienced developer knows that there is acontinuous cost/benefit calculationinvolved when making any significanttechnical decision. In this sense, commer-cial projects are no different to scientificresearch programs. What does differ is thebudgeting process. Bioinformatic research-ers should allocate adequate resources forsoftware development at the outset.

Bioinformatics Engineering

We are arguing here for the recognitionof the separate role of Bioinformatic Engi-neer in research teams, but this raises manyquestions of a practical nature and perhapssome philosophical ones too. How shouldbioinformatic engineers and bioinformati-cians best communicate? Where wouldtheir competencies overlap? What shouldsmall teams with limited funding do? Andin any case, does this separation of roles flyin the face of the cross-disciplinary natureof bioinformatics itself?

The intersection of the 2 sets inFigure 6 shows the role that education canplay in preparing bioinformaticians andbioinformatic engineers to work together.Engineers need to know enough about thebiology domain to communicate effec-tively with bioinformaticians. Complexityof the problem domain does not preventthis from happening in similarly complexcommercial settings, and despite muchspecial pleading in the literature there isinsufficient reason to think that bioinfor-matics would be different. Commercialsoftware engineers typically specialize in‘verticals’ and market themselves as muchon the basis of their domain experience as


Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

on their technical skills. Bioinformaticscan be seen as a particularly stable andwell defined problem domain, itself subdi-vided into various verticals. Bioinformati-cians already understand programmingenough to communicate their ideas andrequirements through code (even if, as wehave indicated earlier, there is enormouspotential for DSLs to close the communi-cation gap even further).

The non-intersecting parts of the 2 setsdemonstrate the need for the bioinfor-matic engineer in the first place. Theentire field of software engineering is toolarge to incorporate into the skillset of bio-informatics, and much of it is of no inter-est to the bioinformatician in the firstplace. Nobody expects him to build, oreven understand the inner workings of thecentrifuges and mass spectrometers thatare so essential to research. Why thenshould we expect him to master the art ofbuilding large-scale, performant and pro-duction-ready software systems?

The point we are making in distin-guishing the role of Bioinformatics Engi-neer can be summarized as follows:Software Engineering is vital to the disci-pline of bioinformatics without being acore skill of that discipline. This question ofspecialization is a logistic or even eco-nomic one which finds echoes in Ricardo’sLaw of Comparative Advantage: Even if itwere possible for bioinformaticians to sub-sume the entire discipline of softwareengineering into their body of knowledge,it would not be desirable.20 It would sim-ply represent bad value. A bioinformati-cian investing the necessary time in

engineering skills would pay a heavy pricein terms of Opportunity Cost - the timenot spent on study and research in corebiological questions. Much better to leanon an engineering specialist in those keymoments of research and developmentwhen engineering skills come to the fore.

What, then, are those key moments?Figure 7 categorizes the kinds of softwaredevelopment that would typically takeplace in a research team into 4 quadrants,based on 2 variables: Whether the work iscore or peripheral to the team’s output(focus), and whether the resulting softwareshould be considered temporary or perma-nent (durability). We can use these varia-bles to pinpoint the phases of researchwhere bioinformaticians could increasetheir productivity by handing over to bio-informatic engineers, or at the very least,“change hat” and temporarily adopt anengineering approach.

To explain what we mean by these cate-gories and variables, we refer to Morris’observation4 that “[o]ne concern is that sci-entific prototype code, if successful, seguesinto applications that are distributed forwider research use. Later it may be adoptedfor production purposes, sometimes evenfor safety critical use.” In other words, it isimportant to allow bioinformaticians tocreate code that is exploratory in naturebut fragile from an engineering point ofview. But it is equally important to ensurethat such code does not form the basis ofpublished findings or shared products andtools. The consequence of such fragility onpublished finding includes, but is not lim-ited to, a difficulty in reproducing results,

or a difficulty in analyzing the correctnessof the code (due for example to poor read-ability, unreproducible builds, or evenaccess to the correct version of the code).The consequence of fragile engineering onshared products and tools should be self-evident. A necessary balance between theneed to explore and the need to consolidatemust be struck, and we model this balancewith the durability variable that distin-guishes between temporary and permanentsoftware.

Once we know which category a partic-ular piece of software belongs to, we cancreate procedures for moving it to a differ-ent category should the need arise. Forexample, according to Sanders and Kellysome teams took a “do it twice” approach- that is, a rewrite of software according tomore exacting engineering require-ments.21 This corresponds to moving soft-ware from the upper half to the lower halfof Figure 7. So this is already practiced insome research teams. The point is toexplicitly recognize these categories andput processes in place to avoid the kind oferror that Morris describes.

The other variable, focus, distinguishesbetween software that is used as part ofthe scientific discovery process in a specificline of research, and code that could beconsidered ‘utility code’ to be reused inmany different settings. The formershould in principle be published alongwith the findings it helped to produce.The latter might find its way into a com-mercial or opensource product to beshared with the wider bioinformatic com-munity. In both cases, the need for atransformation from temporary to perma-nent is the same, but the engineering skillsand processes used to achieve it would dif-fer - hence the distinction between coreand peripheral software.

If a research team cannot fund a dedi-cated software engineer, it can still makeuse of the ideas presented here. The cross-over points in competencies that we haveidentified above can serve as processboundaries, indicating where bioinforma-ticians should “change hat” and begin toapproach their work with different goalsin mind. But in order for this to happen,they must know that these boundariesexist; at a minimum they should be edu-cated in an appreciation of software

Figure 6. Suggested project Roles of bioinformaticians and bioinformatic engineers.


Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

engineering even if their own engineeringtraining will be of necessity - a peripheralpart of their curriculum. As teams ramp-up in size and funding, they will permitthemselves to take on specialists, and wecontend that bioinformatic engineersshould be one category of such specialists.

Projects that weave software engineer-ing best practices into bioinformaticsresearch and in silico experimentation reapconcrete rewards. By employing softwareengineering techniques such as a layeredarchitecture, explicit development modelsand a rigorous requirements-gatheringapproach, Walsh et al.22 produced anaccelerated research workflow tool, whichis amenable to extension and is highlyscalable. A blended team of biologists,computational scientists and engineerswhich has modeled integrated physiologi-cal processes of Caenorhabditis elegans(C. elegans) in silico has asserted that “[i]norder to be able to effectively manage thecomplexity that comes with integrating

and maintaining coarse-grained architec-tures, tools, digital information artifactsand codebases, it is important for compu-tational biology to fully embrace softwareengineering methodologies and best prac-tices and follow the lead of the simulationbased research in the physicalsciences.”23,24

Conclusion

Bioinformatics is still in the cradle,compared to many of its sibling sciences.In common with many other fields thatcombine computation, mathematics andstatistics with the sciences, a lot of thoughtand energy is going into the creation oftruly cross-disciplinary practitioners. Thegoal is to combine in one brain a richknowledge of both biology and computa-tion, because answering the questions thatarise in one has become heavily dependent

on mastering the skills developed in theother.

While there is no doubt about thesoundness of this ambition, we feel thata distinction must be made betweencomputational skills and software engi-neering skills. More to the point, wefeel that these skill sets are so diverseand mastered by such different meth-ods, that it is unrealistic to expect a sin-gle practitioner to combine biology,computational methods and softwareengineering. Moreover, it is unnecessaryand uneconomical to try.

The alternative is already available tous. Software engineering is a discipline inwhich we apply computational skills toproblems of other disciplines in such away as to result in robust, reliable andmaintainable solutions. While some fieldsof application are more exacting thanothers there is no qualitative differencebetween commercial software engineeringand scientific software engineering. Theextra degree of scientific complexity hasparallels in commercial software develop-ment. The existing tools, techniques andpractices of software engineers can bendto the particular needs of research. Theonly question that remains is how to reli-ably place those skills of modern softwareengineering at the disposal of bioinfor-matic researchers.

We argue for the explicit recognition ofthe role of the bioinformatic engineer, asoftware engineer who has been educated inthe standard way for that discipline, and hasspecialized in the ‘vertical’ of systems biol-ogy (or a sub-field such as genomics, ormetabolomics). Such an individual wouldembody all the skills that one would expectfrom an expert software engineer but wouldalso have a deep understanding of the kindsof problems that biologists need to solve,and an appreciation for the manner inwhich they go about their research. In otherwords, we believe that the most effectiveway of introducing software engineeringvalues into bioinformatic research is tointroduce software engineers themselves. Asa reasonable compromise, where this idealis not immediately achievable, bioinforma-ticians could perform the role of bioinfor-matics engineer during the delineatedphases of project work that we haveidentified.

Figure 7. Handover points between bioinformaticans and bioinformatic engineers.


Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

One difficulty to be addressed as partof the proposed approach is hinted at byPrabhu et al. when they quote one scien-tist as saying that even “funding agenciesthink software development is free,” andregard development of robust scientificcode as “second class” compared to otherscientific achievements.5 The way inwhich research projects are funded doesnot currently take into account the costsassociated with developing software.While not every project will be able tobudget for a full-time bioinformatic engi-neer, research groups should be able toshare such resources, or make use of spe-cialized external software companieswhich would grow in number to meetdemand.

The bioinformatic engineer does not inany sense remove the need for the cross-disciplinary figure of the bioinformatician.On the contrary - it is essential to an effec-tive collaboration between bioinformati-cian and engineer that one have the skillsand vocabulary to communicate needs tothe other. The bioinformatician will veryoften communicate with the engineerusing source code. As suggested by Wilsonet al. it would be best if the bioinformati-cian also had a working knowledge of thebasic tools of software engineering such assource control and unit tests. But theresponsibility of identifying problems indesign and code, fixing them, and shapingexploratory code into well-engineered sol-utions would lie with the bioinformaticengineer. We predict that this would sub-stitute hours of drudgery for the scientistwith hours of true productivity, and at thesame time ensure performant, testable,maintainable and shareable code andreproducible results for the bioinformaticfield at large.

Recommendations

� Explicit recognition of the role of Bio-informatic Engineer, along with ashared understanding of the competen-cies, functions and interfaces of thatrole.

� The creation of specialist post-graduatecurricula to allow software engineeringgraduates to specialize in bioinformaticengineering. This should be seen as a

parallel and complementary effort tothe enlistment of computer science andbiology graduates into bioinformaticspost-graduate courses.

� Research into bioinformatic DomainSpecific Languages to facilitate collabo-ration between bioinformaticians andbioinformatic engineers.

� Adequate funding for software engi-neering as part of bioinformaticresearch projects.

� Measures to encourage the creation ofbioinformatic engineering companiesto service the needs of smaller researchteams which cannot afford dedicatedinternal bioinformatic engineering staff.Such companies could recruit andcross-train experienced commercialsoftware engineers as well as taking upmasters and PhD graduates from thespecialist bioinformatic engineeringcurricula we have suggested above.

Disclosure of Potential Conflicts of Interest

No potential conflicts of interest weredisclosed.

Funding

Paul Walsh is CTO of NSilico Life Sci-ence Ltd and is a Senior Research Fellowon FP7Marie Curie Actions IAPP ProjectClouDx-i under grant agreement no.324365.

References

1. Verma D, Gesell J, Siy H, Zand M. Lack of softwareengineering practices in the development of bioinfor-matics software. ICCGI 2013; 2013:57-62

2. Baxter SM, Day SW, Fetrow JS, Reisinger SJ. Scientificsoftware development is not an oxymoron. PLoScom-putational Biol 2006; 2:e87

3. Segal J. Some problems of professional end user devel-opers. Visual languages and human-centric computing,2007 vL/hCC 2007 iEEE symposium on 2007; 111-8;http://dx.doi.org/10.1109/VLHCC.2007.17

4. Morris C. Some lessons learned reviewing scientificcode. Proc 30th Intl Conference Software Eng(iCSE08) 2008

5. Prabhu P, Jablin TB, Raman A, Zhang Y, Huang J,Kim H, Johnson NP, Liu F, Ghosh S, Beard S, et al. Asurvey of the practice of computational science. StatePractice Rep 2011; 19

6. Schindewolf M, Biliari B, Gyllenhaal J, Schulz M,Wang A, Karl W. What scientific applications canbene-fit from hardware transactional memory? In: High per-formance computing, networking, storage and analysis(sC), 2012 international conference for. IEEE; 2012;1-11

7. Agha GA, Mason IA, Smith SF, Talcott CL. A founda-tion for actor computation. J Functional Programming

1997; 7:1-72; http://dx.doi.org/10.1017/S095679689700261X

8. Wiewi�orka MS, Messina A, Pacholewska A, MaffiolettiS, Gawrysiak P, Okoniewski MJ. SparkSeq: fast,scal-able, cloud-ready tool for the interactive genomic dataanalysis with nucleotide precision. Bioinformatics2014; btu343; PMID:24845651

9. Ioannidis JP. Why most published research findingsare false. PLoS Med 2005; 2:e124;PMID:16060722; http://dx.doi.org/10.1371/journal.pmed.0020124

10. Hannay JE, MacLeod C, Singer J, Langtangen HP,Pfahl D, Wilson G. How do scientists develop andusescientific software? Proceedings of the 2009 iCSE work-shop on software engineering for computational scienceand engineering 2009; 1-8; http://dx.doi.org/10.1109/SECSE.2009.5069155

11. Segal J, Morris C. Developing scientific software. Soft-ware IEEE 2008; 25:18-20; http://dx.doi.org/10.1109/MS.2008.85

12. Umarji M, Seaman C, Koru AG, Liu H. Software engi-neering education for bioinformatics. Softwareengineer-ing education and training, 2009 cSEET’09 22ndconference on 2009; 216-23; http://dx.doi.org/10.1109/CSEET.2009.44

13. Dijkstra EW. Selected Writings on Computing: a Per-sonal Perspective. Springer-Verlag New York, Inc. 1982

14. Surendran K, Hays H, Macfarlane A. Simulating a soft-ware engineering apprenticeship. Software, IEEE 2002;19:49-56; http://dx.doi.org/10.1109/MS.2002.1032854

15. Wilson G, Aruliah D, Brown CT, Hong NPC, DavisM, Guy RT, Haddock SH, Huff KD, Mitchell MI,Plumbley MD, et al. Best practices for scientific com-puting. PLoS Biol 2014; 12:e1001745.;PMID:24415924; http://dx.doi.org/10.1371/journal.pbio.1001745

16. Kane DW, Hohman MM, Cerami EG, McCormickMW, Kuhlmman KF, Byrd JA. Agile methods inbio-medical software development: a multi-site experiencereport. Bmc Bioinformatics 2006; 7:273.;PMID:16734914; http://dx.doi.org/10.1186/1471-2105-7-273

17. Van Deursen A, Klint P, Visser J. Domain-specific lan-guages: An annotated bibliography. Sigplan Notices2000; 35:26-36; http://dx.doi.org/10.1145/352029.352035

18. Kosar T, Barrientos PA, Mernik M, A preliminarystudy on various implementation approachesofdomain-specific language. Information Software Tech-nol 2008; 50:390-405; http://dx.doi.org/10.1016/j.infsof.2007.04.002

19. Swertz MA, Jansen RC. Beyond standardization:dynamic software infrastructures for systems biology.Nat Rev Genet 2007; 8:235-43; PMID:17297480;http://dx.doi.org/10.1038/nrg2048

20. Ruffin R. David ricardo’s discovery of comparativeadvantage. History Political Economy 2002; 34:727-48; http://dx.doi.org/10.1215/00182702-34-4-727

21. Sanders R, Kelly D. Dealing with risk in scientific soft-ware development. IEEE software 2008; 25:21-8;http://dx.doi.org/10.1109/MS.2008.84

22. Walsh P, Carroll J, Sleator RD. Accelerating in silicoresearch with workflows: a lesson in simplicity. Com-puters Biol Med 2013; 43:2028-35; http://dx.doi.org/10.1016/j.compbiomed.2013.09.011

23. Szigeti B, Gleeson P, Vella M, Khayrulin S, Palyanov A,Hokanson J, Currie M, Cantarelli M, Idili G, Larson S.OpenWorm: an open-science approach to modelingcaenorhabditis elegans. Frontiers Computational Neu-roscience 2014; 8; http://dx.doi.org/10.3389/fncom.2014.00137

24. Idili G, Cantarelli M, Buibas M, Busbice T, Coggan J,Grove C, Khayrulin S, Palyanov A, Larson SD.Managingcomplexity in multi-algorithm, multi-scale biological sim-ulations: an integrated software engineering and neuroin-formatics approach. Frontiers Neuroinformat ConferenceAbstract: 4th iNCF Congress Neuroinformat 2011; 112


Dow

nloa

ded

by [

194.

88.4

.145

] at

08:

31 1

5 Ju

ly 2

015

engineering bioinformatics: building reliability, performance and productivity into bioinformatics...

Documents