2015 12-18- avoid having to retract your genomics analysis - popgroup reproducible research...

Avoid retracting your analysis!

@yannick__ https://wurmlab.github.io

Reproducible research for ecological and evolutionary genomics.

mailto:[email protected]

http://wurmlab.github.io

Geoffrey Chang: Crystallographer
• Beckman Foundation Young Investigator Award
• Presidential Early Career Award

Journal of Molecular Biology (2003) Chang. Structure of MsbA from Vibrio cholera: a multidrug resistance ABC transporter homolog in a closed conformation.

PNAS (2004) Ma & Chang. Structure of the multidrug resistance efflux transporter EmrE from Escherichia coli.

Science (2005) Reyes & Chang. Structure of the ABC transporter MsbA in complex with ADP vanadate and lipopolysaccharide.

Science (2005) Pornillos et al. X-ray structure of the EmrE multidrug transporter in complex with a substrate.

Science (2001) Chang & Roth. Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters.

Science (2001) Chang & Roth.

1856

NEWS>>

THIS WEEK
A dolphin’s demise
Indians wary of nuclear pact

1860 1863

SCIENTIFIC PUBLISHING

A Scientist’s Nightmare: Software Problem Leads to Five Retractions

Until recently, Geoffrey Chang’s career was on a trajectory most young scientists only dream about. In 1999, at the age of 28, the protein crystallographer landed a faculty position at the prestigious Scripps Research Institute in San Diego, California. The next year, in a ceremony at the White House, Chang received a Presidential Early Career Award for Scientists and Engineers, the country’s highest honor for young researchers. His lab generated a stream of high-profile papers detailing the molecular structures of important proteins embedded in cell membranes.

Then the dream turned into a nightmare. In September, Swiss researchers published a paper in Nature that cast serious doubt on a protein structure Chang’s group had described in a 2001 Science paper. When he investigated, Chang was horrified to discover that a homemade data-analysis program had flipped two columns of data, inverting the electron-density map from which his team had derived the final protein structure. Unfortunately, his group had used the program to analyze data for other proteins. As a result, on page 1875, Chang and his colleagues retract three Science papers and report that two papers in other journals also contain erroneous structures.

“I’ve been devastated,” Chang says. “I hope people will understand that it was a mistake, and I’m very sorry for it.” Other researchers don’t doubt that the error was unintentional, and although some say it has cost them time and effort, many praise Chang for setting the record straight promptly and forthrightly. “I’m very pleased he’s done this because there has been some confusion” about the original structures, says Christopher Higgins, a biochemist at Imperial College London. “Now the field can really move forward.”

The most influential of Chang’s retracted publications, other researchers say, was the 2001 Science paper, which described the structure of a protein called MsbA, isolated from the bacterium Escherichia coli. MsbA belongs to a huge and ancient family of molecules that use energy from adenosine triphosphate to transport molecules across cell membranes. These so-called ABC transporters perform many essential biological duties and are of great clinical interest because of their roles in drug resistance. Some pump antibiotics out of bacterial cells, for example; others clear chemotherapy drugs from cancer cells. Chang’s MsbA structure was the first molecular portrait of an entire ABC transporter, and many researchers saw it as a major contribution toward figuring out how these crucial proteins do their jobs. That paper alone has been cited by 364 publications, according to Google Scholar.

Two subsequent papers, both now being retracted, describe the structure of MsbA from other bacteria, Vibrio cholera (published in Molecular Biology in 2003) and Salmonella typhimurium (published in Science in 2005). The other retractions, a 2004 paper in the Proceedings of the National Academy of Sciences and a 2005 Science paper, described EmrE, a different type of transporter protein.

Crystallizing and obtaining structures of five membrane proteins in just over 5 years was an incredible feat, says Chang’s former postdoc adviser Douglas Rees of the California Institute of Technology in Pasadena. Such proteins are a challenge for crystallographers because they are large, unwieldy, and notoriously difficult to coax into the crystals needed for x-ray crystallography. Rees says determination was at the root of Chang’s success: “He has an incredible drive and work ethic. He really pushed the field in the sense of getting things to crystallize that no one else had been able to do.” Chang’s data are good, Rees says, but the faulty software threw everything off.

Ironically, another former postdoc in Rees’s lab, Kaspar Locher, exposed the mistake. In the 14 September issue of Nature, Locher, now at the Swiss Federal Institute of Technology in Zurich, described the structure of an ABC transporter called Sav1866 from Staphylococcus aureus. The structure was dramatically, and unexpectedly, different from that of MsbA. After pulling up Sav1866 and Chang’s MsbA from S. typhimurium on a computer screen, Locher says he realized in minutes that the MsbA structure was inverted. Interpreting the “hand” of a molecule is always a challenge for crystallographers, Locher notes, and many mistakes can lead to an incorrect mirror-image structure. Getting the wrong hand is “in the category of monumental blunders,” Locher says.

On reading the Nature paper, Chang quickly traced the mix-up back to the analysis program, which he says he inherited from another lab. Locher suspects that Chang would have caught the mistake if he’d taken more time to obtain a higher resolution structure. “I think he was under immense pressure to get the first structure, and that’s what made him push the limits of his data,” he says. Others suggest that Chang might have caught the problem if he’d paid closer attention to biochemical findings that didn’t jibe well with the MsbA structure. “When the first structure came out, we and others said, ‘We really

CREDIT: R. J. P. DAWSON AND K. P. LOCHER, NATURE 443, 180 (2006)

22 DECEMBER 2006 VOL 314 SCIENCE www.sciencemag.org

Flipping fiasco. The structures of MsbA (purple) and Sav1866 (green) overlap little (left) until MsbA is inverted (right).

Published by AAAS. Downloaded from www.sciencemag.org on January 5, 2007.


Sav1866 Dawson & Locher (2006) Nature
Science (2001) Chang & Roth.

Comparison with 3D structure of ortholog


www.sciencemag.org SCIENCE VOL 314 22 DECEMBER 2006 1875


LETTERS
edited by Etta Kavanagh

Retraction

We wish to retract our Research Article “Structure of MsbA from E. coli: A homolog of the multidrug resistance ATP binding cassette (ABC) transporters” and both of our Reports “Structure of the ABC transporter MsbA in complex with ADP•vanadate and lipopolysaccharide” and “X-ray structure of the EmrE multidrug transporter in complex with a substrate” (1–3).

The recently reported structure of Sav1866 (4) indicated that our MsbA structures (1, 2, 5) were incorrect in both the hand of the structure and the topology. Thus, our biological interpretations based on these inverted models for MsbA are invalid.

An in-house data reduction program introduced a change in sign for anomalous differences. This program, which was not part of a conventional data processing package, converted the anomalous pairs (I+ and I-) to (F- and F+), thereby introducing a sign change. As the diffraction data collected for each set of MsbA crystals and for the EmrE crystals were processed with the same program, the structures reported in (1–3, 5, 6) had the wrong hand.
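A toy sketch of how a swapped pair of inputs flips every sign, and how a single hand-computed known answer would catch it. This is not the actual in-house program: the function names are hypothetical, and the convention that the anomalous difference is F+ minus F- is an assumption made for illustration only.

```python
def anomalous_differences(f_plus, f_minus):
    """Return per-reflection differences, defined here (by assumption) as F+ minus F-."""
    return [fp - fm for fp, fm in zip(f_plus, f_minus)]


def buggy_anomalous_differences(f_plus, f_minus):
    """Same calculation with the two columns swapped, so every sign is inverted."""
    return anomalous_differences(f_minus, f_plus)


def matches_known_answer(func):
    """Check func against a value worked out by hand: 10.0 - 8.0 = +2.0."""
    return func([10.0], [8.0]) == [2.0]


if __name__ == "__main__":
    print(matches_known_answer(anomalous_differences))        # correct column order
    print(matches_known_answer(buggy_anomalous_differences))  # swapped column order
```

The point of the sketch is only that a check against one independently computed value is cheap to write and distinguishes the two versions, which otherwise produce output of identical shape and magnitude.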

The error in the topology of the original MsbA structure was a consequence of the low resolution of the data as well as breaks in the electron density for the connecting loop regions. Unfortunately, the use of the multicopy refinement procedure still allowed us to obtain reasonable refinement values for the wrong structures.

The Protein Data Bank (PDB) files 1JSQ, 1PF4, and 1Z2R for MsbA and 1S7B and 2F2M for EmrE have been moved to the archive of obsolete PDB entries. The MsbA and EmrE structures will be recalculated from the original data using the proper sign for the anomalous differences, and the new Cα coordinates and structure factors will be deposited.

We very sincerely regret the confusion that these papers have caused and, in particular, subsequent research efforts that were unproductive as a result of our original findings.

GEOFFREY CHANG, CHRISTOPHER B. ROTH, CHRISTOPHER L. REYES, OWEN PORNILLOS, YEN-JU CHEN, ANDY P. CHEN

Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA 92037, USA.

References
1. G. Chang, C. B. Roth, Science 293, 1793 (2001).
2. C. L. Reyes, G. Chang, Science 308, 1028 (2005).
3. O. Pornillos, Y.-J. Chen, A. P. Chen, G. Chang, Science 310, 1950 (2005).
4. R. J. Dawson, K. P. Locher, Nature 443, 180 (2006).
5. G. Chang, J. Mol. Biol. 330, 419 (2003).
6. C. Ma, G. Chang, Proc. Natl. Acad. Sci. U.S.A. 101, 2852 (2004).





😥

Geoffrey Chang
• Beckman Foundation Young Investigator Award
• Presidential Early Career Award

Science (2001) Chang & Roth. Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters.

Journal of Molecular Biology (2003) Chang. Structure of MsbA from Vibrio cholera: a multidrug resistance ABC transporter homolog in a closed conformation.

PNAS (2004) Ma & Chang. Structure of the multidrug resistance efflux transporter EmrE from Escherichia coli.

Science (2005) Reyes & Chang. Structure of the ABC transporter MsbA in complex with ADP vanadate and lipopolysaccharide.

Science (2005) Pornillos et al. X-ray structure of the EmrE multidrug transporter in complex with a substrate.



This is costly

For:
• the individual
• collaborators
• the institution
• 1000s of researchers performing follow-up work
• science
• society



Genome bioinformatics is hard

Genome bioinformatics is harder than (many) other data sciences

• Understanding/visualising/analysing/massaging big data is hard.
• Biology/life is complex.
• Biologists lack computational training.
• Field is young.
• Analysis tools (generally) suck:
  • badly written
  • badly tested
  • hard to install
  • output quality… often questionable.
• Data sizes keep growing!
• Data formats keep changing :(


Some sources of inspiration


Community Page

Best Practices for Scientific Computing

Greg Wilson1*, D. A. Aruliah2, C. Titus Brown3, Neil P. Chue Hong4, Matt Davis5, Richard T. Guy6¤, Steven H. D. Haddock7, Kathryn D. Huff8, Ian M. Mitchell9, Mark D. Plumbley10, Ben Waugh11, Ethan P. White12, Paul Wilson13

1 Mozilla Foundation, Toronto, Ontario, Canada, 2 University of Ontario Institute of Technology, Oshawa, Ontario, Canada, 3 Michigan State University, East Lansing, Michigan, United States of America, 4 Software Sustainability Institute, Edinburgh, United Kingdom, 5 Space Telescope Science Institute, Baltimore, Maryland, United States of America, 6 University of Toronto, Toronto, Ontario, Canada, 7 Monterey Bay Aquarium Research Institute, Moss Landing, California, United States of America, 8 University of California Berkeley, Berkeley, California, United States of America, 9 University of British Columbia, Vancouver, British Columbia, Canada, 10 Queen Mary University of London, London, United Kingdom, 11 University College London, London, United Kingdom, 12 Utah State University, Logan, Utah, United States of America, 13 University of Wisconsin, Madison, Wisconsin, United States of America

Introduction

Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around developing new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, combining disparate datasets to assess synthetic problems, and other computational tasks.

Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific knowledge. As a result, recent studies have found that scientists typically spend 30% or more of their time developing software [1,2]. However, 90% or more of them are primarily self-taught [1,2], and therefore lack exposure to basic software development practices such as writing maintainable code, using version control and issue trackers, code reviews, unit testing, and task automation.

We believe that software is just another kind of experimental apparatus [3] and should be built, checked, and used as carefully as any physical apparatus. However, while most scientists are careful to validate their laboratory and field equipment, most do not know how reliable their software is [4,5]. This can lead to serious errors impacting the central conclusions of published research [6]: recent high-profile retractions, technical comments, and corrections because of errors in computational methods include papers in Science [7,8], PNAS [9], the Journal of Molecular Biology [10], Ecology Letters [11,12], the Journal of Mammalogy [13], Journal of the American College of Cardiology [14], Hypertension [15], and The American Economic Review [16].

In addition, because software is often used for more than a single project, and is often reused by other scientists, computing errors can have disproportionate impacts on the scientific process. This type of cascading impact caused several prominent retractions when an error from another group’s code was not discovered until after publication [6]. As with bench experiments, not everything must be done to the most exacting standards; however, scientists need to be aware of best practices both to improve their own approaches and for reviewing computational work by others.

This paper describes a set of practices that are easy to adopt and have proven effective in many research settings. Our recommendations are based on several decades of collective experience both building scientific software and teaching computing to scientists [17,18], reports from many other groups [19–25], guidelines for commercial and open source software development [26,27], and on empirical studies of scientific computing [28–31] and software development in general (summarized in [32]). None of these practices will guarantee efficient, error-free software development, but used in concert they will reduce the number of errors in scientific software, make it easier to reuse, and save the authors of the software time and effort that can used for focusing on the underlying scientific questions.

Our practices are summarized in Box 1; labels in the main text such as “(1a)” refer to items in that summary. For reasons of space, we do not discuss the equally important (but independent) issues of reproducible research, publication and citation of code and data, and open science. We do believe, however, that all of these will be much easier to implement if scientists have the skills we describe.

The Community Page is a forum for organizations and societies to highlight their efforts to enhance the dissemination and value of scientific knowledge.

Citation: Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, et al. (2014) Best Practices for Scientific Computing. PLoS Biol 12(1): e1001745. doi:10.1371/journal.pbio.1001745

Academic Editor: Jonathan A. Eisen, University of California Davis, United States of America

Published January 7, 2014

Copyright: © 2014 Wilson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Neil Chue Hong was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/H043160/1 for the UK Software Sustainability Institute. Ian M. Mitchell was supported by NSERC Discovery Grant #298211. Mark Plumbley was supported by EPSRC through a Leadership Fellowship (EP/G007144/1) and a grant (EP/H043101/1) for SoundSoftware.ac.uk. Ethan White was supported by a CAREER grant from the US National Science Foundation (DEB 0953694). Greg Wilson was supported by a grant from the Sloan Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The lead author (GVW) is involved in a pilot study of code review in scientific computing with PLOS Computational Biology.

* E-mail: [email protected]

¤ Current address: Microsoft, Inc., Seattle, Washington, United States of America

PLOS Biology | www.plosbiology.org 1 January 2014 | Volume 12 | Issue 1 | e1001745

Education

A Quick Guide to Organizing Computational Biology Projects

William Stafford Noble1,2*

1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America

Introduction

Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.

The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting progress. These issues are important because poor organizational choices can lead to significantly slower research progress. I do not claim that the strategies I outline here are optimal. These are simply the principles and practices that I have developed over 12 years of bioinformatics research, augmented with various suggestions from other researchers with whom I have discussed these issues.

Principles

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. This “someone” could be any of a variety of people: someone who read your published article and wants to try to reproduce your work, a collaborator who wants to understand the details of your experiments, a future student working in your lab who wants to extend your work after you have moved on to a new job, your research advisor, who may be interested in understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization will be much, much easier.

To see how these two principles are applied in practice, let’s begin by considering the organization of directories and files with respect to a particular project.

File and Directory Organization

When you begin a new project, you will need to decide upon some organizational structure for the relevant directories. It is generally a good idea to store all of the files relevant to one project under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.

Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments peformed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.
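The top-level layout described above could be set up with a few lines of Python. This is a minimal sketch, not part of Noble's article: the helper name create_project_skeleton is hypothetical, and only the five directory names (data, results, doc, src, bin) come from the text.

```python
from pathlib import Path

# Top-level directory names taken from the description above.
TOP_LEVEL_DIRS = ("data", "results", "doc", "src", "bin")


def create_project_skeleton(root):
    """Create the top-level project directories under root and return their paths."""
    root = Path(root)
    created = []
    for name in TOP_LEVEL_DIRS:
        directory = root / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


if __name__ == "__main__":
    for path in create_project_skeleton("msms"):
        print(path)
```

Scripting the layout, rather than creating directories by hand, means every new project starts from the same structure, which helps the "someone unfamiliar with your project" find their way around.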

Within the data and results directo-ries, it is often tempting to apply a similar,logical organization. For example, youmay have two or three data sets againstwhich you plan to benchmark youralgorithms, so you could create onedirectory for each of them under data.In my experience, this approach is risky,because the logical structure of your finalset of experiments may look drasticallydifferent from the form you initiallydesigned. This is particularly true underthe results directory, where you maynot even know in advance what kinds ofexperiments you will need to perform. Ifyou try to give your directories logicalnames, you may end up with a very longlist of directories with names that, sixmonths from now, you no longer knowhow to interpret.

Instead, I have found that organizingmy data and results directories chro-nologically makes the most sense. Indeed,

Citation: Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. doi:10.1371/journal.pcbi.1000424

Editor: Fran Lewitter, Whitehead Institute, United States of America

Published July 31, 2009

Copyright: © 2009 William Stafford Noble. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The author received no specific funding for writing this article.

Competing Interests: The author has declared that no competing interests exist.


PLoS Computational Biology | www.ploscompbiol.org 1 July 2009 | Volume 5 | Issue 7 | e1000424


http://software.ac.uk


Specific Approaches/Tools

1. Write code for humans



Write code for humans (not computers!)

• For:
  • yourself
  • colleagues / collaborators
  • reviewers
  • other random people who may reuse/improve your code

• Respect conventions (e.g., a style guide)


Programming better

• variable naming

• coding width: 100 characters

• indenting

• Follow conventions, e.g. “Google R Style”

• Versioning: DropBox & http://github.com/

• Automated testing

• “being able to use, understand and improve your code in 6 months & in 60 years” (approximate quote, Damian Conway)

preprocess_snps <- function(snp_table, testing=FALSE) {
  if (testing) {
    # run a bunch of tests of extreme situations.
    # quit if a test gives a weird result.
  }
  # real part of function.
}

Friday, 22 June 12

Use whitespace/indentation!

Same information

Line length Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.

ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE, sep='\t', col.names = c('colony', 'individual', 'headwidth', 'mass'))

ant_measurements <- read.table(file      = '~/Downloads/Web/ant_measurements.txt',
                               header    = TRUE,
                               sep       = '\t',
                               col.names = c('colony', 'individual', 'headwidth', 'mass'))

R style guide extract: http://r-pkgs.had.co.nz/style.html


Write code for humans (not computers!)

• For:
  • yourself
  • colleagues / collaborators
  • reviewers
  • other random people who may want to reuse your code

• Respect conventions (e.g., a style guide)

• Don't optimise (generally…)





2. Organise mindfully


Eliminate redundancy

DRY: Don’t Repeat Yourself

Organise mindfully

& don't reinvent the wheel.

Education

A Quick Guide to Organizing Computational Biology Projects

William Stafford Noble 1,2*

1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America

Introduction

Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.

The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting progress. These issues are important because poor organizational choices can lead to significantly slower research progress. I do not claim that the strategies I outline here are optimal. These are simply the principles and practices that I have developed over 12 years of bioinformatics research, augmented with various suggestions from other researchers with whom I have discussed these issues.

Principles

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. This “someone” could be any of a variety of people: someone who read your published article and wants to try to reproduce your work, a collaborator who wants to understand the details of your experiments, a future student working in your lab who wants to extend your work after you have moved on to a new job, your research advisor, who may be interested in understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization will be much, much easier.

To see how these two principles are applied in practice, let’s begin by considering the organization of directories and files with respect to a particular project.

File and Directory Organization

When you begin a new project, you will need to decide upon some organizational structure for the relevant directories. It is generally a good idea to store all of the files relevant to one project under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.

Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.

Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final set of experiments may look drastically different from the form you initially designed. This is particularly true under the results directory, where you may not even know in advance what kinds of experiments you will need to perform. If you try to give your directories logical names, you may end up with a very long list of directories with names that, six months from now, you no longer know how to interpret.

Instead, I have found that organizing my data and results directories chronologically makes the most sense. Indeed, with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new subdirectory. Later, when you or someone else wants to know what you did, the chronological structure of your work will be self-evident.
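The naming scheme described above can be sketched in a few lines of Python. This is only an illustration, not the article's own tooling: the function name `experiment_dir` and the optional `topic` suffix are assumptions made for this example.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def experiment_dir(root: Path, day: date, topic: Optional[str] = None) -> Path:
    """Create and return a directory named YYYY-MM-DD[-topic] under root.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # ISO 8601 dates (YYYY-MM-DD) sort chronologically when sorted as plain text.
    name = day.isoformat()
    if topic:
        name = f"{name}-{topic}"
    path = root / name
    # exist_ok=True lets several days of work reuse the same directory.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

For example, `experiment_dir(Path("experiments"), date(2008, 12, 19))` would create `experiments/2008-12-19`, matching the directory name used in the text.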

Below a single experiment directory, the organization of files and directories is logical, and depends upon the structure of your experiment. In many simple experiments, you can keep all of your files in the current directory. If you start creating lots of files, then you should introduce some directory structure to store files of different types. This directory structure will typically be generated automatically from a driver script, as discussed below.

The Lab Notebook

In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments that you performed. In addition to describing precisely what you did, the notebook should record your observations, conclusions, and ideas for future work. Particularly when an experiment turns out badly, it is tempting simply to link the final plot or table of results and start a new experiment. Before doing that, it is important to document how you know the experiment failed, since the interpretation of your results may not be obvious to someone else reading your lab notebook.
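One lightweight way to keep such a notebook is a plain-text file to which dated entries are appended. The Python sketch below assumes a single notebook file with one heading per date; the function name `add_notebook_entry` and the heading format are choices made for this example, not something the article prescribes.

```python
from datetime import date
from pathlib import Path


def add_notebook_entry(notebook: Path, day: date, text: str) -> None:
    """Append a dated entry to a plain-text lab notebook.

    Hypothetical helper: the file layout (one '## YYYY-MM-DD' heading per
    entry) is an assumption for illustration.
    """
    entry = f"\n## {day.isoformat()}\n\n{text.strip()}\n"
    # Mode "a" creates the file if it is missing and never overwrites old entries.
    with notebook.open("a", encoding="utf-8") as handle:
        handle.write(entry)
```

Appending rather than rewriting keeps earlier entries untouched, so the file stays in chronological order as long as entries are added as the work happens.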

In addition to the primary text describing your experiments, it is often valuable to transcribe notes from conversations as well as e-mail text into the lab notebook. These types of entries provide a complete picture of the development of the project over time.

In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collaborators to give them status updates on the project.

Note that if you would rather not create your own “home-brew” electronic notebook, several alternatives are available. For example, a variety of commercial software systems have been created to help scientists create and maintain electronic lab notebooks [1–3]. Furthermore, especially in the context of collaborations, storing the lab notebook on a wiki-based system or on a blog site may be appealing.

Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts.

doi:10.1371/journal.pcbi.1000424.g001
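The caption mentions a driver script that generates the split1, split2, and split3 subdirectories automatically. The contents of the actual runall script are not shown in the source, so the following Python sketch only illustrates the idea; `make_split_dirs` and its parameters are hypothetical names.

```python
from pathlib import Path
from typing import List


def make_split_dirs(experiment: Path, n_splits: int = 3) -> List[Path]:
    """Create split1..splitN under an experiment directory and return them.

    Hypothetical stand-in for one step of a driver script; not the runall
    script from the figure.
    """
    splits = []
    for i in range(1, n_splits + 1):
        split = experiment / f"split{i}"
        # exist_ok=True makes the step safe to re-run on an existing layout.
        split.mkdir(parents=True, exist_ok=True)
        splits.append(split)
    return splits
```

Generating the layout from code rather than by hand means the structure is recorded and can be recreated when the experiment has to be redone.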


In each results folder:

• script getResults.rb
• intermediates
• output

Organise mindfully





3. Plan for mistakes


Automatically check consistency with style guide

install.packages("lint")   # once

library(lint)              # every time
lint("file_to_check.R")


Create code tests that are easy to run

• Unit tests == checking edge cases to see if the function works

# do your stuff
# e.g. define speed() function

library(testthat)

expect_that(speed(km = 0, minutes = 60), equals(0))
expect_that(speed(km = 60, minutes = 60), equals(1))
expect_that(speed(km = -4, minutes = 60), throws_error())
expect_that(nrow(significant_SNPs), equals(42))
expect_that(my_model, is_a("lm"))

• Integration tests == "full analysis" but on small data with known results

• e.g. on fake VCF genotype file of 2 loci (one true positive, one true negative)
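An integration test of this kind can be sketched in Python as follows. Everything here is made up for illustration: the fake data is a small dictionary rather than a real VCF file, and `associated_loci` is a toy stand-in for a full analysis, with an arbitrary threshold.

```python
from typing import Dict, List

# Fake genotype data: alternate-allele calls (0 or 1) per individual.
# One locus is a built-in true positive, the other a built-in true negative.
FAKE_GENOTYPES: Dict[str, Dict[str, List[int]]] = {
    "locus_associated": {"cases": [1, 1, 1, 1], "controls": [0, 0, 0, 0]},
    "locus_neutral": {"cases": [1, 0, 1, 0], "controls": [0, 1, 0, 1]},
}


def allele_frequency(calls: List[int]) -> float:
    """Fraction of individuals carrying the alternate allele."""
    return sum(calls) / len(calls)


def associated_loci(
    genotypes: Dict[str, Dict[str, List[int]]], min_difference: float = 0.5
) -> List[str]:
    """Toy analysis: loci whose allele frequency differs between groups."""
    hits = []
    for locus, groups in genotypes.items():
        difference = abs(
            allele_frequency(groups["cases"]) - allele_frequency(groups["controls"])
        )
        if difference >= min_difference:
            hits.append(locus)
    return hits


def test_analysis_on_fake_data() -> None:
    """The whole analysis must find the true positive and nothing else."""
    assert associated_loci(FAKE_GENOTYPES) == ["locus_associated"]
```

Because the correct answer is known in advance, any change to the analysis code that breaks it shows up as soon as the test is run.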



"Continuous integration": Tests should run automagically.

So you don't have to remember (or find time) to do it.

💾 http://github.com

Tests run automatically: http://travis-ci.org

If unexpected result: 📬



Perform "Sanity checks"
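As one possible illustration of a sanity check, the Python sketch below refuses to continue when intermediate values are impossible. The choice of allele frequencies as the quantity to check, and the name `check_allele_frequencies`, are assumptions made for this example.

```python
from typing import Dict


def check_allele_frequencies(frequencies: Dict[str, float]) -> None:
    """Raise ValueError if the table is empty or a frequency is outside [0, 1].

    Hypothetical check for illustration; real analyses need their own checks.
    """
    if not frequencies:
        raise ValueError("no loci found: was the input file empty?")
    for locus, freq in frequencies.items():
        # The chained comparison is also False for NaN, so NaN is rejected too.
        if not 0.0 <= freq <= 1.0:
            raise ValueError(f"{locus}: allele frequency {freq} is outside [0, 1]")
```

Calling such a check right after each major step stops the analysis at the point where something went wrong, instead of letting an impossible value propagate into the final results.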



Code reviews: ask a peer to (critically) read your analysis code.






3. Plan for mistakes

4. Use tools that reduce risks


knitr/sweave Analysis & report in one step.

analysis.Rmd

A minimal R Markdown example

I know the value of pi is 3.1416, and 2 times pi is 6.2832. To compile me type:

library(knitr); knit("minimal.Rmd")

A paragraph here. A code chunk below:

1+1

## [1] 2

.4-.7+.3 # what? it is not zero!

## [1] 5.551e-17

Graphics work too

library(ggplot2)

qplot(speed, dist, data = cars) + geom_smooth()

[Plot: dist (y-axis) against speed (x-axis)]

Figure 1: A scatterplot of cars

& jupyter


If you need to make a "pipeline"

• Use "pipelining" software. E.g.:

• Snakemake

• Nextflow

• (etc)



Specific Approaches/Tools

1. Write code for humans
2. Organise mindfully
3. Plan for mistakes
4. Use tools that reduce risks

Bruno Vieira

Anurag Priyam

Ismail Moghul

Roddy Pracana

@bmpvieira @yeban @RoddyPracana

Joe Colgan

https://wurmlab.github.io