Reproducible Research in Bioinformatics

Keith A. Baggerly
Bioinformatics and Computational Biology
UT M. D. Anderson Cancer Center
[email protected]

SAMSI, Sep 10, 2014



GENOMIC SIGNATURES 1

Why is Reproducibility Important in Genomics? With “Big Data” In General?

Our intuition about what “makes sense” is very poor in high dimensions.

If we want to use patterns we find in the data to direct patient care, we need to know they’ve been assembled correctly.

If the results aren’t readily checkable, we may need to employ (lengthy!) forensic bioinformatics to reconstruct what was done.

For better or worse, we’ve done this a lot.


What are the Genes?: CAMDA 2002

[Figure: scatter plot of Component 2 against Component 1. Legend: Exp Kidney, Exp Liver, Exp Testis, Ref Kidney, Ref Liver, Ref Testis.]

There should be 4 clusters. Excel split two.


What are the Samples?: TCGA 2010

[Figure: Misfit Between Predicted and Observed Level 2 Values. Ovarian miR data, Level 1 to Level 2. x-axis: Sample Batch (09, 11, 12, 13, 14, 15, 17, 18, 19, 21, 22, 24); y-axis: Log2(Sum of Absolute Deviations).]

[Figure: Correlations between named and derived profiles. Correlations > 0.999 for 6 odd samples. Axes: Names Derived against Names Reported, for samples 636, 629, 633, 631, 628, 630.]

How were the data processed?
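A comparison of named and derived profiles, like the one above, can be sketched in a few lines. This is only an illustration, not the analysis used for the TCGA data: the function names, the toy numbers, and the reuse of 0.999 as a threshold are assumptions, and the sample names are borrowed from the plot purely as labels.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def find_label_mismatches(reported, derived, threshold=0.999):
    """Map each reported name to a differently named derived profile
    that it matches with correlation above `threshold`."""
    mismatches = {}
    for name, profile in reported.items():
        # Derived profile that correlates best with this reported profile.
        best = max(derived, key=lambda other: pearson(profile, derived[other]))
        if best != name and pearson(profile, derived[best]) > threshold:
            mismatches[name] = best
    return mismatches


# Toy data (made up): profiles 628 and 630 are swapped, 631 is correct.
reported = {"628": [1, 5, 2, 8], "630": [9, 1, 4, 3], "631": [2, 2, 7, 1]}
derived = {"628": [9, 1, 4, 3], "630": [1, 5, 2, 8], "631": [2, 2, 7, 1]}

mismatches = find_label_mismatches(reported, derived)
print(mismatches)
```

A sample is flagged only when its best match carries a different name, so correctly labeled samples drop out. Constant profiles would need separate handling, since their correlation is undefined.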


How was the Experiment Performed? Prot 2004

Machine breaking. All controls run first.
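Running all controls first means run order and group cannot be told apart. A minimal check for this kind of design problem, assuming a run sheet that records a group and a run day for each sample, might look like the following. The group labels, run days, and function names are hypothetical.

```python
def groups_by_batch(groups, batches):
    """Collect the set of experimental groups seen in each batch."""
    seen = {}
    for group, batch in zip(groups, batches):
        seen.setdefault(batch, set()).add(group)
    return seen


def fully_confounded(groups, batches):
    """True if there are at least two groups and no batch mixes them."""
    seen = groups_by_batch(groups, batches)
    return len(set(groups)) > 1 and all(len(g) == 1 for g in seen.values())


# Hypothetical run sheets: eight samples over four run days.
groups = ["control"] * 4 + ["case"] * 4
controls_first = [1, 1, 2, 2, 3, 3, 4, 4]  # all controls on days 1 and 2
interleaved = [1, 2, 3, 4, 1, 2, 3, 4]     # each day has one of each group

confounded_design = fully_confounded(groups, controls_first)
balanced_design = fully_confounded(groups, interleaved)
print(confounded_design, balanced_design)
```

This only detects complete separation of groups across batches. Partial imbalance would call for a cross-tabulation and a judgment about how much mixing is enough.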


Are we Learning? NYT 04, 08, Nat Rev Genet 10

Feb 4, 2004 Aug 26, 2008

It’s not just one assay.

Leek et al. (2010), Nat Rev Genet 11:733-9.

Batch effects are pervasive.


Is This an Isolated Problem?

Ioannidis et al. (2009), Nat. Gen., 41:149-55. Tested reproducibility of microarray papers. Could reproduce 2/18.

Begley and Ellis (2012), Nature, 483:531-3. Amgen attempted validation of clinical “breakthroughs” prior to further study. Validated 6/53.


Recent Links

Science, March 6, 2013
http://www.aaas.org/news/releases/2013/0311_alberts.shtml

Nature, April 24, 2013
http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852

Colbert Report, April 23, 2013
http://www.colbertnation.com/the-colbert-report-videos/425749/april-23-2013/austerity-s-spreadsheet-error---thomas-herndon

Nature, BMC Medicine, Oct 17, 2013
http://www.nature.com/nature/journal/v502/n7471/full/nature12564.html
http://www.biomedcentral.com/1741-7015/11/220


Everything Old is New Again...

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” RA Fisher, 1938. Sankhya 4, 14-17.

The most common mistakes are simple.

Confounding in the Experimental Design
Mixing up the sample labels
Mixing up the gene labels
Mixing up the group labels
(Most mixups involve simple switches or offsets)
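An offset of the kind mentioned in the list above can often be found by brute force. The sketch below assumes the reported labels can be compared position by position with a trusted reference ordering, such as an annotation file. The gene names, the `max_shift` bound, and the function name are made up for illustration.

```python
def best_offset(reported, reference, max_shift=3):
    """Return (shift, matches) for the shift that maximizes the number
    of positions i where reported[i] == reference[i + shift]."""
    best_shift, best_matches = 0, -1
    for shift in range(-max_shift, max_shift + 1):
        matches = sum(
            1
            for i, label in enumerate(reported)
            if 0 <= i + shift < len(reference) and reference[i + shift] == label
        )
        # Keep the first shift that reaches the highest match count.
        if matches > best_matches:
            best_shift, best_matches = shift, matches
    return best_shift, best_matches


# Hypothetical labels: the reported column has slipped by one row.
reference = ["g1", "g2", "g3", "g4", "g5", "g6"]
reported = ["g2", "g3", "g4", "g5", "g6", "unknown"]

shift, matches = best_offset(reported, reference)
print(shift, matches)
```

A best shift of zero with nearly all positions matching suggests the labels line up. A nonzero shift that explains most positions points to an off-by-k error rather than scattered mistakes.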

This simplicity is often hidden.

Incomplete documentation


So, How Do We Do It?

By fiat

Literate Programming

R

Sweave (R + Latex)

RStudio/knitr/markdown - very low barriers to entry!

Producing reports on a regular basis


Structuring Reports

“We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions.” RA Fisher, Statistical methods and scientific induction. JRSS-B, 17, 69-78, 1955.

Introduction
Data and Methods
Results
Conclusions
Appendices


Challenges

Reports vs Projects (Elizabeth!)

Freezes, Annotation builds

Sampling when Reproduction is Infeasible (Susan Holmes)

Data Volumes


Where Can We Learn More?

Free course at Coursera
https://www.coursera.org/course/repdata

Data Scientist’s Toolbox
https://www.coursera.org/course/datascitoolbox

Yihui Xie’s book on knitr, and his presentation to the LA R User Group from May 2014
https://www.youtube.com/watch?v=2yvW0O_7xOg

Christopher Gandrud’s book

GitHub