no boundary thinking in bioinformatics workshop keynote
TRANSCRIPT
The bounty of the commons.
Casey Greene @GreeneScientist
Image: William R Shepherd
It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.
- Attributed to Mark Twain
Everybody knows there are 4 subtypes of HGSC.
… everybody but Greg.
Tothill et al. Clinical Cancer Research. 2008
One hundred and seventy one tumors consistently segregated into one of the six k-means clusters. Most of the remaining tumors (80 of 114) could be further assigned to one of the molecular subsets by performing class prediction.
171 clustered cleanly
80 could be assigned 34 ???
12-40% unclear
The Cancer Genome Atlas, Nature. 2011
The silhouette width was computed to filter out expression profiles that were included in a subclass, but that were not robust representatives of the subclass. This resulted in the removal of 51 of 135 samples of the Differentiated subclass; 12 of 107 samples of the Immunoreactive subclass; 0 of 109 samples of the Mesenchymal subclass; and 13 of 138 samples of the Proliferative subclass..
Verhaak et al. JCI. 2013
Unified Bioinformatics Pipeline curatedOvarianData
Remove • <130 tumors • Custom array
technology
Clustering
Analyses
SAM
Overrepresented Pathways
Survival
Match Clusters
Dataset Inclusion Criteria
TCGA
Tothill
Yoshihara
Bonome
*Our group deposited 528 samples to GEO (GSE74357)
Mayo*
Keep • Histology • High Grade
Sample Inclusion Criteria
Gene Selection Criteria
Keep • 1500 MAD • Union
Remove • <130 tumors • Custom array
technology
Clustering
Analyses
SAM
Overrepresented Pathways
Survival
Match Clusters
Dataset Inclusion Criteria
TCGA
Tothill
Yoshihara
Bonome
*Our group deposited 528 samples to GEO (GSE74357)
Mayo*
Keep • Histology • High Grade
Sample Inclusion Criteria
Gene Selection Criteria
Keep • 1500 MAD • Union
Unified Bioinformatics Pipeline
Are HGSC subtypes consistent?
k-Means and NMF are consistent.
Cross-population analysis does not support four subtypes.
Why didn’t TCGA (2011) find this?
Why didn’t Konecny (2014) find this?
What about TCGA’s re-analysis of Tothill?
What if you re-analyze Tothill without LMP samples?
What if you re-analyze Tothill with LMP samples?
Comprehensive cross-population analysis of high-grade serous ovarian cancer supports no more than three subtypes bioRxiv: http://dx.doi.org/10.1101/030239 github: http://github.com/greenelab/hgsc_subtypes
Research is to see what everybody else has seen and to think what
nobody else has thought.�- Albert Szent-Györgyi
Image by J.W. McGuire/NIH
Image from You Don’t Know Jack. Vol 3.
If you showed 16,000 computers 10 million images from youtube, what would they see?
Le et al. 2012
• Pseudomonas aeruginosa compendium • > 100 different experiments • Many different labs
Analysis with Denoising Autoencoders of �Gene Expression (ADAGE)
Tan et al. Pac Sym Bio 2015; Tan et al. mSystems 2016.
The worst thing that can be said about your grant is that it's a fishing expedition.
�- Tom Blumenthal (@ 2016 Gold Lab Symposium)
�- Tom Blumenthal (@ 2016 Gold Lab Symposium)
If you engage in a fishing expedition you can actually catch fish.�
Image by Peter (stockispicts)
The Transcription Factor Anr Controls P.a. Response to Low O2
Low O2
O2
O2
O2
O2
O2 O2
O2 O2
O2
O2
O2
O2
O2
O2
O2 O2
O2
O2 O2
O2
O2
O2 O2
O2
O2
O2 O2 O2
O2 O2
O2
O2
O2
Anr
CF Lung Epithelium
High-weight genes
Node42 reflects Anr Activity
E−GEOD−17179
} wt
}}Δanr
Δdnr
E−GEOD−17296
}}}}}}
ΔanrΔroxSR
ΔanrΔroxSR
wt
wt
}}
EXP
STAT
O2
E−GEO
D−52445
O2
Node42 - Anr ActivityE−G
EOD
−33160
O2
A
B
−15 0 10Value
Color KeyColor Key
−10 0 10Value
Color Key
Value−10 0 10
−10 0 15
Color Key
Value
}}Δanr
wt
}}Δanr
wt }}Δanr
wt
−5 0 5
Color Key
Value
Color Key
Value−4 0 4
Color Key
Value−2 0 2
Microarray RNAseq PAO1
RNAseq J215
C
New Experiment Validates Node 42’s Low-O2 Signature
CF lung epithelial cells Jack Hammond
E−GEOD−17179
} wt
}}Δanr
Δdnr
E−GEOD−17296
}}}}}}
ΔanrΔroxSR
ΔanrΔroxSR
wt
wt
}}
EXP
STAT
O2
E−GEO
D−52445
O2
Node42 - Anr Activity
E−GEO
D−33160
O2
A
B
−15 0 10Value
Color KeyColor Key
−10 0 10Value
Color Key
Value−10 0 10
−10 0 15
Color Key
Value
}}Δanr
wt
}}Δanr
wt }}Δanr
wt
−5 0 5
Color Key
Value
Color Key
Value−4 0 4
Color Key
Value−2 0 2
Microarray RNAseq PAO1
RNAseq J215
CE−GEOD−17179
} wt
}}Δanr
Δdnr
E−GEOD−17296
}}}}}}
ΔanrΔroxSR
ΔanrΔroxSR
wt
wt
}}
EXP
STAT
O2
E−GEO
D−52445
O2
Node42 - Anr Activity
E−GEO
D−33160
O2
A
B
−15 0 10Value
Color KeyColor Key
−10 0 10Value
Color Key
Value−10 0 10
−10 0 15
Color Key
Value
}}Δanr
wt
}}Δanr
wt }}Δanr
wt
−5 0 5
Color Key
Value
Color Key
Value−4 0 4
Color Key
Value−2 0 2
Microarray RNAseq PAO1
RNAseq J215
C
Cross-platform normalization of microarray and RNA-seq data for machine learning applications
Thompson, Tan, Greene. PeerJ. Jeff Thompson
Cross-platform normalization of microarray and RNA-seq data for machine learning applications
Thompson, Tan, Greene. PeerJ. 2016
New Experiment Validates Node 42’s Low-O2 Signature
CF lung epithelial cells Jack Hammond
E−GEOD−17179
} wt
}}Δanr
Δdnr
E−GEOD−17296
}}}}}}
ΔanrΔroxSR
ΔanrΔroxSR
wt
wt
}}
EXP
STAT
O2
E−GEO
D−52445
O2
Node42 - Anr Activity
E−GEO
D−33160
O2
A
B
−15 0 10Value
Color KeyColor Key
−10 0 10Value
Color Key
Value−10 0 10
−10 0 15
Color Key
Value
}}Δanr
wt
}}Δanr
wt }}Δanr
wt
−5 0 5
Color Key
Value
Color Key
Value−4 0 4
Color Key
Value−2 0 2
Microarray RNAseq PAO1
RNAseq J215
CE−GEOD−17179
} wt
}}Δanr
Δdnr
E−GEOD−17296
}}}}}}
ΔanrΔroxSR
ΔanrΔroxSR
wt
wt
}}
EXP
STAT
O2
E−GEO
D−52445
O2
Node42 - Anr Activity
E−GEO
D−33160
O2
A
B
−15 0 10Value
Color KeyColor Key
−10 0 10Value
Color Key
Value−10 0 10
−10 0 15
Color Key
Value
}}Δanr
wt
}}Δanr
wt }}Δanr
wt
−5 0 5
Color Key
Value
Color Key
Value−4 0 4
Color Key
Value−2 0 2
Microarray RNAseq PAO1
RNAseq J215
C
E−GEOD−17179
} wt
}}Δanr
Δdnr
E−GEOD−17296
}}}}}}
ΔanrΔroxSR
ΔanrΔroxSR
wt
wt
}}
EXP
STAT
O2
E−GEO
D−52445
O2
Node42 - Anr Activity
E−GEO
D−33160
O2
A
B
−15 0 10Value
Color KeyColor Key
−10 0 10Value
Color Key
Value−10 0 10
−10 0 15
Color Key
Value
}}Δanr
wt
}}Δanr
wt }}Δanr
wt
−5 0 5
Color Key
Value
Color Key
Value−4 0 4
Color Key
Value−2 0 2
Microarray RNAseq PAO1
RNAseq J215
C
E−GEOD−17179
} wt
}}Δanr
Δdnr
E−GEOD−17296
}}}}}}
ΔanrΔroxSR
ΔanrΔroxSR
wt
wt
}}
EXP
STAT
O2
E−GEO
D−52445
O2
Node42 - Anr Activity
E−GEO
D−33160
O2
A
B
−15 0 10Value
Color KeyColor Key
−10 0 10Value
Color Key
Value−10 0 10
−10 0 15
Color Key
Value
}}Δanr
wt
}}Δanr
wt }}Δanr
wt
−5 0 5
Color Key
Value
Color Key
Value−4 0 4
Color Key
Value−2 0 2
Microarray RNAseq PAO1
RNAseq J215
C
E−GEOD−17179
} wt
}}Δanr
Δdnr
E−GEOD−17296
}}}}}}
ΔanrΔroxSR
ΔanrΔroxSR
wt
wt
}}
EXP
STAT
O2
E−GEO
D−52445
O2
Node42 - Anr Activity
E−GEO
D−33160
O2
A
B
−15 0 10Value
Color KeyColor Key
−10 0 10Value
Color Key
Value−10 0 10
−10 0 15
Color Key
Value
}}Δanr
wt
}}Δanr
wt }}Δanr
wt
−5 0 5
Color Key
Value
Color Key
Value−4 0 4
Color Key
Value−2 0 2
Microarray RNAseq PAO1
RNAseq J215
C
ADAGE analysis of publicly available gene expression data collections illuminates Pseudomonas aeruginosa-host interactions��bioRxiv: http://dx.doi.org/10.1101/030650�github: http://github.com/greenelab/adage�Tan, Hammond, Hogan, and Greene. mSystems. 2016
I didn’t want to just know the names of things. I remember really wanting to know how it all worked.�- Elizabeth Blackburn
Image: US Embassy Sweden
Activity volcano plot
Fold Change
-log(p)
Pareto activity selection
Fold Change
-log(p)
volcano + networks =�pathway-style analysis
We can measure gene-gene similarity with ADAGE weights.
… …
ADAGE similarity captures functional similarity.
Tan et al. in prep.
ADAGE-based pathway analysis reveals underlying mechanisms
Tan et al. in prep.
Where do we have (enough) data?
Greene et al. Pac Sym Bio. 2016
ADAGE webserver coming soon! http://www.greenelab.com/webservers
Open science can be hypercollaborative.
Open Science != Reproducibility
Computational biology should be reproducible:
• Accurately
• Quickly
• Without contacting original authors
Continuous Analysis
Continuous Analysis
Cloud computing allows “Continuous Analysis” at scale.
Research Cluster
Semi-Supervised Learning of the Electronic Health Record for Phenotype Stratification �bioRxiv: http://dx.doi.org/10.1101/039800�github: http://github.com/greenelab/DAPS
Reproducible computational workflows with continuous analysis bioRxiv: http://dx.doi.org/10.1101/056473 github: http://github.com/greenelab/continuous_analysis
Research Parasite Awards (The “Parasites”)
Selection criteria for the work in question: • The awardee must not have been involved the design of the experiments
that generated the data.
• The awardee published independently of the original investigators, and the original investigators are not authors of the secondary analyses but are appropriately credited in the manuscripts.
• The awardee may have extended, replicated or disproved what the original investigators had posited.
• The awardee has provided source code and intermediate or final results in a manner that enhances reproducibility.
Research Parasite Awards (The “Parasites”)
Additional selection criteria for the Junior Parasite award: • The awardee must have published the work at the training stage of their
career (postdoctoral, graduate, or undergraduate). If the awardee has assumed a position as an independent investigator she or he should not have been in that position for more than 2 years.
• The award will be based on work described in a single manuscript (submitted alongside the nomination letter).
Research Parasite Awards (The “Parasites”)
Additional selection criteria for the Sustained Parasitism award: • The awardee must be in an independent investigator position in academia,
industry or public sector.
• The awardee must be a last or corresponding author on the three manuscripts submitted alongside the nomination letter.
• At least a five-year period must have elapsed between the publication of the first manuscript and the final manuscript.
Details • Submit by October 14, 2016 @ 5PM HST • Additional Instructions at�
http://greenelab.com/parasite-award • COI rules are strict! I can only talk about rules.
2017 Selection Committee:
Greene Lab: Jaclyn Taroni (Postdoc) Daniel Himmelstein (Postdoc) Jie Tan (Grad Student) Gregory Way (Grad Student) Brett Beaulieu-Jones (Grad Student) Amy Campbell (Postbacc) René Zelaya (Programmer) Matt Huyck (Programmer) Dongbo Hu (Programmer) Kathy Chen (Undergrad) Mulin Xiong (Undergrad) Tim Chang (Undergrad) Roshan Ravishankar (Undergrad)
Collaborators: Deb Hogan & Jack Hammond
Data: All investigators who publicly release their gene expression data.
Images: Artists who release their work under a Creative Commons license.
Funding: Gordon and Betty Moore Foundation National Science Foundation Cystic Fibrosis Foundation National Institutes of Health �
Find us online: http://www.greenelab.com Twitter: @GreeneScientist Calvin and Hobbes. Bill Watterson