Computational challenges in biological data science: an optimistically cautionary tale
Community challenges in biological data science: an optimistically cautionary tale
Genomics
Assemblathon
CAMI
Sequence Squeeze
Structural Biology
CASP
CAPRI
Fold.it
Systems biology
sbv-IMPROVER
Function
CAFA
Text mining
BioCREATIVE
CACAO
Data providers
Programmers
Steering committee
Assessors
Assemble the Teams
DREAM
Prepare Challenge
Computational scientists enjoy developing new methods, and the community encourages them to do so. However, it is often hard to know which method to choose: which method is best? Moreover, what does “best” mean?
To help choose an appropriate method for a particular task, scientists often form community-based challenges for the unbiased evaluation of methods in a given field. These challenges help evaluate existing and novel methods, while helping to coalesce a community and leading to new ideas and collaborations.
• Use more than one metric
• Avoid redundant analyses
Identify social media, mailing lists, & other communication venues of your community
Publish a flagship paper & specialized “satellite” papers to maximize impact & credit
Agree on co-authorship & credits before challenge
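The advice to use more than one metric can be illustrated with a minimal sketch. The data and the three method names below are invented for illustration only and are not taken from any real challenge; the point is simply that the "best" method can change depending on the metric chosen.

```python
# Hypothetical data: a set of true annotations and three invented methods.
truth = {"A", "B", "C", "D"}
predictions = {
    "method_1": {"A", "B"},                                # cautious: few calls, all correct
    "method_2": {"A", "B", "C", "D", "E", "F", "G", "H"},  # liberal: finds everything, many false calls
    "method_3": {"A", "B", "C", "E"},                      # balanced
}

def precision(pred, true):
    """Fraction of predicted items that are correct."""
    return len(pred & true) / len(pred) if pred else 0.0

def recall(pred, true):
    """Fraction of true items that were predicted."""
    return len(pred & true) / len(true) if true else 0.0

def f1(pred, true):
    """Harmonic mean of precision and recall."""
    p, r = precision(pred, true), recall(pred, true)
    return 2 * p * r / (p + r) if p + r else 0.0

def rank(metric):
    """Method names ordered from best to worst under the given metric."""
    return sorted(predictions, key=lambda m: metric(predictions[m], truth), reverse=True)

metrics = {"precision": precision, "recall": recall, "f1": f1}
rankings = {name: rank(fn) for name, fn in metrics.items()}

for name, order in rankings.items():
    print(f"{name}: {order}")
```

In this toy case each metric puts a different method in first place, which is one reason an assessment that reports a single metric may give a misleading picture.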
Have a data sharing plan
Data & software sharing policy should be accepted by all
Lots of work for everyone: understand commitment
Advertise challenge
Run challenge
Analyze & score
Maintain database, website & code
Hold a conference
Code & website
Set challenge rules
Identifying a community
Choosing a clear question
Start by...
The Challenge
Publish
Seek funding
Nurture the Challenge and your Community
Methods may overfit to win at challenge metrics rather than to solve real-life problems
Risk-taking may be discouraged: “surefire” incremental additions to existing methods may be favored over novel development
Methods improve due to challenges
Communities form, expand, and become more cohesive
Classroom educational value (CACAO, CAFA)
Citizen science value (Fold.it)
Data
Create incentives
Challenge | Goal | Notes
CASP | Predicting protein structure from sequence | Long running. Improved and quantified structure prediction methods
CAPRI | Protein-protein interaction |
Fold.it | Protein structure energy minima prediction game | Improved protein structure prediction via gamification
Assemblathon | Genomic DNA sequence assembly | A better understanding of assembly metrics and species-specific considerations
CAMI | Metagenomic DNA assembly and analysis |
sbv-IMPROVER | Systems biology and precision medicine |
CAMDA | Large-scale systems biology problems |
Sequence Squeeze | Sequence data compression | Finding ways to efficiently store large volumes of sequence data
DREAM | Framework for many systems biology challenges | Provides a framework and easy setup for many challenges
CACAO | Improve function annotations in UniProt | Competition between student teams teaches biocuration to college students
CAGI | Predicting phenotypes from genetic variants | Strong wetlab engagement
CAFA | Predicting protein function | Improve function prediction methods
BioCreative | Evaluating text mining and information extraction systems |
Funders may be tempted to judge methods primarily by challenge performance
Iddo Friedberg, Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA
Some Community Challenges in Life Science
Saez-Rodriguez J, Costello JC, Friend SH et al. Crowdsourcing biomedical research: leveraging communities as innovation engines (2016) Nature Reviews Genetics 17, 470–486
Friedberg I, Mooney SD and Radivojac P. Ten Simple Rules for a Community Computational Challenge (2015) PLoS Computational Biology 11(4): e1004150
Further Reading
CAGI
CAMDA