Games for improving human phenotype prediction
ABSTRACT
An important goal of biomedical research is to produce genetic and genomic predictors for human phenotypes such as disease prognosis or drug response. To this end, we can now quantify an extremely large number of potential biomarkers in any biological sample: a single sample could reasonably be described by millions of molecular variations in DNA, RNA, proteins, and metabolites. The number of samples processed, however, typically remains small in comparison. As a result, attempts to build predictors from these data often suffer from overfitting: a predictive pattern may describe the training data very well yet fail to reproduce on other datasets.
It has recently been shown that biological knowledge in the form of gene annotations and pathway databases can be used to guide the process of inferring phenotype predictors [1-3]. While promising, such methods are limited by the amount, quality and problem-specific applicability of the structured knowledge that is available.
Following games that have recently succeeded as a means of ‘crowdsourcing’ difficult biological problems [4,5], we are developing games with the purpose of improving human phenotype predictions. Our games work on two levels: (1) games such as Dizeez and GenESP collect novel gene annotations, and (2) games like Combo engage players directly in the process of predictor inference.
Play game prototypes at: http://www.genegames.org
Benjamin M Good, Salvatore Loguercio, Andrew I Su
The Scripps Research Institute, La Jolla, California, USA
CONTACT
Benjamin Good: [email protected]
Salvatore Loguercio: [email protected]
Andrew Su: [email protected]
FUNDING
We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924) and the NIH through the FaceBase Consortium for a particular emphasis on craniofacial genes (DE-20057).
REFERENCES
1. Dutkowski and Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology.
2. Winter et al. (2012) Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes. PLoS Computational Biology.
3. Liu et al. (2012) Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics.
4. Good and Su (2011) Games with a Scientific Purpose. Genome Biology.
5. Kawrykow et al. (2012) Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment. PLoS One.
Guess what genes your partner is thinking about when they see ‘neuroblastoma’.
http://www.genegames.org
A game board
GenESP: gene – concept association with a partner
Game Objectives
Dizeez: gene – disease annotation quiz
Top associations provided four or more times and not found in OMIM/PharmGKB.
Even after limited play, Dizeez identified several novel gene-disease annotations.
Preliminary Results
713 games, 180 players.
Overall: 4,585 unique gene-disease assertions.
224 assertions provided more than once and not found in OMIM/PharmGKB.
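The filtering behind these counts can be sketched roughly as follows. This is a minimal illustration, not the poster's actual pipeline: the function name `novel_assertions`, the gene and disease names, and the `known_pairs` set (standing in for a lookup against OMIM/PharmGKB) are all hypothetical, and the sketch assumes each guess simply adds one unit of weight to its gene-disease link.

```python
from collections import Counter

def novel_assertions(guesses, known_pairs, min_count=2):
    """Return gene-disease pairs asserted at least min_count times
    that are absent from the set of already-known pairs."""
    # Every guess adds one unit of weight to its (gene, disease) link.
    weights = Counter(guesses)
    return {
        pair: count
        for pair, count in weights.items()
        if count >= min_count and pair not in known_pairs
    }

# Hypothetical guesses collected from players (placeholder names).
guesses = [
    ("GENE_A", "disease X"),
    ("GENE_A", "disease X"),
    ("GENE_B", "disease Y"),
    ("GENE_C", "disease Z"),
    ("GENE_C", "disease Z"),
]
# Hypothetical stand-in for pairs already present in a reference database.
known_pairs = {("GENE_C", "disease Z")}

candidates = novel_assertions(guesses, known_pairs)
print(candidates)
```

Setting `min_count=4` would correspond to the stricter "four or more times" cutoff used for the top associations.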
Improvements compared to Dizeez:
• Reward new, useful annotations with points
• Add social interaction
• Enable gene-gene, gene-disease, gene-function games on the same platform
• Increase scalability of annotation collection (does not depend on a database of ‘right’ answers)
[Figure: diagram with nodes labeled Community, gene, gene pathway, and Phenotype]
• Capture general community knowledge in a useful structure
• Concentrate community knowledge and reasoning around predicting a particular phenotype
Combo: feature selection with community intelligence
Human Guided Forest
An ensemble classifier whose components are decision trees constructed from manually selected subsets of features. Adapted from Network-Guided Forests and Random Forests [1,2].
Select the disease related to the clue gene. Guess as many as you can in one minute.
Every guess adds weight to a link between a gene and a disease.
[Figure: class labels Phenotype 1 and Phenotype 2]
A hand
• Goal: pick the best set of genes
• Best: the gene set that produces the best decision tree classifier
• Classifier: created using training data and selected genes, used to predict phenotype (e.g. breast cancer prognosis)
Inferred decision tree
Game Score: determined by estimating performance of trees constructed using the selected features on training data.
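One way such a score could be computed is sketched below. The poster does not say how performance is estimated, so the k-fold cross-validation here is an assumption, as are the tiny Gini-based tree learner, the names `build_tree`, `predict` and `hand_score`, and the toy data. Features are column indices, and class labels must not be tuples, because tuples are used for internal tree nodes.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, features, depth=0, max_depth=3):
    """Grow a small binary tree that splits only on the given feature indices."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf: predict the majority class
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            impurity = (len(left) * gini([labels[i] for i in left])
                        + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or impurity < best[0]:
                best = (impurity, f, t, left, right)
    if best is None:
        return majority  # no usable split among the selected features
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       features, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       features, depth + 1, max_depth))

def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def hand_score(rows, labels, hand, folds=3):
    """Percent correct, estimated by k-fold cross-validation (an assumption)."""
    correct = 0
    for k in range(folds):
        train = [i for i in range(len(rows)) if i % folds != k]
        held_out = [i for i in range(len(rows)) if i % folds == k]
        tree = build_tree([rows[i] for i in train], [labels[i] for i in train], hand)
        correct += sum(predict(tree, rows[i]) == labels[i] for i in held_out)
    return round(100 * correct / len(rows))

# Toy training data: column 0 separates the classes, column 1 does not.
rows = [[1.0, 5.0], [2.0, 1.0], [1.5, 4.0], [8.0, 2.0], [9.0, 5.5], [7.5, 0.5]]
labels = ["good", "good", "good", "poor", "poor", "poor"]
print(hand_score(rows, labels, hand=[0]), hand_score(rows, labels, hand=[1]))
```

Scoring on held-out folds rather than on the same samples used to grow the tree is a design choice made here to keep the score from rewarding overfitted hands.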
Score: 78 (percent correct)
Feature sets from many individual games are used to create a Decision Tree Forest classifier. (Each tree votes once.)
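The one-vote-per-tree rule amounts to simple majority voting, which might look like the sketch below. It is illustrative only: `forest_predict` is a hypothetical name, the three threshold rules are stand-ins for trees inferred from players' hands, and the gene names and expression values are placeholders.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Majority vote: each tree casts exactly one vote for a class label."""
    votes = Counter(tree(sample) for tree in trees)
    # On a tie, the label that received its first vote earliest wins.
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trees built from three different hands.
trees = [
    lambda s: "poor" if s["GENE_A"] > 2.0 else "good",
    lambda s: "poor" if s["GENE_B"] > 1.0 else "good",
    lambda s: "poor" if s["GENE_C"] < 0.5 else "good",
]

# Hypothetical expression values for one sample.
sample = {"GENE_A": 3.1, "GENE_B": 0.4, "GENE_C": 0.2}
print(forest_predict(trees, sample))
```

Using an odd number of trees avoids ties in a two-class problem.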