Games for improving human phenotype prediction


Upload: goodb

Post on 10-May-2015


ABSTRACT

An important goal for biomedical research is to produce genetic and genomic predictors for human phenotypes such as disease prognosis or drug response. To this end, we can now quantify an extremely large number of potential biomarkers for any biological sample. In fact, a single sample could reasonably be described by millions of molecular variations in DNA, RNA, proteins, and metabolites. However, the actual number of samples processed typically remains small in comparison. As a result, attempts to use this data to build predictors often face problems of overfitting. (While a predictive pattern may describe training data very well, it may not reproduce well on other datasets.)
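The overfitting problem described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the binary "markers", the sample sizes, and the function names (`best_single_marker`, `accuracy`, `simulate`) are invented for illustration and are not taken from the poster. Every marker is an independent coin flip, so none is truly predictive, yet with many more markers than samples one of them will usually fit the training labels well by chance.

```python
import random

def best_single_marker(samples, labels):
    """Pick the binary marker (optionally inverted) that best fits the labels."""
    best = (0, False, 0.0)  # (marker index, inverted?, training accuracy)
    for j in range(len(samples[0])):
        agree = sum(1 for row, y in zip(samples, labels) if row[j] == y)
        acc = agree / len(labels)
        inverted = acc < 0.5  # a marker that is usually wrong is useful once flipped
        acc = max(acc, 1.0 - acc)
        if acc > best[2]:
            best = (j, inverted, acc)
    return best

def accuracy(samples, labels, marker, inverted):
    """Fraction of samples whose label the chosen marker predicts correctly."""
    hits = sum(1 for row, y in zip(samples, labels) if (row[marker] != y) == inverted)
    return hits / len(labels)

def simulate(n_train=20, n_test=400, n_markers=2000, seed=0):
    """Markers and phenotypes are independent coin flips (hypothetical data)."""
    rng = random.Random(seed)

    def draw(n):
        xs = [[int(rng.random() < 0.5) for _ in range(n_markers)] for _ in range(n)]
        ys = [int(rng.random() < 0.5) for _ in range(n)]
        return xs, ys

    train_x, train_y = draw(n_train)
    test_x, test_y = draw(n_test)
    marker, inverted, train_acc = best_single_marker(train_x, train_y)
    return train_acc, accuracy(test_x, test_y, marker, inverted)

if __name__ == "__main__":
    train_acc, test_acc = simulate()
    print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

In this setup, the selected marker should look strong on the 20 training samples and fall back toward chance on the held-out samples.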

It has recently been shown that biological knowledge in the form of gene annotations and pathway databases can be used to guide the process of inferring phenotype predictors [1-3]. While promising, such methods are limited by the amount, quality and problem-specific applicability of the structured knowledge that is available.

Following in the line of games that have recently demonstrated success as a means of ‘crowdsourcing’ difficult biological problems [4,5], we are developing games with the purpose of improving human phenotype predictions. Our games work on two levels: (1) games such as Dizeez and GenESP collect novel gene annotations and (2) games like Combo engage players directly in the process of predictor inference.

Play game prototypes at: http://www.genegames.org

Benjamin M Good, Salvatore Loguercio, Andrew I Su

The Scripps Research Institute, La Jolla, California, USA

CONTACT

Benjamin Good: bgood@scripps.edu
Salvatore Loguercio: loguerci@scripps.edu
Andrew Su: [email protected]

FUNDING

We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924) and the NIH through the FaceBase Consortium for a particular emphasis on craniofacial genes (DE-20057).

REFERENCES

1. Dutkowski and Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology

2. Winter et al (2012) Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes. PLoS Computational Biology

3. Liu et al (2012) Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics

4. Good and Su (2011) Games with a Scientific Purpose. Genome Biology

5. Kawrykow et al (2012) Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment. PLoS One

Game Objectives

• Capture general community knowledge in a useful structure
• Concentrate community knowledge and reasoning around predicting a particular phenotype

[Diagram labels: Community, Phenotype, gene, pathway]

Dizeez: gene – disease annotation quiz

Select the disease related to the clue gene. Guess as many as you can in one minute.

Every guess adds weight to a link between a gene and a disease.

Preliminary Results

713 games, 180 players

Overall: 4,585 unique gene-disease assertions.

224 assertions provided more than once and not found in OMIM/PharmGKB.

Even after limited game playing, the Dizeez game resulted in the identification of several novel gene-disease annotations.

Top associations provided four or more times and not found in OMIM/PharmGKB.

GenESP: gene – concept association with a partner

Guess what genes your partner is thinking about when they see ‘neuroblastoma’

Improvements compared to Dizeez:
• Reward new, useful annotations with points
• Add social interaction
• Enable gene-gene, gene-disease, gene-function games on the same platform
• Increase scalability of annotation collection (does not depend on a database of ‘right’ answers)

Combo: feature selection with community intelligence

[Figure label: A game board]

Human Guided Forest

Ensemble classifier where components are decision trees constructed using manually selected subsets of features. Adaptation of Network Guided and Random Forests [1,2].

[Figure labels: Phenotype 1, Phenotype 2, A hand]

• Goal: pick the best set of genes
• Best: the gene set that produces the best decision tree classifier
• Classifier: created using training data and selected genes, used to predict phenotype (e.g. breast cancer prognosis)

[Figure label: Inferred decision tree]

Game Score: determined by estimating performance of trees constructed using the selected features on training data.

Score: 78 (percent correct)
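One way such a score could be computed is sketched below. The poster does not say how performance is estimated, so the 3-fold cross-validation, the depth limit of 2, the Gini split criterion, the gene names (`GENE_A`, `GENE_B`), and the toy expression values are all assumptions made for illustration. The tree may only split on genes in the player's hand, and the score is the percentage of held-out training samples it classifies correctly.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, hand, depth=2):
    """Grow a small decision tree that may only split on genes in the hand."""
    if depth == 0 or len(set(labels)) == 1:
        return majority(labels)
    best = None
    for gene in hand:
        values = sorted({row[gene] for row in rows})
        for low, high in zip(values, values[1:]):
            cut = (low + high) / 2  # midpoint between neighbouring values
            left = [y for row, y in zip(rows, labels) if row[gene] <= cut]
            right = [y for row, y in zip(rows, labels) if row[gene] > cut]
            cost = len(left) * gini(left) + len(right) * gini(right)
            if best is None or cost < best[0]:
                best = (cost, gene, cut)
    if best is None:  # no gene in the hand separates these samples
        return majority(labels)
    _, gene, cut = best
    below = [(r, y) for r, y in zip(rows, labels) if r[gene] <= cut]
    above = [(r, y) for r, y in zip(rows, labels) if r[gene] > cut]
    return (gene, cut,
            grow_tree([r for r, _ in below], [y for _, y in below], hand, depth - 1),
            grow_tree([r for r, _ in above], [y for _, y in above], hand, depth - 1))

def predict(tree, row):
    while isinstance(tree, tuple):  # inner nodes are tuples, leaves are labels
        gene, cut, below, above = tree
        tree = below if row[gene] <= cut else above
    return tree

def hand_score(rows, labels, hand, folds=3):
    """Percent of held-out training samples predicted correctly (assumed scoring rule)."""
    correct = 0
    for k in range(folds):
        held_out = set(range(k, len(rows), folds))
        train = [(r, y) for i, (r, y) in enumerate(zip(rows, labels)) if i not in held_out]
        tree = grow_tree([r for r, _ in train], [y for _, y in train], hand)
        correct += sum(predict(tree, rows[i]) == labels[i] for i in held_out)
    return round(100 * correct / len(rows))

# Hypothetical training data: GENE_A tracks prognosis, GENE_B does not.
rows = [
    {"GENE_A": 0.9, "GENE_B": 0.2}, {"GENE_A": 0.8, "GENE_B": 0.7},
    {"GENE_A": 0.7, "GENE_B": 0.4}, {"GENE_A": 0.2, "GENE_B": 0.3},
    {"GENE_A": 0.1, "GENE_B": 0.6}, {"GENE_A": 0.3, "GENE_B": 0.5},
]
labels = ["poor", "poor", "poor", "good", "good", "good"]
```

With this toy data, `hand_score(rows, labels, ["GENE_A"])` should exceed `hand_score(rows, labels, ["GENE_B"])`, which is the kind of feedback that would reward a player for picking an informative gene.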

Feature sets from many individual games used to create a Decision Tree Forest classifier. (Each tree votes once.)
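The voting scheme could look roughly like the following sketch. It is not the authors' implementation: for brevity each component is a one-split decision tree (a stump) rather than a full tree, and the gene names, expression values, hands, and function names are hypothetical. Each hand collected from a game yields one tree, and each tree casts exactly one vote.

```python
from collections import Counter

def fit_stump(rows, labels, hand):
    """Fit a one-split decision tree restricted to the genes in one hand."""
    best = None
    for gene in hand:
        values = sorted({row[gene] for row in rows})
        for low, high in zip(values, values[1:]):
            cut = (low + high) / 2
            below = Counter(y for row, y in zip(rows, labels) if row[gene] <= cut)
            above = Counter(y for row, y in zip(rows, labels) if row[gene] > cut)
            # Training samples that the two leaf majorities would get wrong.
            errors = len(rows) - max(below.values()) - max(above.values())
            if best is None or errors < best[0]:
                best = (errors, gene, cut,
                        below.most_common(1)[0][0], above.most_common(1)[0][0])
    if best is None:
        raise ValueError("no gene in the hand varies across the samples")
    return best[1:]

def stump_predict(stump, row):
    gene, cut, label_below, label_above = stump
    return label_below if row[gene] <= cut else label_above

def fit_forest(rows, labels, hands):
    """Build one tree per hand submitted by players."""
    return [fit_stump(rows, labels, hand) for hand in hands]

def forest_predict(forest, row):
    """Each tree votes once; the most common vote wins."""
    votes = Counter(stump_predict(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training data and hands from three separate games.
rows = [
    {"GENE_A": 5.1, "GENE_B": 0.4, "GENE_C": 2.0},
    {"GENE_A": 4.8, "GENE_B": 0.6, "GENE_C": 3.1},
    {"GENE_A": 1.2, "GENE_B": 2.9, "GENE_C": 2.2},
    {"GENE_A": 0.9, "GENE_B": 3.3, "GENE_C": 3.0},
]
labels = ["poor", "poor", "good", "good"]
hands = [["GENE_A", "GENE_C"], ["GENE_B"], ["GENE_C"]]
forest = fit_forest(rows, labels, hands)
```

Calling `forest_predict(forest, sample)` on a new sample returns the majority label. A hand built only on an uninformative gene can be outvoted by the others, which is one reason to pool feature sets from many games.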