wei wei - application of naive bayes model

University of Pittsburgh Department of Biomedical Informatics

The Application of Naive Bayes Model Averaging to Predict Alzheimer’s Disease from Genome-Wide Data

Wei Wei, Shyam Visweswaran and Gregory F. Cooper

Motivation: Develop methods for using genome-wide information about an individual to inform clinical care

Background

• Genome-wide association studies (GWASs)
• Single-nucleotide polymorphism (SNP)
• High-throughput genotyping technologies

• Alzheimer’s disease (AD):
  • AD afflicts about 10% of persons over 65 and almost half of those over 85
  • ~5.5 million cases currently in U.S.
  • 95% of all AD cases are Late-Onset AD (LOAD)

Background

• Source: TGEN dataset by Reiman et al.*

• Cases
  • 1411 individuals
  • 861 LOAD and 550 controls

• SNPs
  • 312,316 SNPs
  • Two additional SNPs (rs429358 and rs7412) genotyped separately (these determine APOE status)

____________________________________________________________________
* Reiman E, Webster J, Myers A, Hardy J, Dunckley T, Zismann V, et al. GAB2 alleles modify Alzheimer's risk in APOE epsilon4 carriers. Neuron. 2007;54(5):713-20.

Background

• Bayesian Model Averaging
  • Represents uncertainty about the correctness of any given model
  • Performs inference by weighting the prediction of each model by the posterior probability of that model given the training data

• Model-Averaged Naive Bayes (MANB)
  • MANB efficiently averages over all naive Bayes models (on a given set of variables) in making a prediction for an individual patient case

Methods: Naive Bayes (NB)

[Diagram: naive Bayes network with an arc from LOAD to each of SNP 1, SNP 2, SNP 3, …, SNP 312,318]
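As an illustration of how a naive Bayes model of this shape turns genotypes into a probability of LOAD, here is a minimal sketch. The function name, the table layout, and the 0/1/2 genotype coding are assumptions made for the example, not details from the slides. The sum is kept in log-odds form because a product over hundreds of thousands of SNPs would underflow.

```python
import math

def nb_posterior(prior_load, cond_probs, genotypes):
    """Return P(LOAD = 1 | genotypes) for a naive Bayes model.

    cond_probs[i][c][g] is P(SNP_i = g | LOAD = c); this layout is hypothetical.
    """
    log_odds = math.log(prior_load) - math.log(1.0 - prior_load)
    for table, g in zip(cond_probs, genotypes):
        # Each SNP contributes its log likelihood ratio independently.
        log_odds += math.log(table[1][g]) - math.log(table[0][g])
    # Numerically stable logistic to turn log-odds into a probability.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)

# Two toy SNPs with genotypes coded 0/1/2 (assumed coding).
tables = [
    {0: [0.6, 0.3, 0.1], 1: [0.3, 0.4, 0.3]},
    {0: [0.5, 0.4, 0.1], 1: [0.5, 0.4, 0.1]},  # uninformative SNP
]
p = nb_posterior(0.5, tables, [2, 1])
```

In this toy case only the first SNP carries evidence (a likelihood ratio of 3), so the posterior is 0.75.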

Methods: Feature Selection Naive Bayes (FSNB)

Perform feature selection using a greedy, forward-stepping search that optimizes the prediction of LOAD

[Diagram: naive Bayes network with arcs from LOAD to the selected SNPs only, for example SNP 1,100, SNP 25,920, SNP 104,582 and SNP 276,455]
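A greedy, forward-stepping search of the kind described above can be sketched as follows. The slides do not state which score FSNB optimizes or when it stops, so `score` is a hypothetical callable (higher is better) and the stopping rule (stop when no single addition improves the score) is an assumption.

```python
def greedy_forward_selection(n_features, score, max_features=None):
    """Repeatedly add the single feature that most improves score(subset).

    `score` is a hypothetical callable taking a tuple of feature indices.
    Returns the selected indices and the final score.
    """
    selected = []
    best = score(tuple(selected))
    while max_features is None or len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Evaluate every one-feature extension of the current subset.
        top_score, top_f = max((score(tuple(selected + [f])), f) for f in candidates)
        if top_score <= best:
            break  # no single addition helps, so stop
        selected.append(top_f)
        best = top_score
    return selected, best

# Toy score: features 1 and 3 are useful, and every feature costs 0.1.
def toy_score(subset):
    useful = {1: 1.0, 3: 0.5}
    return sum(useful.get(f, 0.0) for f in subset) - 0.1 * len(subset)

chosen, final = greedy_forward_selection(5, toy_score)
```

Each step rescans all remaining features, so with several hundred thousand SNPs even a model with only a few features requires a large number of score evaluations.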

Methods: Model-Averaged Naive Bayes (MANB)

[Diagram: LOAD with candidate arcs to SNP 1, SNP 2, …, SNP 312,318]

Methods: MANB

[Diagram: Model 1, …, Model i, …, Model 2^312,318]

$$P(\mathrm{LOAD} \mid Ev) = \sum_{i=1}^{2^{312{,}318}} P(\mathrm{LOAD} \mid Ev, \mathrm{model}_i)\, P(\mathrm{model}_i \mid \text{training data})$$

Methods: MANB

• We can take advantage of the conditional independence relationships in NB models to make it efficient to model average over all of those models.

• The computational “trick” is as follows*

• For each SNP_i we construct a model-averaged conditional probability, P_MANB(SNP_i | LOAD), by averaging over whether or not there is an arc from LOAD to SNP_i. This step can be viewed as a “soft” form of feature selection.

____________________________________________________________________
* Dash D, Cooper G. Exact model averaging with naive Bayesian classifiers. International Conference on Machine Learning (2002) 91-98.

Methods: MANB

• We use these model-averaged conditional probabilities to define a new NB model M, over which we then perform NB inference.

• Performing inference with M is the same as model averaging over the exponential number of NB models discussed previously.
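The equivalence between inference with M and explicit model averaging can be checked numerically on a tiny problem. The sketch below is an illustration and not the implementation from the paper's appendix. It assumes genotypes coded 0/1/2, a binary LOAD label, uniform Dirichlet parameter priors, and a toy arc probability of 0.2. All function names are hypothetical.

```python
import itertools
import math

K = 3  # assumed genotype coding: 0, 1, 2

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_ml(counts):
    """Log marginal likelihood of counts under a uniform Dirichlet prior."""
    k, n = len(counts), sum(counts)
    return math.lgamma(k) - math.lgamma(k + n) + sum(math.lgamma(1 + c) for c in counts)

def feature_stats(data, i):
    """Log marginal likelihoods and predictive tables for SNP i, with and without the arc."""
    by_class = [[0] * K for _ in range(2)]
    for x, c in data:
        by_class[c][x[i]] += 1
    pooled = [by_class[0][g] + by_class[1][g] for g in range(K)]
    pred_arc = [[(row[g] + 1) / (sum(row) + K) for g in range(K)] for row in by_class]
    pred_no = [(pooled[g] + 1) / (sum(pooled) + K) for g in range(K)]
    return sum(log_ml(row) for row in by_class), log_ml(pooled), pred_arc, pred_no

def manb_tables(data, n_features, p_arc):
    """Model-averaged tables P_MANB(SNP_i = g | LOAD = c), one per SNP."""
    tables = []
    for i in range(n_features):
        ml_arc, ml_no, pred_arc, pred_no = feature_stats(data, i)
        # Posterior probability that the arc LOAD -> SNP_i is present.
        w = sigmoid(math.log(p_arc) + ml_arc - math.log(1 - p_arc) - ml_no)
        tables.append([[w * pred_arc[c][g] + (1 - w) * pred_no[g] for g in range(K)]
                       for c in range(2)])
    return tables

def class_prior(data):
    """Posterior mean of P(LOAD = 1) under a uniform prior."""
    return (sum(c for _, c in data) + 1) / (len(data) + 2)

def predict(tables, prior1, x):
    """Ordinary NB inference on the model-averaged tables."""
    log_odds = math.log(prior1 / (1 - prior1))
    for t, g in zip(tables, x):
        log_odds += math.log(t[1][g] / t[0][g])
    return sigmoid(log_odds)

def brute_force(data, n_features, p_arc, x):
    """Explicit average over all 2^n arc configurations (feasible only for tiny n)."""
    stats = [feature_stats(data, i) for i in range(n_features)]
    prior1 = class_prior(data)
    joint = [0.0, 0.0]
    for arcs in itertools.product([0, 1], repeat=n_features):
        log_w = 0.0  # log of structure prior times marginal likelihood
        like = [1 - prior1, prior1]
        for (ml_arc, ml_no, pred_arc, pred_no), has_arc, g in zip(stats, arcs, x):
            log_w += (math.log(p_arc) + ml_arc) if has_arc else (math.log(1 - p_arc) + ml_no)
            for c in range(2):
                like[c] *= pred_arc[c][g] if has_arc else pred_no[g]
        for c in range(2):
            joint[c] += math.exp(log_w) * like[c]
    return joint[1] / sum(joint)

data = [((2, 0, 1), 1), ((2, 1, 0), 1), ((1, 0, 2), 1), ((0, 0, 1), 0),
        ((0, 1, 0), 0), ((1, 0, 2), 0), ((0, 2, 1), 0), ((2, 1, 1), 1)]
fast = predict(manb_tables(data, 3, 0.2), class_prior(data), (2, 0, 1))
slow = brute_force(data, 3, 0.2, (2, 0, 1))
```

Here `fast` does work linear in the number of SNPs while `slow` enumerates eight models, and the two agree up to floating-point error. The factorization works because the structure prior, the marginal likelihood, and the predictive term all decompose SNP by SNP.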

Methods: Prior Probabilities

• Structure priors
  • FSNB and MANB assume each arc is present with some probability p, independent of the status of other arcs in the model.
  • Informed by the literature, we chose a value of p that yields an expected number of arcs of 20.

• Parameter priors
  • If we think of P(SNP_i | LOAD) as defining a table of probabilities, then we assume that every way of filling in that table (consistent with the axioms of probability) is equally likely a priori.
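To make both priors concrete, here is a small worked sketch. It assumes the expected arc count is simply the number of SNPs times p, which follows from independent arcs, and it reads "every table equally likely" as a uniform Dirichlet prior on each row of the table. The genotype counts are invented for the example.

```python
n_snps = 312_318      # 312,316 SNPs plus the two APOE SNPs, from the slides
expected_arcs = 20    # target expected number of arcs, from the slides

# With independent arcs, E[number of arcs] = n_snps * p.
p_arc = expected_arcs / n_snps

def posterior_mean(counts):
    """Posterior mean of genotype probabilities under a uniform Dirichlet prior."""
    total = sum(counts) + len(counts)
    return [(c + 1) / total for c in counts]

# Invented genotype counts among cases for one SNP.
probs = posterior_mean([30, 15, 5])
```

With these numbers p is roughly 6.4e-5, so an arc needs substantial support from the data before its posterior probability becomes appreciable.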

Methods: Experimental Design

• Five-fold cross-validation
• Performance measures
  • Area under the ROC curve (AUC) as a measure of discrimination
  • Calibration plots and Hosmer-Lemeshow goodness-of-fit statistics
  • Run time
• Control algorithms
  • NB
  • FSNB
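For readers unfamiliar with the measures listed above, the following sketch shows one common way to compute them. It is not the evaluation code used in the study. The round-robin fold assignment and the use of ten equal-size groups for the Hosmer-Lemeshow statistic are assumptions.

```python
def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def hosmer_lemeshow(scores, labels, groups=10):
    """Hosmer-Lemeshow chi-square statistic over equal-size groups of sorted predictions."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    stat = 0.0
    for k in range(groups):
        chunk = pairs[k * n // groups:(k + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(s for s, _ in chunk)   # predicted number of cases
        observed = sum(y for _, y in chunk)   # actual number of cases
        mean_p = expected / len(chunk)
        if 0.0 < mean_p < 1.0:
            stat += (observed - expected) ** 2 / (len(chunk) * mean_p * (1 - mean_p))
    return stat

def five_folds(n):
    """Assign indices 0..n-1 to five folds round-robin (shuffle beforehand in practice)."""
    return [list(range(k, n, 5)) for k in range(5)]

example_auc = auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

A lower Hosmer-Lemeshow statistic indicates predicted probabilities that track observed frequencies more closely, which is what the calibration plots show graphically.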

Results: Run time (in seconds)

Machine parameters: CPU 2.33 GHz, RAM 2 GB. Training time was the average over the five cross-validation folds. Time for loading data into memory is not included, but was about XYZ seconds.

[Bar chart: training time in seconds]

MANB: 16.1
NB: 15.6
FSNB: 1684.2

Results: Area under the ROC curve (AUC)

Discussion:

• AUCs of FSNB and MANB are similar (95% confidence interval of their AUC difference is -0.008 to 0.029). Their performance is strongly influenced by several APOE SNPs.

• AUCs of NB and MANB differ significantly (p < 0.00001).

Results: Calibration plot of NB

Discussion:

NB is poorly calibrated with almost all the test cases having probability predictions near 0 or 1. Such extreme predictions occur because there are such a large number of features in the model.
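A back-of-the-envelope sketch shows why a large number of features pushes NB predictions toward 0 or 1. It assumes, purely for illustration, that each SNP adds an independent, zero-mean, normally distributed term with standard deviation `sigma` to the log-odds. The value 0.05 is invented and is not estimated from the data.

```python
import math

def frac_extreme(n_features, sigma, cutoff=0.99):
    """Fraction of cases with an NB posterior outside [1 - cutoff, cutoff],
    assuming each feature adds an independent N(0, sigma^2) log-odds term."""
    sd = sigma * math.sqrt(n_features)        # spread of the summed log-odds
    bound = math.log(cutoff / (1 - cutoff))   # log-odds at the cutoff
    inside = math.erf(bound / (sd * math.sqrt(2)))
    return 1 - inside

few = frac_extreme(3, 0.05)          # a handful of features, as in FSNB
many = frac_extreme(312_318, 0.05)   # every SNP, as in NB
```

Under these assumptions essentially no predictions are extreme with three features, while most are extreme with all 312,318, because the spread of the summed log-odds grows with the square root of the number of features.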

Results: Calibration plot of NB and FSNB

Discussion:

FSNB is the best calibrated algorithm among the three we evaluated. This result is likely due to the FSNB models containing only a few SNP features (< 4).

Results: Calibration plot of NB, FSNB and MANB

Discussion:

MANB is better calibrated than NB.

MANB is not as well calibrated as FSNB. We believe this result may be due to FSNB having such a small number of features in its models.

Summary of Results

              NB     FSNB   MANB
AUC                  +      +
Calibration          ++     +
Run time      ++            ++

Algorithm Availability

• A full description of the MANB algorithm is available in the appendix of our paper.

• It provides all the details needed to readily implement the algorithm.

Future Work Includes the Following

• Apply the MANB algorithm to additional datasets
• Predict additional clinical outcomes
• Use both genomic and clinical data to predict clinical outcomes
• Explore the use of additional genome-wide measurement platforms, including next generation sequencing data
• Include additional control algorithms in future evaluations

Acknowledgement

• We thank Mr. Kevin Bui for his help in data preparation, software development, and the preparation of the appendix. We thank Dr. Pablo Hennings-Yeomans, Dr. Michael Barmada, and the other members of our research group for helpful discussions.

• The research reported here was funded by NLM grant R01-LM010020 and NSF grant IIS-0911032.

Thank you

Questions?
