

Genetic Epidemiology 32: 152–167 (2008)

A Support Vector Machine Approach for Detecting Gene-Gene Interaction

Shyh-Huei Chen,1 Jielin Sun,2 Latchezar Dimitrov,2 Aubrey R. Turner,2 Tamara S. Adams,2 Deborah A. Meyers,2 Bao-Li Chang,2 S. Lilly Zheng,2 Henrik Grönberg,4 Jianfeng Xu,2* and Fang-Chi Hsu3

1 Department of Industrial Management, National Yunlin University of Science and Technology, Yunlin, Taiwan
2 Center for Human Genomics, Wake Forest University School of Medicine, Winston-Salem, North Carolina
3 Department of Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, North Carolina
4 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden

Although genetic factors play an important role in most human diseases, multiple genes or genes and environmental factors may influence individual risk. In order to understand the underlying biological mechanisms of complex diseases, it is important to understand the complex relationships that control the process. In this paper, we consider different perspectives, from optimization, complexity analysis, and algorithmic design, which allow us to describe a reasonable and applicable computational framework for detecting gene-gene interactions. Accordingly, support vector machine and combinatorial optimization techniques (local search and genetic algorithm) were tailored to fit within this framework. Although the proposed approach is computationally expensive, our results indicate this is a promising tool for the identification and characterization of high order gene-gene and gene-environment interactions. We have demonstrated several advantages of this method, including the strong power for classification, less concern for overfitting, and the ability to handle unbalanced data and achieve more stable models. We would like to make the support vector machine and combinatorial optimization techniques more accessible to genetic epidemiologists, and to promote the use and extension of these powerful approaches. Genet. Epidemiol. 32:152–167, 2008. © 2007 Wiley-Liss, Inc.

Key words: association analysis; data mining; gene-gene interaction; SNPs; support vector machine

Contract grant sponsor: Swedish Cancer Foundation; Contract grant sponsor: NCI; Contract grant number: CA1R01CA105055-01A1.
*Correspondence to: Dr. Jianfeng Xu, Center for Human Genomics, Wake Forest University School of Medicine, Medical Center Blvd, Winston-Salem, NC 27157. E-mail: [email protected]
Received 17 May 2007; Revised 2 August 2007; Accepted 14 September 2007
Published online 29 October 2007 in Wiley InterScience (www.interscience.wiley.com).
DOI: 10.1002/gepi.20272

INTRODUCTION

Although genetic factors play an important role in many human diseases, multiple genes or genes and environmental factors may ultimately influence individual risk to these diseases [Cordell and Clayton, 2005]. Currently, the primary statistical approach used to identify causal genes is the conventional method of single gene or single-nucleotide polymorphism (SNP) association analysis. While single gene association analysis is a simple and well-established approach, it is unlikely to identify interactions between multiple genes when the number of genes increases dramatically, and can thus decrease the power to detect associations. There is now a growing need for methods that are able to tackle gene-gene interactions, especially with the availability of high-throughput genotyping and with advances in computation.

Velez et al. [2007] provided a detailed overview of the definition of gene-gene interaction (epistasis) from both the statistical and biological viewpoints. It is important to understand the relationship between the statistical and biological epistasis if we want to make biological inferences from statistical results [Moore and Williams, 2005]. From the statistical viewpoint, we define gene-gene interaction in this paper as genotypic combinations of SNPs that are associated with disease status. The association arising from the interaction could be nonlinear. A summary of analytical tools for studying gene-gene or gene-environment interaction has been described in Thornton-Wells et al. [2004]. Due to the computational complexity of gene-gene interaction modeling and detection, traditional statistical tools are not appropriate for analyzing large-scale genetic data. However, it appears some of the computational limitations of detecting gene-gene interaction can be overcome using modern techniques, such as machine learning and data mining. Machine learning (supervised learning) predictors or classifiers, such as classification and regression tree (CART), artificial neural networks (ANNs), and support vector machines (SVMs), are suitable because they can be used to assess the rules for classifying individuals into case and control groups [Alpaydin, 2004], which is one of the major goals of detecting interaction among multiple genes. It should be noted that an ideal predictor should be able to extract the most useful information from an entire dataset and avoid overfitting during the training (learning) process, in order to ultimately produce the highest possible prediction accuracy rates (better than random guesses) for both classes in testing (prediction). With an applicable predictor on hand, the problem of detecting interactions among multiple genes can be considered as a combinatorial optimization problem: finding the best combination of SNPs from a given dataset which can produce the highest prediction accuracy when using the applicable predictor. In this case, the applicable predictor serves as the objective function of a combinatorial optimization problem. Thus, the use of an applicable predictor together with a tailored optimization search technique might provide a promising approach for solving this problem.

There are two major aims of this paper. We first intend to provide a brief review of several data mining approaches that have been adopted for the detection of gene-gene interactions, especially methods dealing with binary trait outcomes, such as disease status, while also addressing the corresponding computational and algorithmic issues. Second, we will present an approach for detecting gene-gene interaction that seems poised to resolve those issues. We will tackle this problem by testing the newly proposed approach on simulated data and on a real prostate cancer genotyping dataset to illustrate the validity of the approach, and then comparing our results to a benchmark analysis based on multifactor-dimensionality reduction [MDR; Ritchie et al., 2001].

Classical decision tree approaches, such as CART, have been adopted in previous studies that have demonstrated their efficiency and power in detecting gene-gene interaction [Cook et al., 2004]. However, in order to maintain computational efficiency, CART uses a greedy strategy in tree building/searching in which it chooses the locally best discriminatory feature at each stage, and as a result the output classification tree is determined in the first stage of the analysis (the feature selected in the root node). Thus, in most cases, CART is not able to produce an optimal decision tree. Several other approaches based on the classifier-optimization framework have been designed and applied in empirical studies, including MDR [Ritchie et al., 2001], genetic programming neural network [GPNN; Ritchie et al., 2003b], logic regression with simulated annealing [Ruczinski et al., 2003] or Markov chain Monte Carlo [Kooperberg and Ruczinski, 2005], and SVM with recursive feature addition [Listgarten et al., 2004] or recursive feature elimination [Guyon et al., 2002]. These methods are discussed in the following.

MDR is a modification of the combinatorial-partitioning method [Nelson et al., 2001] that is specifically designed (originally) for handling balanced genotypic data. For a given set of SNPs, MDR partitions all individuals according to their combinations of genotypes, and the ratio of genotypes among cases versus controls is then used to group all of the partitions into two classes, either high risk (ratio $\ge$ 1) or low risk (ratio $<$ 1). In this case, this grouping works as a classifier. In order to determine the best combination of SNPs, MDR utilizes an exhaustive search from one SNP to pre-assigned $n$ SNP combinations. MDR is an intuitive and simple approach that has been successful at finding gene-gene interactions [Thornton-Wells et al., 2004]. However, MDR suffers from the curse of dimensionality due to the exhaustive search, and there is no theoretical (mathematical or statistical) support for the idea that such a classification (partitioning and grouping) is applicable. Furthermore, the influence of unbalanced data and missing values on the partitioning and grouping in MDR is not clear, and this issue has not been theoretically addressed.
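The partition-and-group step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MDR software: the function names `mdr_classifier` and `mdr_accuracy`, the 0/1/2 genotype coding, the toy data, and the rule that unseen genotype cells default to low risk are all hypothetical choices, and the accuracy computed here is a training accuracy rather than a cross-validated one.

```python
from collections import Counter

def mdr_classifier(genotypes, labels, snps):
    """Label each multilocus genotype cell as high risk (1) or low risk (0).

    genotypes: one tuple of genotype codes (0, 1, 2) per individual
    labels:    1 for case, 0 for control
    snps:      indices of the SNPs that form the combination
    """
    cases, controls = Counter(), Counter()
    for g, y in zip(genotypes, labels):
        cell = tuple(g[j] for j in snps)
        (cases if y == 1 else controls)[cell] += 1
    risk = {}
    for cell in set(cases) | set(controls):
        # "ratio >= 1" is tested as cases >= controls to avoid dividing by zero
        risk[cell] = 1 if cases[cell] >= controls[cell] else 0
    return risk

def mdr_accuracy(genotypes, labels, snps, risk):
    """Fraction of individuals whose cell label matches their disease status."""
    hits = 0
    for g, y in zip(genotypes, labels):
        cell = tuple(g[j] for j in snps)
        # Assumption: a cell never seen when building `risk` counts as low risk
        hits += int(risk.get(cell, 0) == y)
    return hits / len(labels)

# Hypothetical toy data: 8 individuals genotyped at 3 SNPs
G = [(0, 1, 2), (0, 1, 0), (1, 1, 2), (1, 0, 0),
     (0, 1, 1), (1, 0, 2), (0, 0, 0), (1, 1, 1)]
y = [1, 1, 0, 0, 1, 0, 0, 1]

risk = mdr_classifier(G, y, snps=(0, 1))
acc = mdr_accuracy(G, y, (0, 1), risk)
print(risk, acc)
```

With balanced data the threshold of 1 on the case/control ratio is natural; with unbalanced data the same rule favors the larger class, which is the concern raised in the paragraph above.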

ANNs have been shown to be universal approximators [Hornik et al., 1989] and have been shown to have excellent power for performing pattern recognition and classification, optimization, and function approximation in various engineering and social science applications. Basically, the ANN training process is an unconstrained nonlinear optimization problem, hence the existence of local optima could be obstacles for finding the global optimum. For a given ANN architecture, it is common that several trial-and-error settings in both the user-defined parameters and the randomized initial solution (link weights and biases) of the corresponding nonlinear programming are needed in order to achieve a well-trained ANN predictor. In order to approach the gene-gene interaction problem, ANN should serve as the objective function of the SNP combinatorial optimization problem. In such cases, this ANN should be a one-to-one function, which means for a given dataset with a fixed feeding data sequence, the training (or testing) output of the ANN must be the same, either in mean square error or prediction error. (However, when the ANN output has the same mean square error, this does not mean the ANN possesses the same prediction error.) In order to obtain fixed results, the parameters and the initial solution must be fixed. However, this forced parameter setting is ineffective for training an ANN, whether or not its architecture is optimal. As a result, ANNs should not be used as a predictor for the detection of gene-gene interaction under the classifier-optimization framework. ANNs have also been used to detect the importance of SNPs by utilizing the so-called "contribution value", which is calculated based upon the weights in the trained network, of each individual SNP under study [Lucek and Ott, 1997]. In such cases, all or most of the SNPs are included when training an ANN. However, the contribution value does not have a theoretical result that can be used to represent the influence of each individual gene on the studied disease. Furthermore, in practical experiments, the contribution values and their ranks vary quite considerably from run to run. As a result, using the contribution values from a trained ANN as SNP selection criteria does not appear to be a reasonable approach. GPNN [Ritchie et al., 2003b] is another approach that uses ANN to model and detect gene-gene interactions. However, it is not clear what type of interaction was detected and how the authors performed the feature (SNP) selection using GPNN. Moreover, GPNN is also expected to encounter all of the same types of practical issues as ANN.

Logic regression (LG) [Ruczinski et al., 2003] is a generalized regression method specially designed for handling binary covariates. Let $X_1, \ldots, X_k$ be binary predictors and $Y$ be a response variable. LG can be written in regression form as $g(E[Y]) = \beta_0 + \sum_{j=1}^{t} \beta_j L_j$, where $\beta_j$, $j = 0, \ldots, t$, are parameters and $L_j$, $j = 1, \ldots, t$, are Boolean expressions of some predictors $X_i$. A score function is defined to reflect the quality of the model under consideration, and can also be used as a ranking criterion when searching for the best model. The predictors (SNPs) involved in the logic expression are the best SNP combination in modeling gene-gene interaction. As the regression form is not continuous and is not differentiable, classical optimization techniques cannot be performed to find the optimal model. As a result, LG can only utilize heuristics (simulated annealing or Monte Carlo) to find the optimal logic expression of Boolean operations. Compared to CART and MDR, LG should be more comprehensive in model building and searching. However, due to the manipulation complexity of logic expressions and the process of searching for the best parameters $\beta_j$, LG is computationally expensive.

In addition to the selection of an applicable classifier, there are two other computational issues that should be addressed in designing search algorithms for the identification of interactions: the number of SNPs involved in modeling the interaction (which is unknown in advance) and the total combinatorial number of involved SNPs (which usually increases exponentially). Due to the exponential growth in the number of analyses for combinations of SNPs, exact search methods, such as the one used in MDR, must restrict the search region to combinations of a few SNPs. In other words, exact algorithms for finding the optimal combination of SNPs are not appropriate when a higher order interaction exists. Thus, search techniques based on a heuristic or meta-heuristic should be considered for solving this kind of problem. Note that if we consider the number of SNPs involved in forming interactions as a variable, such as in LG, then this variable might lead the search algorithm to produce undesirable results. For instance, consider a search process in which the solution first consists of four SNPs with an accuracy of 60%, but then a search for the best neighbor solution identifies a solution that consists of five SNPs with an accuracy of 60.1%. In this case, most search algorithms will take the five-SNP combination for the next solution because of the higher accuracy. However, this increase is not surprising because as more SNPs are involved in an interaction, the accuracy may generally be expected to increase (as shown in Fig. 4). Simply put, this is not a fair comparison of accuracy. Furthermore, suppose that the above results happened to be the final results of a search algorithm and there is no additional information available, such as the statistical significance of the difference between the two accuracy rates or knowledge from molecular biology, to support the finding which suggests that the interaction among the five SNPs is stronger than that among the four SNPs. If we then use the Occam's Razor principle, which states that given two equally predictive theories the simplest answer is usually the correct answer, then the group of four SNPs should be selected as the best choice. Furthermore, from the aspect of algorithmic design for an optimization problem, representations of fixed-length solutions and their corresponding neighborhood structure should be much easier to create and maintain than those of variable length. Consequently, the optimal approach is to fix the number of SNPs considered in establishing interactions. Due to the extensive nature of the LG search, the varying number of SNPs that are compared, and the fact that the corresponding neighborhood structure is very complicated, the LG search is thought to be computationally expensive and ineffective (contrary to the Occam's Razor principle).
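To make the growth in the number of SNP combinations concrete, the short Python sketch below counts how many classifier evaluations an exhaustive search would need. The helper names `n_combinations` and `exhaustive_cost` and the dataset sizes are illustrative assumptions, not values from the study.

```python
from math import comb

def n_combinations(m, n):
    """Number of distinct n-SNP combinations among m SNPs."""
    return comb(m, n)

def exhaustive_cost(m, max_order):
    """Evaluations needed to test every combination of order 1..max_order."""
    return sum(comb(m, k) for k in range(1, max_order + 1))

# Hypothetical dataset sizes, printed for interaction orders 2 to 5
for m in (20, 100, 1000):
    print(m, [n_combinations(m, n) for n in (2, 3, 4, 5)])
```

Each evaluation here stands for one full cross-validated training of the classifier, so even moderate values of m and n put an exact search out of reach.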

SVM is a state-of-the-art classification technique and has been benchmarked against other classifiers in several studies in different research fields [Cristianini and Shawe-Taylor, 2000]. Although a few studies [Guyon et al., 2002; Listgarten et al., 2004; Schwender et al., 2004] have shown that SVMs are promising predictors for the detection of gene-gene interactions, several aforementioned issues encountered by other classifiers have to be considered in order to tailor SVM into an applicable predictor. Note that the data sample sizes (analyzed by SVMs) in these studies are relatively small [72 in Guyon et al., 137 in Listgarten et al., and 437 in Schwender et al.], and it is not clear whether the unmodified SVM can be applied directly to larger datasets without first considering the influence of unbalanced data and missing values on the prediction accuracy. Furthermore, both Listgarten et al. and Guyon et al. adopted similar greedy search techniques to find the best SNP combination [no SNP combination considered in Schwender et al., 2004], and they were only able to produce a very suboptimal solution. In this study, an SVM approach is proposed for the detection of gene-gene interaction, and to address the aforementioned issues including parameter setting in predictors, the design of the search algorithm, and data processing (unbalanced data and missing values).

METHODS

SUPPORT VECTOR MACHINE

Given an input-output (features-classes) set of paired data $(x_i, y_i)$, $x_i \in R^n$, $y_i \in \{-1, 1\}$, $i = 1, \cdots, l$, the training process of a (linear) support vector machine aims to find a linear separating hyperplane $w^T x + b = 0$ with the maximal margin ($2\|w\|^{-1}$, the distance between $w^T x + b = -1$ and $w^T x + b = 1$) under the classification conditions:

$$
\begin{cases}
w^T x_i + b \le -1, & \text{if } y_i = -1, \\
w^T x_i + b \ge 1, & \text{if } y_i = 1.
\end{cases} \tag{1}
$$

If such a hyperplane exists, $f(x) = \mathrm{sign}(w^T x + b)$ is called the decision function of SVM, and some training vectors $x_i$ are called support vectors if they are located on their corresponding decision region boundary ($w^T x + b = -1$ or $1$). It should be noted that ANN and other classifiers are trained to minimize the prediction errors; in contrast, SVM aims to maximize the prediction accuracy (maximal decision boundary). A geometric description of linear SVM is shown in the left panel of Figure 1.

However, it is usually not possible to achieve an errorless linear separation. Therefore, some training errors must be allowed in the formulation in order to obtain a feasible solution for the system (1). The error terms, $\xi_i$, are introduced into the system (1) by modifying the constraints into $y_i (w^T x_i + b) \ge 1 - \xi_i$ and penalizing the objective function with $C \sum_{i=1}^{l} \xi_i$, where the $\xi_i$ (soft margin) are the distances between the misclassified patterns and their corresponding boundary, and $C$ is a penalty parameter (as shown in the right panel of Figure 1).
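As a small numerical illustration of the error terms, the Python sketch below computes each $\xi_i$ as the smallest value satisfying $y_i (w^T x_i + b) \ge 1 - \xi_i$ with $\xi_i \ge 0$, and then evaluates the penalized objective. The hyperplane, the four training points, and the function names are hypothetical, and the hyperplane is simply fixed by hand rather than optimized.

```python
def slack(w, b, x, y):
    """Smallest xi >= 0 with y * (w.x + b) >= 1 - xi for one training point."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def soft_margin_objective(w, b, X, Y, C):
    """0.5 * w.w + C * sum of slacks, for a given (not optimized) hyperplane."""
    reg = 0.5 * sum(wi * wi for wi in w)
    return reg + C * sum(slack(w, b, x, y) for x, y in zip(X, Y))

# Hypothetical hyperplane and training points
w, b = (1.0, 0.0), 0.0
X = [(2.0, 0.0), (0.5, 1.0), (-1.0, 0.0), (0.5, 0.0)]
Y = [1, 1, -1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(X, Y)]
objective = soft_margin_objective(w, b, X, Y, C=1.0)
print(slacks, objective)
```

A slack of zero means the point lies on or outside its boundary, a value between 0 and 1 means it falls inside the margin but on the correct side, and a value above 1 means it is misclassified.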

Fig. 1. A hyperplane ($w^T x + b = 0$) is identified by support vector machine in order to discriminate circles and squares with a maximal margin $2/\|w\|$. The left panel is linearly separable while the right one is not. Vectors $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, and $x_6$ are all support vectors.

If the decision function of a classification problem is a nonlinear function of arbitrary complexity, then a linear SVM is not sufficient for performing classification. Note that a complex pattern classification problem that is cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space [Cover's theorem, 1965]. In other words, if training vectors $x_i$ are mapped into a higher ($> n$, maybe infinite) dimensional space by a nonlinear mapping $\phi$, then a linearly separating hyperplane is more likely to be found in this higher dimensional space (as shown in Fig. 2). Together with error terms and nonlinear mapping, SVM training can generally be formulated to find the optimal solution of the following quadratic programming problem:

$$
\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \ldots, l. \tag{2}
$$

It should be noticed that every local solution of (2) is also global. There is no analytical method to determine an appropriate $\phi$ which can be used in (2) to provide sufficient discrimination power in a higher dimensional space. However, this problem can be solved by its Lagrange dual and the corresponding optimality conditions [Cristianini and Shawe-Taylor, 2000]. Furthermore, in this dual approach, the precise expression of the nonlinear mapping $\phi$ is not necessary; instead a positive semidefinite kernel function, $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, is needed. Some commonly used kernel functions in SVM are $K(x_i, x_j) = x_i^T x_j$ (linear kernel), $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$ (polynomial kernel), and $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ (radial basis function), where $r$, $d$, and $\gamma > 0$ are kernel parameters. In general, training vectors with a corresponding positive Lagrange multiplier are defined as support vectors.
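The three kernels listed above translate directly into code. The following Python sketch is an illustration only: the function names, the default parameter values, and the genotype vectors are assumptions made for the example, and `gram_matrix` is a hypothetical helper that shows the kernel matrix a dual solver would consume.

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x^T z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, gamma=1.0, r=0.0, d=2):
    """K(x, z) = (gamma * x^T z + r)^d"""
    return (gamma * linear_kernel(x, z) + r) ** d

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2), with gamma > 0"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(X, kernel):
    """Matrix of kernel values for every pair of training vectors."""
    return [[kernel(a, b) for b in X] for a in X]

# Hypothetical genotype vectors coded 0/1/2 at three SNPs
X = [(0, 1, 2), (1, 1, 0), (2, 0, 1)]
K = gram_matrix(X, lambda a, b: rbf_kernel(a, b, gamma=0.5))
print(K)
```

The radial basis function always equals 1 on the diagonal and decays toward 0 as two genotype vectors move apart, with $\gamma$ controlling how quickly.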

We note that the linear kernel is a special case of the radial basis function [Keerthi and Lin, 2003] and that there may be some numerical difficulty (ill-conditioned matrices) in using polynomial kernels with a higher order. Furthermore, our pilot computational experiments also demonstrate and confirm this in similar situations. As a result, we limit ourselves to use of the radial basis function kernel.

It should be noted that the larger margin obtained by solving (2) does not mean the SVM possesses higher prediction accuracy. Moreover, bias of prediction accuracy could occur if the data are unbalanced. In order to obtain a higher prediction accuracy and also control the bias of prediction accuracy, a penalized SVM formulation is used to balance accuracy from both classes:

$$
\min_{w, b, \xi} \ \frac{1}{2} w^T w + C_{+} \sum_{y_i = 1} \xi_i + C_{-} \sum_{y_i = -1} \xi_i
\quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \ldots, l, \tag{3}
$$

where $C_{-}$ and $C_{+}$ are penalty parameters for the size of the soft margin in classes ($-$) (i.e., control) and ($+$) (i.e., case), respectively. However, there is no analytical way to determine the values of $C_{-}$ and $C_{+}$ beforehand. In this paper, a simple bisection search is proposed for finding appropriate penalty parameters. First, $C_{-} = C$ is fixed, then we calculate $C_{+} = (\#\text{Controls} / \#\text{Cases})\, C$ from the corresponding non-missing data subsets. If such $C_{+}$ can produce reasonable training accuracy for both classes, $C_{+}$ is accepted. Otherwise, $C_{+}$ is used as a middle point of an interval $[0, 2C_{+}]$ and a bisection search is performed in this interval to determine the value of $C_{+}$. If no such $C_{+}$ exists and the accuracy rate of any one class is less than 50%, then the output accuracy will be marked as unacceptable. These indices can be used as an indicator to penalize the undesirable solutions while designing the search algorithm. In this paper, the accuracy difference between two classes is controlled, such that it must be less than 5%, in order to ensure a reasonable predictor.
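The bisection search for $C_{+}$ can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `evaluate` is a hypothetical callback that would train the penalized SVM with $C_{-} = C$ and the given $C_{+}$ and return the two class accuracies; the rule for choosing which half of the interval to keep (raise $C_{+}$ when cases are classified worse than controls) is an assumption, since the text does not spell it out; and `toy_evaluate` is an invented stand-in so that the example runs without an SVM.

```python
def initial_c_plus(n_controls, n_cases, C=1.0):
    """Starting value C+ = (#controls / #cases) * C, with C- = C fixed."""
    return (n_controls / n_cases) * C

def search_c_plus(evaluate, n_controls, n_cases, C=1.0, tol=0.05, max_iter=20):
    """Bisection on [0, 2*C+] until the two class accuracies differ by < tol.

    evaluate(c_plus) -> (accuracy_cases, accuracy_controls); hypothetical callback.
    Returns (c_plus, accepted).
    """
    c_plus = initial_c_plus(n_controls, n_cases, C)
    lo, hi = 0.0, 2.0 * c_plus
    for _ in range(max_iter):
        acc_case, acc_ctrl = evaluate(c_plus)
        if abs(acc_case - acc_ctrl) < tol and min(acc_case, acc_ctrl) >= 0.5:
            return c_plus, True
        # Assumption: a larger C+ penalizes case errors more, raising case accuracy
        if acc_case < acc_ctrl:
            lo = c_plus
        else:
            hi = c_plus
        c_plus = 0.5 * (lo + hi)
    return c_plus, False  # caller marks this solution as unacceptable

def toy_evaluate(c_plus):
    """Invented accuracy curves standing in for an actual SVM training run."""
    acc_case = min(1.0, 0.4 + 0.1 * c_plus)
    acc_ctrl = max(0.0, 0.9 - 0.1 * c_plus)
    return acc_case, acc_ctrl

c_found, accepted = search_c_plus(toy_evaluate, n_controls=400, n_cases=100)
print(c_found, accepted, toy_evaluate(c_found))
```

Every call to `evaluate` costs one SVM training, which is the extra computational burden the text attributes to correcting the bias of prediction accuracy.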

We have implemented SVM in this manner in order to maintain the same computational procedure with the same comparison basis of average accuracy of prediction, because there are several user-defined parameters in SVM which will affect the training/testing output. The average accuracy for v-fold cross-validation will be fixed only if we keep the same data feeding sequence while performing the SVM training/testing process. Thus, SVM can serve as an objective function adopted in different search techniques, such as exact search (exhaustive search), local search (greedy search), and meta-heuristic (genetic algorithms), in order to find the best combination of SNPs. However, there are two disadvantages in this computational procedure. Basically, SVM training aims to solve a corresponding quadratic programming problem. Usually the computational complexity is much higher than that of decision tree training or MDR. It should also be noted that some additional SVM trainings are needed in order to correct the bias of prediction accuracy. As a result, the proposed procedure could be very computationally expensive. Another disadvantage is that because $C_{-}$ is forced to be a constant, the accuracy produced by the SVM could be suboptimal. In order to resolve this problem, a grid search for SVM parameters ($C$, $\gamma$) can be performed using some promising SNP combinations to refine the results.
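The requirement that the cross-validated accuracy stay fixed for a given SNP combination can be met by freezing the fold assignment once and reusing it for every evaluation. The Python sketch below illustrates this idea only; `fixed_folds`, `cv_accuracy`, the seed value, and the `fit_predict` callback (a stand-in for training and testing an SVM) are hypothetical names introduced for the example.

```python
import random

def fixed_folds(n_samples, v, seed=0):
    """Deterministic v-fold partition: the same seed gives the same folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[k::v] for k in range(v)]

def cv_accuracy(fit_predict, X, y, folds):
    """Pooled test accuracy over the folds.

    fit_predict(X_train, y_train, X_test) -> predicted labels; hypothetical
    callback standing in for training and applying the classifier.
    """
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct += sum(int(p == y[i]) for p, i in zip(preds, test_idx))
    return correct / len(y)

def majority_class(X_train, y_train, X_test):
    """Trivial stand-in classifier: always predict the most common training label."""
    label = max(set(y_train), key=y_train.count)
    return [label] * len(X_test)

# Hypothetical data: 10 individuals, one feature, label equal to the feature
X = [(i % 2,) for i in range(10)]
y = [x[0] for x in X]
folds = fixed_folds(len(y), v=5, seed=1)
print(cv_accuracy(majority_class, X, y, folds))
```

Because `folds` is computed once, two calls with the same SNP subset return the same value, which is what allows the accuracy to act as an objective function inside a search.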

Fig. 2. Support vector machine performs linear separation in the higher dimensional space associated with $\phi$. The resulting linear separation can be nonlinear in the original space.

SEARCH ALGORITHMS

In this study, four search algorithms to detect interactions among multiple SNPs are proposed: recursive feature addition SVM (SVM-RFA), recursive feature elimination SVM (SVM-RFE), SVM with local search (SVM-Local), and SVM with genetic algorithm (SVM-GA). A library for SVM (LIBSVM) developed by Chang and Lin [2005] was adopted as the SVM engine (http://www.csie.ntu.edu.tw/~cjlin/libsvm), and the proposed approaches are implemented in MATLAB. GA is selected as one of the search engines because it is a popular meta-heuristic that can perform global searches very effectively and efficiently, plus there is no information from the data under investigation which can be utilized to perform an intelligent search. However, GAs do not have the ability to perform an in-depth local search, and a local search is needed to ensure better solutions. Local search is a very simple and efficient search method which is guaranteed to produce a local optimum. However, in order to obtain a near-optimal solution, a considerable number of runs of the local search must be executed. SVM-RFA and SVM-RFE are similar to the greedy search techniques used in Listgarten et al. [2004] and Guyon et al. [2002]. We have implemented these for the purpose of benchmarking.

To illustrate the algorithms, the following notation is used: $m$ represents the total number of SNPs in the dataset under investigation; $n$ is the number of SNPs involved in an interaction; $M$ is an index set that contains all SNPs ($M = \{1, 2, \cdots, m\}$); $S$ is a subset of $M$ containing $n$ SNPs ($S = \{s_1, s_2, \ldots, s_n\} \subset M$); and $F(S)$ is the average (testing) prediction accuracy, (number of cases classified as case + number of controls classified as control)/(total number of cases + number of controls), of v-fold cross-validation SVM considering $S$.

SVM-RFE and SVM-RFA. RFE/A uses feature ranking criteria to define nested subsets of features $S_1 \subset S_2 \subset \ldots \subset S_m$, then model selection is used to select an optimal subset of features. RFE [Guyon et al., 2002] eliminates one or more genes at each iteration, using correlation coefficients $w_i^2 = \left( (\mu_{i+} - \mu_{i-}) / (\sigma_{i+} + \sigma_{i-}) \right)^2$ as the feature (gene) ranking, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of gene $i$ for each class. The information gain for each SNP is used as the ranking criterion for recursive addition of SNPs [Listgarten et al., 2004]. Because we are interested in obtaining high prediction accuracy, we used changes in average accuracy of the predictions when one SNP was removed/added as the ranking criterion. SVM-RFE and SVM-RFA are outlined in the accompanying algorithm listings.

SVM with Local Search. The local search method, conventionally, is an iterative procedure that performs a neighborhood search. In each step, for the current solution $S^*$ (randomly selected in the initialization), a search procedure is applied to identify the best solution $\bar{S}$ in the neighborhood of $S^*$, $N(S^*)$, an associated set of neighbor solutions of $S^*$. If the solution $\bar{S}$ is better than $S^*$, move $S^*$ to $\bar{S}$. Then repeat the neighborhood search for $\bar{S}$. Otherwise, the local search is finished and terminated at solution $S^*$. In our implementation, the neighborhood of $S^* = \{s_1^*, s_2^*, \ldots, s_n^*\}$ is defined as the set $N(S^*) = \{ S = \{s_1, s_2, \ldots, s_n\} : |S \cap S^*| = n - 1 \}$, i.e., if $S$ is a neighbor of $S^*$ then there is only one element that is different between $S^*$ and $S$. In this case, the size of this neighborhood is $|N(S)| = n(m - n)$. Local search methods can finish the searching procedure in a relatively small number of iterations; however, they often stop at a local optimum.

SVM-Local algorithm

1: initialize $S^*$; evaluate $F(S^*)$
2: repeat
       $\bar{S} = \arg\max_{S \in N(S^*)} F(S)$
       if $F(\bar{S}) > F(S^*)$ then $S^* \leftarrow \bar{S}$
       otherwise, exit repeat
3: output $S^*$ and $F(S^*)$
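The SVM-Local outline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_fitness` is a hypothetical stand-in for the cross-validated SVM accuracy $F(S)$, the function names are ours, and the sketch assumes $n < m$ so that the neighborhood is never empty.

```python
import random
from typing import Callable, FrozenSet, Iterator, Tuple

Subset = FrozenSet[int]


def neighbors(current: Subset, m: int) -> Iterator[Subset]:
    """Yield every subset that differs from `current` in exactly one SNP."""
    outside = [j for j in range(1, m + 1) if j not in current]
    for removed in sorted(current):
        for added in outside:
            yield (current - {removed}) | {added}


def local_search(fitness: Callable[[Subset], float], m: int, n: int,
                 rng: random.Random) -> Tuple[Subset, float]:
    """Swap-neighborhood hill climbing from a random n-subset of {1..m}."""
    current = frozenset(rng.sample(range(1, m + 1), n))
    current_fit = fitness(current)
    while True:
        # Best neighbor of the current solution (the S-bar of the outline).
        best_fit, best = max(((fitness(s), s) for s in neighbors(current, m)),
                             key=lambda pair: pair[0])
        if best_fit > current_fit:
            current, current_fit = best, best_fit
        else:
            return current, current_fit  # no improving neighbor: local optimum


def toy_fitness(subset: Subset) -> float:
    """Hypothetical stand-in for F(S): how many of SNPs 3 and 7 are in S."""
    return float(len(subset & {3, 7}))


best_subset, best_score = local_search(toy_fitness, m=10, n=2,
                                       rng=random.Random(0))
```

In a real run each fitness call would be a full cross-validated SVM fit, so every iteration costs $n(m - n)$ model evaluations; caching scores for subsets already visited would be a natural refinement.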

SVM-RFE algorithm

1: $n \leftarrow m$; $S_n = M$
2: repeat until $n = 1$
       $i^* = \arg\max_{i \in S_n} F(S_n \setminus \{i\})$
       $S_{n-1} = S_n \setminus \{i^*\}$
       $n \leftarrow n - 1$
3: output $S_n$ and $F(S_n)$, $n = 1, \ldots, m$

SVM-RFA algorithm

1: $n \leftarrow 0$; $S_0 = \emptyset$
2: repeat until $n = m$
       $i^* = \arg\max_{i \in M \setminus S_n} F(\{i\} \cup S_n)$
       $S_{n+1} = S_n \cup \{i^*\}$
       $n \leftarrow n + 1$
3: output $S_n$ and $F(S_n)$, $n = 1, \ldots, m$
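Both greedy procedures can be written compactly. The sketch below is a minimal illustration, not the authors' implementation: `toy_fitness` is a hypothetical replacement for the cross-validated accuracy $F(S)$, and ties are broken by the lowest SNP index, which is our own choice.

```python
from typing import Callable, Dict, FrozenSet

Subset = FrozenSet[int]


def svm_rfe(fitness: Callable[[Subset], float], m: int) -> Dict[int, Subset]:
    """Backward elimination: drop the SNP whose removal leaves the best score."""
    current = frozenset(range(1, m + 1))
    nested = {m: current}
    for size in range(m, 1, -1):
        drop = max(sorted(current), key=lambda i: fitness(current - {i}))
        current = current - {drop}
        nested[size - 1] = current
    return nested


def svm_rfa(fitness: Callable[[Subset], float], m: int) -> Dict[int, Subset]:
    """Forward addition: add the SNP that gives the best score when included."""
    current: Subset = frozenset()
    nested = {}
    for size in range(1, m + 1):
        candidates = sorted(set(range(1, m + 1)) - current)
        add = max(candidates, key=lambda i: fitness(current | {i}))
        current = current | {add}
        nested[size] = current
    return nested


def toy_fitness(subset: Subset) -> float:
    """Hypothetical score: rewards SNPs 2 and 5, small penalty per SNP used."""
    return len(subset & {2, 5}) - 0.01 * len(subset)


rfe = svm_rfe(toy_fitness, m=6)   # nested subsets of size 6 down to 1
rfa = svm_rfa(toy_fitness, m=6)   # nested subsets of size 1 up to 6
```

Each returned dictionary maps a subset size to the subset chosen at that size, which mirrors the nested sequence $S_1 \subset S_2 \subset \ldots \subset S_m$ that model selection then picks from.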


SVM with Genetic Algorithm. GAs are stochastic search techniques based on the mechanism of natural selection and genetics, which aim to imitate living beings in order to solve difficult optimization problems with high complexity and undesirable structure. Unlike traditional point-to-point search techniques, such as local search, a GA starts from a set of random solutions called the population (P). Each individual solution in the population is called a chromosome (S). At each generation (t), the GA performs genetic operations, crossover and mutation, on randomly selected chromosomes to yield offspring (C) and produce the next generation. Within each generation, an evolution operation, called selection, is applied to the chromosomes of parents and offspring in order to evolve these into a subset with better fitness (F(S)). From generation to generation, the chromosomes in the population eventually converge, and the best chromosome is found. As GAs are population-based, derivative-free, multiple-directional search techniques, they are admired for their robustness and efficiency. In the proposed SVM-GA implementation, an ordered set of S is used as the solution representation, so that the phenotype and genotype of this GA are the same; the one-cut-point OX (order) crossover operation and a one-gene mutation method with a necessary repairing process are used as genetic operations; and roulette wheel selection together with a fitness scaling method is used as the evolution operation [Gen and Cheng, 2000]. The adopted fitness scaling method can be formulated as

$$L = \min_{S \in P} F(S) - 1, \qquad \bar{F}(S) \leftarrow (F(S) - L)^d,$$

where $d \geq 1$ is a power factor used to distinguish the fitness difference of chromosomes and $\bar{F}$ is the scaled fitness used in roulette wheel selection. An elite set E is used to collect the best chromosome for each generation. We define a premature GA as follows: the population is dominated by a single chromosome (for example, more than 80%) among successive generations (for example, five generations). When a premature condition is detected in GA searching, a new population is created by drawing the solutions from the elite set E. Some promising solutions can be recombined to yield a different population to prevent a premature GA. As GAs do not have the ability to perform an in-depth local search, SVM-Local will be applied to the final results of GA to ensure better solutions.
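The fitness scaling, roulette wheel selection, and premature-convergence test described above can be sketched as follows. The sketch makes several assumptions that are ours rather than the paper's: fitness values are accuracies in percent, the offset is read as $L = \min F(S) - 1$, selection samples with replacement, and a GA counts as premature when the chromosome that dominates the latest generation holds more than the given share in each of the last few generations.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Chromosome = Tuple[int, ...]


def scale_fitness(raw: Sequence[float], d: float = 1.2) -> List[float]:
    """Scaled fitness (F(S) - L)**d with L = min F(S) - 1 and d >= 1."""
    low = min(raw) - 1.0
    return [(f - low) ** d for f in raw]


def roulette_select(population: Sequence[Chromosome], raw: Sequence[float],
                    k: int, rng: random.Random,
                    d: float = 1.2) -> List[Chromosome]:
    """Draw k chromosomes with probability proportional to scaled fitness."""
    weights = scale_fitness(raw, d)
    return rng.choices(list(population), weights=weights, k=k)


def is_premature(history: Sequence[List[Chromosome]], share: float = 0.8,
                 generations: int = 5) -> bool:
    """True if one chromosome exceeds `share` of each of the last generations."""
    if len(history) < generations:
        return False
    recent = history[-generations:]
    dominant, _ = Counter(recent[-1]).most_common(1)[0]
    return all(pop.count(dominant) > share * len(pop) for pop in recent)


population = [(1, 2), (3, 4), (5, 6)]
accuracies = [55.0, 57.0, 60.0]          # illustrative F(S) values in percent
scaled = scale_fitness(accuracies)        # worst chromosome scales to 1.0
picked = roulette_select(population, accuracies, k=4, rng=random.Random(1))
stuck = [[(1, 2)] * 9 + [(3, 4)] for _ in range(5)]  # 90% identical, 5 times
```

Subtracting $L$ keeps every weight positive, and a larger power factor $d$ widens the gap between good and poor chromosomes, so selection pressure can be tuned without changing the underlying accuracy measure.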

SIMULATIONS

We used the simulated data generated by Ritchie et al. [2003a] to evaluate the power of the proposed SVM approach for detecting gene-gene interactions. The simulated data can be downloaded from the following website: http://chgr.mc.vanderbilt.edu/ritchielab/method.php?method=mdr. Briefly, each dataset contains 10 SNPs and is balanced with 200 cases and 200 controls. Six different two-locus epistasis models were considered. All of the models were generated in the absence of any main effects. In the first model, a high risk of disease is dependent on inheriting a heterozygous genotype from one locus or a heterozygous genotype from a second locus, but not both. In the second model, the high risk of disease is dependent on inheriting exactly two high-risk alleles from two different loci. The remaining four models were generated with the use of the epistasis model discovery method developed by Moore et al. [2002], using allele frequencies of p = 0.25 and q = 0.75 for models 3 and 4, and allele frequencies of p = 0.1 and q = 0.9 for models 5 and 6. They developed a genetic algorithm approach to discovering complex genetic models where two SNPs influence disease risk only through nonlinear interactions. The penetrance functions and allele frequencies for the SNPs can be found in Figure 2 in Ritchie et al. [2003a]. Under each model, the data

SVM-GA algorithm

1: $t \leftarrow 0$; initialize $P(t)$; evaluate $P(t)$
2: while (not termination condition) do
       apply crossover and mutation to $P(t)$ to yield $C(t)$ and evaluate $C(t)$
       execute roulette wheel selection to produce $P(t+1)$ from $P(t)$ and $C(t)$
       $E = E \cup \arg\max_{S \in P(t) \cup C(t)} F(S)$
       if GA is premature then generate $P(t+1)$ from $E$
       $t \leftarrow t + 1$
3: perform SVM-Local to some promising solutions in $E$


were simulated in the absence or presence of noise due to genotyping errors, missing data, phenocopy, and genetic heterogeneity. One hundred datasets with different combinations of sources of noise were generated. As this was balanced data with a relatively small sample size, an unmodified SVM with an exhaustive search was used to identify the two functional interacting loci.
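As a rough illustration of the simulation setup, the sketch below encodes the high-risk rule of the first epistasis model (heterozygous at exactly one of the two loci) and an exhaustive search over all pairs of 10 SNPs. The function names and `toy_fitness` are hypothetical; in the study the score of a pair would be the cross-validated SVM accuracy, and the actual penetrance values are those of Ritchie et al. [2003a], not anything defined here.

```python
from itertools import combinations
from typing import Callable, Tuple

Pair = Tuple[int, int]


def model1_high_risk(het_locus1: bool, het_locus2: bool) -> bool:
    """High risk when exactly one of the two loci is heterozygous."""
    return het_locus1 != het_locus2


def exhaustive_pair_search(fitness: Callable[[Pair], float],
                           m: int = 10) -> Pair:
    """Score every pair of SNPs from {1..m} and return the best one."""
    return max(combinations(range(1, m + 1), 2), key=fitness)


def toy_fitness(pair: Pair) -> float:
    """Hypothetical score that peaks at the functional pair (3, 7)."""
    return float(len(set(pair) & {3, 7}))


n_pairs = len(list(combinations(range(1, 11), 2)))  # 45 candidate pairs
best_pair = exhaustive_pair_search(toy_fitness)
```

With only 45 candidate pairs per dataset, an exhaustive search is cheap here, which is why no heuristic search is needed for the simulated data.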

AN ILLUSTRATIVE EXAMPLE

Data source. The testing data resource has been described in detail by Xu et al. [2005], including the study population, recruitment methods, and genotyping methods. In brief, this is a case-control prostate cancer study population that was collected in Sweden. Fifty-seven SNPs located in 18 genes were genotyped among 1,355 case and 765 control subjects (Table I). The rate of missing values for each SNP was less than 5%. Some analyses were performed in a balanced dataset in which 590 controls were randomly selected from the control pool and duplicated to obtain 1,355 controls [Xu et al., 2005].

Overall processing. Note that, similar to most other data mining techniques, SVMs cannot handle missing values, and our pilot computational experiments show that accuracy bias does occur when working with unbalanced data and missing values. Thus, it is necessary to remove all samples with missing values. As a result, a lot of information is also lost. In order to take full advantage of the data under investigation, a sequence of nonmissing data subsets with respect to the considered SNPs is constructed for computational purposes. For instance, when we consider the interaction among SNP #1, SNP #3, SNP #5 and SNP #7, a corresponding nonmissing data subset that consists of these four SNPs is created from the entire set of data. It should be noted that the nonmissing data subsets created from a balanced dataset could become unbalanced, as illustrated in Figure 3. The ranges of case-control ratios are [0.965, 1.033] for one SNP, [0.945, 1.053] for two SNP combinations, [0.931, 1.068] for three SNP combinations, and [0.921, 1.084] for four SNP combinations. This suggests that the resulting influence of removing missing values from unbalanced data is stronger when more SNPs are involved. As a result, those nonmissing data subsets are always considered as unbalanced in this paper.
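The construction of a nonmissing data subset for a given SNP combination can be sketched as below. This is an illustration only: the data layout (one dictionary per subject), the sentinel value used for a missing genotype, and the toy records are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

MISSING = 2  # assumed sentinel code for a missing genotype
Row = Dict[int, int]  # SNP ID -> genotype code


def nonmissing_subset(genotypes: Sequence[Row], labels: Sequence[int],
                      snps: Sequence[int]) -> Tuple[List[Row], List[int]]:
    """Keep only subjects with no missing value among the selected SNPs."""
    kept = [(row, y) for row, y in zip(genotypes, labels)
            if all(row[s] != MISSING for s in snps)]
    return [row for row, _ in kept], [y for _, y in kept]


def case_control_ratio(labels: Sequence[int]) -> float:
    """Number of cases (label 1) divided by number of controls (label -1)."""
    cases = sum(1 for y in labels if y == 1)
    controls = sum(1 for y in labels if y == -1)
    return cases / controls


# Toy balanced data: 4 cases, 4 controls; one control lacks SNP 3.
rows = [{1: 0, 3: -1, 5: 1, 7: 0} for _ in range(8)]
rows[5] = {1: 0, 3: MISSING, 5: 1, 7: 0}
labels = [1, 1, 1, 1, -1, -1, -1, -1]

_, labels_one = nonmissing_subset(rows, labels, snps=[1])
_, labels_four = nonmissing_subset(rows, labels, snps=[1, 3, 5, 7])
ratio_one = case_control_ratio(labels_one)    # all 8 subjects kept
ratio_four = case_control_ratio(labels_four)  # one control dropped
```

Because the filter depends on which SNPs are selected, a dataset that is balanced overall can yield a subset with a ratio different from 1, and the drift tends to grow as more SNPs enter the combination.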

In this study, the SNP calls at each site were converted to numeric values based on the following assignments: homozygous for the major allele is coded as −1, heterozygous is coded as 0, homozygous for the minor allele is coded as 1, and missing values are coded as 2; individual control and case outputs are coded as −1 and 1, respectively. As SVM is capable of performing nonlinear separation, this SNP coding will not affect the outcome of SVM. We selected this coding for the purpose of handling missing values. To evaluate the performance of the proposed approaches, we define the accuracy of the cases as (number of cases classified as case)/(total number of cases) and the accuracy of the controls as (number of controls classified as control)/(total number of controls). The accuracy of the average is defined as (number of cases classified as case +

TABLE I. Genes and SNPs included in this study

ID  Gene (SNP)            ID  Gene (SNP)                 ID  Gene (SNP)
1   MIC1 (rs1058587)      20  IL6 (rs1800796)            39  TLR7 (rs179019)
2   IL-10 (rs1800872)     21  IL6 (rs1800797)            40  TLR8 (rs1548731)
3   IL-10 (rs3024509)     22  IRAK1 (rs1059703)          41  TLR8 (rs4830806)
4   IL-10 (rs3024505)     23  IRAK1 (rs30278898)         42  TLR8 (rs5744068)
5   IL-10 (rs1800869)     24  IRAK4 (rs4251571)          43  TLR9 (rs187084)
6   IL-10 (rs1554286)     25  IRAK4 (rs4251487)          44  MyD88 (rs4988453)
7   IL-1RN (rs878972)     26  IRAK4 (rs4251545)          45  TIRAP (rs4251431)
8   IL-1RN (rs315934)     27  TLR4 (IIPGA-TLR4-15844)    46  TIRAP (TIRAP_14115)
9   IL-1RN (rs3087263)    28  TLR4 (IIPGA-TLR4-2856)     47  TIRAP (TIRAP_17678)
10  IL-1RN (rs315951)     29  TLR4 (rs4986790)           48  TLR3 (rs5743305)
11  COX2 (rs2745557)      30  TLR4 (rs1927914)           49  TLR5 (IIPGA-TLR5-5187)
12  COX2 (rs5275)         31  TLR4 (IIPGA-TLR4-14078)    50  TLR5 (rs2072493)
13  COX2 (rs4648276)      32  TLR4 (IIPGA-TLR4-18208)    51  TLR5 (rs5744174)
14  COX2 (rs689470)       33  TLR4 (rs7873784)           52  TLR5 (rs1053954)
15  COX2 (rs20432)        34  TLR1 (rs5743604)           53  TLR7 (rs179008)
16  IL6 (rs1474348)       35  TLR2 (rs3804100)           54  TNF (rs2799724)
17  IL6 (rs2069845)       36  TLR3 (rs3775296)           55  TNF (rs3093662)
18  IL6 (rs2069860)       37  TLR3 (rs5743313)           56  TNF (rs3093664)
19  IL6 (rs1800795)       38  TLR7 (rs2302267)           57  TNF (rs3093665)

SNP, single nucleotide polymorphism.


number of controls classified as control)/(total number of cases + total number of controls) and the balanced accuracy is defined as the average of the accuracy of the cases and controls (shown in parentheses in the cell of average accuracy).
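The four accuracy measures defined above can be computed from two counts per class. In the sketch below, the class sizes (1,355 cases and 765 controls) come from the study, while the numbers of correctly classified subjects are hypothetical values chosen only to show how the average and the balanced accuracy diverge on unbalanced data.

```python
from typing import Dict


def accuracy_summary(case_hits: int, n_cases: int,
                     control_hits: int, n_controls: int) -> Dict[str, float]:
    """Case, control, average, and balanced accuracy as fractions."""
    case_acc = case_hits / n_cases
    control_acc = control_hits / n_controls
    return {
        "case": case_acc,
        "control": control_acc,
        # Pooled over all subjects, so the larger class weighs more.
        "average": (case_hits + control_hits) / (n_cases + n_controls),
        # Mean of the two per-class accuracies, independent of class sizes.
        "balanced": (case_acc + control_acc) / 2,
    }


# Hypothetical hit counts on the unbalanced data (1,355 cases, 765 controls).
unbalanced = accuracy_summary(847, 1355, 338, 765)

# With equal class sizes the two summary measures coincide.
balanced = accuracy_summary(60, 100, 40, 100)
```

When cases outnumber controls and are classified more accurately, the average accuracy is pulled above the balanced accuracy, which is why both values are reported side by side in the result tables.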

RESULTS

RESULTS OF SIMULATED DATA

We included the computational results to demonstrate the power of SVM (see Table II). The power was determined as the proportion of simulated datasets in which the two loci with functional interactions were correctly detected. Like MDR, SVM has excellent power to detect gene-gene interactions even in the presence of 5% genotyping error, 5% missing data, or a combination of both under the different two-locus epistasis models with a variety of allele frequencies. The presence of 50% phenocopies [defined as an environmentally induced mimic of a genetic condition in an individual who lacks the usual causative gene; Mange and Mange, 1994] did not affect power for models 1 and 2,

TABLE II. Power of MDR and SVM to correctly detect two loci that have functional interactions

                 Model 1     Model 2     Model 3     Model 4     Model 5     Model 6
Source of noise  MDR   SVM   MDR   SVM   MDR   SVM   MDR   SVM   MDR   SVM   MDR   SVM
None             100   100   100   100   99    100   99    99    82    90    84    92
GE               100   100   100   100   100   100   97    100   80    91    92    93
GH               3     42    41    53    2     35    3     46    4     29    4     32
PC               90    90    99    98    45    41    32    42    30    44    32    46
MS               100   100   100   100   99    99    97    100   82    90    87    96
GE+GH            4     54    41    55    2     36    3     42    4     31    6     37
GE+PC            94    87    99    99    41    47    48    43    28    48    33    51
GE+MS            100   100   100   100   98    100   98    99    74    88    84    92
GH+PC            0     21    1     30    0     9     0     7     0     10    0     9
GH+MS            5     47    38    55    0     35    2     37    4     28    6     34
PC+MS            96    91    99    99    42    54    43    49    14    27    16    27
GE+GH+PC         1     18    1     41    0     13    0     6     0     10    0     10
GE+GH+MS         6     49    34    58    2     36    1     44    3     24    7     25
GH+PC+MS         0     18    0     36    0     9     0     10    0     10    0     11
GE+PC+MS         94    86    100   100   48    40    42    50    18    31    16    32
GE+GH+PC+MS      0     18    1     29    0     6     1     8     0     6     0     10

Data description: 10 SNPs, 200 cases and 200 controls, 100 datasets in each test; results of MDR are from Ritchie et al. [2003a].
GE, 5% genotyping error; GH, 50% genetic heterogeneity; PC, 50% phenocopy; MS, 5% missing data (both 5% of cases and control are removed); MDR, multifactor-dimensionality reduction; SVM, support vector machine.

Fig. 3. Histogram plots for the case-control ratio of non-missing data subsets generated from one to four single nucleotide polymorphism combinations.


but did for models 3–6. The presence of 50% genetic heterogeneity [defined as multiple genetic causes of the same, or nearly the same, phenotype; Mange and Mange, 1994] did reduce power in all of the models. However, SVM still provides better power compared to MDR. In summary, SVM outperforms MDR under all the scenarios for models 5–6 and generally outperforms it for models 1–4.

EXAMPLE RESULTS BASED ON REAL DATA

Results of SVM-RFE and SVM-RFA. The training and testing results (average accuracy) of SVM-RFE and SVM-RFA are shown in Figure 4, and the candidate models for 1–5 SNP combinations are presented in Table III. Because both approaches adopt a similar greedy strategy, they provide similar suboptimal models. As one might expect, the results demonstrate that SVM-RFE can provide higher accuracy than SVM-RFA for the present data when more than four SNPs are involved. However, compared with SVM-RFE, SVM-RFA is more resistant to overfitting for the present data when a small number (<20) of SNPs are involved, because the accuracy difference between training and testing is smaller. Also as expected, both SVM-RFE and SVM-RFA results are of a lower quality compared with the other approaches (Tables IV, V, VI). Note that the training curves increase almost steadily while the testing curves first increase and then decrease with respect to the number of SNPs used. These types of tendencies in the training and testing curves might be useful for estimating the maximum number of SNPs that should be considered in modeling interactions, especially when the data contains a considerable number of SNPs; however, this model selection problem is beyond the scope of this paper.

Results of SVM-Local and SVM-GA. The results presented in Tables IV through XII are the average of 10 runs of 10-fold cross-validation. In the column labeled SNP ID, some cells include numbers in parentheses, which refer to the number of occurrences/number of runs. We performed 10 runs of local search when SNPs were considered three at a time and 20 runs when four and five SNPs were considered. The results of the local search are presented in Table IV. SVM-GA was performed with the following GA parameter settings: population = 30, generation = 200, crossover rate = 0.6, mutation rate = 0.4, and power factor = 1.2. As the number of generations was small (for computational efficiency),

TABLE III. Results from SVM-RFA and SVM-RFE

                              Training accuracy (%)             Testing accuracy (%)
#SNP  SNP ID                  Case    Control  Average          Case    Control  Average
1     34                      69.87   33.79    51.96 (51.83)    70.06   34.26    52.04 (52.16)
      46                      100     0        49.93 (50.00)    100     0        49.93 (50.00)
2     34 49                   55.71   52.30    54.03 (54.00)    55.39   52.20    53.88 (53.79)
      5 46                    59.21   50.55    54.86 (54.88)    58.75   50.52    54.68 (54.64)
3     34 49 14                54.44   56.31    55.37 (55.37)    54.12   55.81    55.00 (54.96)
      43 5 46                 57.62   55.36    56.48 (56.49)    56.19   53.94    55.02 (55.07)
4     34 49 14 36             55.60   57.16    56.38 (56.38)    54.97   56.54    55.82 (55.75)
      49 43 5 46              59.36   58.42    58.90 (58.89)    54.84   55.27    54.95 (55.08)
5     34 49 14 36 1           63.10   56.84    60.25 (60.27)    60.11   54.02    57.04 (57.06)
      1 49 43 5 46            62.83   63.69    63.26 (63.26)    57.01   57.82    57.37 (57.41)

There are two corresponding rows for each #SNP cell. The results in the first and second rows are produced by SVM-RFA and SVM-RFE, respectively.
SVM-RFA, recursive feature addition SVM; SVM-RFE, recursive feature elimination SVM; SVM, support vector machine; SNP, single nucleotide polymorphism.

[Figure 4, "SVMRFE vs SVMRFA": average accuracy (vertical axis, 50 to 100) plotted against the number of SNPs (horizontal axis, 1 to 55) for four curves: RFE-Training, RFE-Testing, RFA-Training, RFA-Testing.]

Fig. 4. Results from recursive feature elimination SVM and recursive feature addition SVM.


GA did not reach a premature endpoint in this study. The results using SVM-GA are shown in Table V. The MDR software used in this study is version v1.0.0rc1, which has the ability to handle unbalanced data. We executed three runs of MDR with different random seeds to search for the best candidate model from one to five SNP combinations. Then we used the best candidate models to perform 10 runs of 10-fold cross-validation. The results are shown in Table VI.

While considering lower order interactions, the chromosome coding length in GA was too small to allow the mutation operation to run as expected. Moreover, the parameters of population and generation in GA were set relatively small in this study, and as a result, the performance of SVM-Local is better than that of SVM-GA. The proposed approaches (SVM-Local and SVM-GA) do not perform as well as MDR when considering lower order (1–3) interactions in terms of both training and testing accuracy. However, when considering fourth- or fifth-order interactions, the proposed approaches provide comparable results. Note that in MDR analysis the difference between training and testing accuracy increases dramatically with respect to the number of SNPs considered. This suggests that MDR might have an overfitting problem (a gap of >8%) when considering higher order interactions (five SNPs), even though it can provide better testing accuracy (60.16%). However, differences between training and testing accuracy for SVM methods are never greater than 3%, which demonstrates that the proposed approach is resistant to overfitting.
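The overfitting comparison rests on simple differences between average training and testing accuracy. The short sketch below recomputes those gaps for the five-SNP models, using the average accuracies reported in Tables IV, V, and VI for the balanced data; the dictionary layout and names are ours.

```python
# (average training accuracy, average testing accuracy) in percent,
# taken from the five-SNP rows of Tables IV, V, and VI (balanced data).
five_snp_results = {
    "SVM-Local": (61.83, 59.25),
    "SVM-GA": (61.99, 59.13),
    "MDR": (68.23, 60.16),
}

# Training-minus-testing gap in percentage points, rounded to two decimals.
gaps = {method: round(train - test, 2)
        for method, (train, test) in five_snp_results.items()}

for method, gap in gaps.items():
    print(f"{method}: gap = {gap:.2f} percentage points")
```

The MDR gap of about 8 points against gaps below 3 points for the SVM methods is the basis for the overfitting remark, even though MDR's testing accuracy for five SNPs is slightly higher.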

The same data, but without control duplication, was used to test the performance of the proposed

TABLE IV. Results from SVM-Local analysis of balanced data

                              Training accuracy (%)             Testing accuracy (%)
#SNP  SNP ID                  Case    Control  Average          Case    Control  Average
1*    1                       61.57   44.24    48.55 (52.86)    61.58   44.21    48.54 (52.90)
      34                      42.10   65.06    53.51 (53.58)    42.11   65.10    53.52 (53.61)
2*    27 46                   54.84   53.89    54.13 (54.37)    54.64   54.18    54.64 (54.41)
      1 49                    59.86   50.49    55.17 (55.17)    59.88   50.44    55.15 (55.16)
3     7 34 46 (2/10)          57.49   55.08    55.69 (56.30)    56.48   54.46    55.52 (55.47)
                              57.75   54.80    55.56 (56.31)    57.15   54.26    55.76 (55.01)
4     1 36 43 49 (2/20)       59.66   59.90    59.84 (59.78)    58.17   58.82    58.51 (58.67)
                              54.17   65.49    59.91 (59.83)    51.35   63.35    57.35 (57.35)
5     34 41 45 50 51 (1/20)   60.70   62.21    61.83 (61.45)    60.07   58.49    59.25 (59.28)
                              61.31   61.61    61.54 (61.46)    58.90   59.28    59.05 (59.17)

*Exhaustive searches are only performed in one and two SNP combinations.
The statistics for each combination illustrated in the first row were calculated using SVM with accuracy adjustment, while those in the second row are done without accuracy adjustment.
SVM-Local, SVM with local search; SVM, support vector machine; SNP, single nucleotide polymorphism.

TABLE V. Results from SVM-GA analysis of balanced data

                              Training accuracy (%)             Testing accuracy (%)
#SNP  SNP ID                  Case    Control  Average          Case    Control  Average
4*    1 36 43 49              59.66   59.90    59.78 (59.84)    58.17   58.82    58.51 (58.67)
5     1 27 36 43 49           61.76   62.21    61.99 (61.99)    58.86   59.31    59.13 (59.09)

*The best-found candidate model is produced by GA + Local.
SVM-GA, SVM with genetic algorithm; SVM, support vector machine; SNP, single nucleotide polymorphism.

TABLE VI. Results from MDR analysis of balanced data

                              Training accuracy (%)       Testing accuracy (%)
#SNP  SNP ID                  Case    Control  Average    Case    Control  Average
1     34 (3/3)                41.11   66.27    53.69      41.11   66.27    53.69
2     1 49 (3/3)              59.80   51.50    55.65      59.71   51.19    55.45
3     1 36 49 (3/3)           53.36   62.69    58.02      51.55   60.95    56.26
4     5 7 46 49 (2/3)         63.39   60.54    62.20      59.85   56.41    58.19
5     1 5 12 17 49 (2/3)      68.21   68.24    68.23      59.37   60.91    60.16

MDR, multifactor-dimensionality reduction; SNP, single nucleotide polymorphism.


approach when working with unbalanced data. In this case, the data contains 1,355 cases and 765 controls. We adopted the same aforementioned computational procedure (SVM-Local, SVM-GA, and MDR) to analyze this unbalanced data. The computational results (Tables VII, VIII, and IX) demonstrate that the proposed approaches (SVM-Local or SVM-GA) outperformed MDR in their capability to detect gene-gene interactions. It should be noted that the proposed approaches not only provide higher accuracy candidate models but also provide more stable and reasonable results. Furthermore, the proposed approach is resistant to overfitting, even when considering higher order interactions.

DISCUSSION

In this study, an SVM approach was proposed to detect gene-gene interaction. The simulation results show that SVM is a powerful tool under several scenarios in the absence of genetic heterogeneity. As genetic heterogeneity may be common for complex diseases, the extension of SVM to tackle this issue is important. Meanwhile, one sensible approach is to divide the sample into more homogeneous groups for analysis [Liang et al., 2003]. Further, the real data analyses demonstrate that the proposed approach is comparable to the most popular approach, MDR. However, in order to present the advantages and disadvantages of the proposed approach, it is necessary to understand the fundamental differences between the proposed SVM approach and MDR. Below we provide more detailed comparisons showing that our proposed approach is less susceptible to overfitting, better able to handle unbalanced data, and able to achieve more stable models.

It should be noted that the best candidate models found by the proposed approaches presented in Table X are not radically different for either the balanced (duplicated) or the unbalanced (original) data. Moreover, the MDR approach provides the same best candidate models for two, three and five

TABLE VII. Results from SVM-Local analysis of unbalanced data

                              Training accuracy (%)             Testing accuracy (%)
#SNP  SNP ID                  Case    Control  Average          Case    Control  Average
3     1 44 49 (1/10)          56.73   56.50    56.65 (56.41)    55.90   54.76    55.48 (55.33)
4     1 31 44 49 (1/20)       58.21   55.80    57.35 (57.01)    57.51   54.70    56.53 (56.11)
5     2 3 10 17 54 (1/20)     59.48   59.67    59.55 (59.58)    57.39   55.88    56.81 (56.63)

SVM-Local, SVM with local search; SVM, support vector machine; SNP, single nucleotide polymorphism.

TABLE VIII. Results from SVM-GA searches of unbalanced data

                              Training accuracy (%)             Testing accuracy (%)
#SNP  SNP ID                  Case    Control  Average          Case    Control  Average
4*    1 31 44 49              58.21   55.80    57.35 (57.01)    57.51   54.70    56.53 (56.11)
5*    2 3 15 36 46            60.91   58.97    60.20 (59.94)    58.86   55.31    57.56 (57.08)

*The best-found candidate model is produced by GA + Local.
SVM-Local, SVM with local search; SVM, support vector machine; SNP, single nucleotide polymorphism.

TABLE IX. Results from MDR analysis of unbalanced data

                              Training accuracy (%)             Testing accuracy (%)
#SNP  SNP ID                  Case    Control  Average          Case    Control  Average
1     1 (3/3)                 62.51   44.18    55.90 (53.35)    62.50   44.18    55.90 (53.35)
2     1 49 (3/3)              59.53   52.71    57.07 (56.12)    59.14   52.21    56.63 (55.67)
3     1 36 49 (3/3)           49.08   67.77    55.82 (58.41)    47.32   64.42    53.49 (55.87)
4     5 10 12 43 (2/3)        60.05   57.88    59.26 (58.96)    55.52   48.31    52.91 (51.91)
5     1 5 12 17 49 (3/3)      65.47   70.85    67.41 (68.16)    55.80   56.51    56.04 (56.15)

MDR, multifactor-dimensionality reduction; SNP, single nucleotide polymorphism.


SNP combinations among both the balanced and unbalanced data (Table VI for balanced data and Table IX for unbalanced data). The results in Table X also show that, while dealing with unbalanced data, an enhanced search (more local searches or a larger population and more generations in GA) is needed to achieve good results.

MDR demonstrates its computational efficiency when fewer than five SNPs are considered. However, in our experience with the present data (more than 2,000 individuals and approximately 60 SNPs), an exhaustive search for all five SNP combinations requires approximately 6 days to run to completion on a SUN Fire V880 SPARC Server (Santa Clara, CA) using 1 CPU. It would take more than 1 month to complete an exhaustive search considering interactions among six SNPs. Roughly speaking, the time required for MDR to search for the best models increases exponentially with respect to the number of SNPs considered. An alternative option for the current MDR implementation is to perform a considerable number of random searches to detect high-order interaction, although as a result, the outcomes might be unstable and unpredictable. Compared with MDR, the proposed approaches are much more computationally expensive when dealing with low-order interactions (fewer than six SNPs). However, the computational time required to search for the best models increases roughly polynomially with respect to the number of SNPs considered. On average, most of the local searches require at most 6 hr to obtain a local optimum. For balanced data, GA requires 3 and 4 days to search for the best four and five SNP combinations, respectively, and for unbalanced data it requires 5 and 8 days, respectively. However, a global optimum is not guaranteed when using such heuristics.
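The growth of the exhaustive search can be checked with a short calculation. The sketch assumes the 57 SNPs of Table I and, as a simplification of ours, that running time is proportional to the number of SNP combinations examined; it also shows the per-iteration neighborhood size $n(m - n)$ of the local search for comparison.

```python
from math import comb

m = 57  # number of SNPs in the study (Table I)

five_way = comb(m, 5)   # number of five-SNP combinations
six_way = comb(m, 6)    # number of six-SNP combinations

# If five-SNP search takes about 6 days and time scales with the number of
# combinations, the six-SNP search would take roughly this many days.
days_for_six = 6 * six_way / five_way

# Local search examines n * (m - n) neighbors per iteration instead.
neighborhood_five = 5 * (m - 5)

print(f"{five_way:,} five-SNP combinations")
print(f"{six_way:,} six-SNP combinations")
print(f"about {days_for_six:.0f} days for an exhaustive six-SNP search")
print(f"{neighborhood_five} neighbors per local search step for n = 5")
```

Under that proportionality assumption the six-SNP search comes out at about 52 days, consistent with the "more than 1 month" estimate, while a local search step for five SNPs evaluates only 260 candidate subsets.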

Because the SVM approach cannot handle missing values, it is not appropriate to directly compare it with MDR. Two cross-examinations, using one approach to analyze the best candidate models produced by the other approach and vice versa (as shown in Tables XI and XII), were performed to explore the fundamental differences between the proposed SVM approach and MDR. Note that approximately 90% of the data remain for analyses and the case-control ratios are not radically different after removing missing value data. Thus, it is fair to say that the data are essentially the same before and after the removal of missing values.

When considering the average (balanced) training/testing accuracy in Table XI, it seems that MDR does not make a significant difference while handling data with or without missing values. When comparing the power of classification for these same SNP combinations (the best candidate models from the MDR analysis), MDR performs better than SVM in the analysis of balanced data, while SVM does better in an analysis of unbalanced data except in dealing with five SNP combinations. Note that some of the differences in accuracy between training and testing or cases and controls in Table XI are relatively larger than those in Tables IV, V, VII, and VIII. In such cases, some of the best candidate models selected by MDR are not reasonable and are suspected to be a result of overfitting. Furthermore, MDR can provide totally opposite prediction models for data with versus without missing values. For instance, the testing accuracy among case and control data for the best four SNPs model (unbalanced data) is 55.52 and 48.31, respectively; however, this changes dramatically, to 51.01 and 59.52, respectively, when missing values are removed. A similar situation was observed among the best candidate models for five SNPs based on balanced data. It should be noted that the current implementation of MDR partitions data by categorizing all missing values as an additional genotype category, and uses the case-control ratio to classify these into high- or low-risk group membership. This approach of handling missing data and the high-low risk classification can lead MDR to become very sensitive to missing values and can sometimes result in the MDR prediction model becoming unreliable.
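The sensitivity to missing values can be made concrete with a simplified sketch of ratio-based cell labeling. This is not the MDR software's code: the function, the tie rule (a cell at exactly the threshold is called high risk), and the toy data are assumptions for illustration, with missing genotypes carried as their own category through a sentinel code.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

MISSING = 2  # assumed sentinel code; treated as one more genotype category
Cell = Tuple[int, ...]


def label_cells(genotypes: Sequence[Cell], labels: Sequence[int],
                threshold: float) -> Dict[Cell, bool]:
    """Mark each multilocus genotype cell as high risk (True) or low risk."""
    cases = Counter(g for g, y in zip(genotypes, labels) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, labels) if y == -1)
    # cases >= threshold * controls avoids dividing by an empty control count.
    return {cell: cases[cell] >= threshold * controls[cell]
            for cell in set(cases) | set(controls)}


# Toy two-SNP data: 6 cases and 4 controls; two cases lack the first SNP.
genotypes = [(0, 0)] * 4 + [(1, 0)] * 4 + [(MISSING, 0)] * 2
labels = [1, 1, 1, -1,   1, -1, -1, -1,   1, 1]
threshold = labels.count(1) / labels.count(-1)  # overall case-control ratio

risk = label_cells(genotypes, labels, threshold)
```

In this toy example the cell formed purely by missing values is labeled high risk on the strength of two subjects, so dropping or imputing those subjects could change the fitted model, which is the kind of instability discussed above.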

TABLE X. Best candidate models found by SVM: balanced vs. unbalanced data

                              Balanced data (testing accuracy)  Unbalanced data (testing accuracy)
#SNP  SNP ID                  Case    Control  Average          Case    Control  Average
3     7 34 46                 56.48   54.46    55.52 (54.99)    57.08   54.54    56.19 (55.81)
      1 44 49                 51.62   52.61    52.24 (52.11)    55.90   54.76    55.48 (55.33)
4     1 36 43 49              58.17   58.82    58.51 (58.67)    57.31   56.76    57.08 (57.03)
      1 31 44 49              54.80   56.90    55.95 (55.84)    57.51   54.70    56.53 (56.11)
5     34 41 45 50 51          60.07   58.49    59.25 (58.87)    58.15   56.67    57.63 (57.41)
      2 3 15 36 46            57.74   56.73    57.23 (57.24)    58.86   55.31    57.56 (57.08)

The best solutions from original analysis are highlighted in bold. The others are performed for the purpose of comparison (balanced vs. unbalanced).
SVM, support vector machine; SNP, single nucleotide polymorphism.


Although we used MDR to analyze the best candidate models produced by the proposed SVM approach, it is not clear which method has better power of classification for both the balanced and unbalanced data (in Table XII). However, by considering all case, control, and average accuracies, the SVM approach provides better and more reasonable results than MDR in most scenarios. The MDR results for the balanced data in Table XII are not as good as those in Table XI but they are comparable, and when considering unbalanced data they are better. Moreover, the MDR results in Table XII are more reasonable than those in Table XI, and MDR exhibits no overfitting for the best models selected

TABLE XII. Results from MDR analyses to confirm the best candidate models from the SVM approach

                              Training accuracy                 Testing accuracy
Data        SNP ID            Case    Control  Average          Case    Control  Average
Balanced    7 34 46           57.49   55.08    55.69 (56.30)    56.48   54.46    55.52 (55.47)
            (91.6%, 1.04)     57.18   55.40    56.68 (56.28)    56.15   54.50    55.35 (55.33)
            1 36 43 49        59.66   59.90    59.84 (59.78)    58.17   58.82    58.51 (58.67)
            (91.5%, 0.97)     56.85   63.04    59.98 (59.95)    53.24   59.56    56.47 (56.40)
            34 41 45 50 51    60.70   62.21    61.83 (61.45)    60.07   58.49    59.25 (59.28)
            (93.8%, 1.03)     59.11   64.50    61.58 (61.80)    56.57   61.71    59.11 (59.14)
Unbalanced  1 44 49           56.73   56.50    56.65 (56.41)    55.90   54.76    55.48 (55.33)
            (94.7%, 1.80)     60.34   53.16    57.97 (56.75)    59.68   51.14    56.63 (55.41)
            1 31 44 49        58.21   55.80    57.35 (57.01)    57.51   54.70    56.53 (56.11)
            (94.6%, 1.80)     60.55   53.70    58.27 (57.12)    59.82   52.01    57.01 (55.92)
            2 3 15 36 46      60.91   58.97    60.20 (59.94)    58.86   55.31    57.56 (57.08)
            (90.7%, 1.75)     62.29   58.22    60.72 (60.26)    58.68   52.86    56.56 (55.77)

There are two corresponding rows for each SNP ID cell. The results in the first and second rows were produced by SVM and MDR, respectively. In the column labeled "SNP ID", the numbers in parentheses refer to the remaining percentage of data after removing corresponding missing values and the resulting case–control ratio, respectively.
MDR, multifactor-dimensionality reduction; SNP, single nucleotide polymorphism; SVM, support vector machine.

TABLE XI. Results from SVM approach analyses to confirm the best candidate models from MDR analysis

                              Training accuracy                 Testing accuracy
Data        SNP ID            Case    Control  Average          Case    Control  Average
Balanced    1 36 49           53.36   62.69    58.02            51.55   60.95    56.26
            (93.8%, 0.98)     51.99   62.32    57.53 (57.16)    50.93   61.22    56.12 (56.08)
                              54.15   54.86    54.51 (54.50)    52.12   53.41    52.79 (52.76)
            5 7 46 49         63.39   60.54    62.20            59.85   56.41    58.19
            (90.9%, 1.04)     60.07   60.92    60.49 (60.49)    55.91   57.54    56.71 (56.72)
                              59.78   60.44    60.10 (60.11)    57.19   57.98    57.53 (57.58)
            1 5 12 17 49      68.21   68.24    68.23            59.37   60.91    60.16
            (89.3%, 1.02)     60.02   71.01    64.98 (65.52)    52.82   65.42    59.11 (59.12)
                              62.60   63.21    62.90 (62.91)    55.83   57.00    56.40 (56.41)
Unbalanced  1 36 49           49.08   67.77    55.82 (58.41)    47.32   64.42    53.49 (55.87)
            (93.5%, 1.75)     49.30   65.94    56.24 (57.62)    47.91   63.55    53.60 (55.73)
                              57.52   56.74    57.23 (57.13)    55.66   52.72    54.57 (54.19)
            5 10 12 43        60.05   57.88    59.26 (58.96)    55.52   48.31    52.91 (51.91)
            (92.2%, 1.75)     54.23   66.50    59.59 (60.37)    51.01   59.52    54.10 (55.27)
                              59.50   59.91    59.65 (59.70)    55.67   52.98    54.67 (54.32)
            1 5 12 17 49      65.47   70.85    67.41 (68.16)    55.80   56.51    56.04 (56.15)
            (89.5%, 1.80)     61.74   67.58    64.72 (64.66)    55.72   55.89    55.77 (55.80)
                              62.21   62.23    62.22 (62.22)    55.94   50.92    54.14 (53.42)

There are three corresponding rows for each SNP ID cell. The results in the first two rows were produced by an MDR analysis of data with and without missing values. The third row is the result of SVM. In the column labeled "SNP ID", the numbers in parentheses refer to the remaining percentage of data after removing corresponding missing values and the resulting case–control ratio, respectively.
SVM, support vector machine; SNP, single nucleotide polymorphism; MDR, multifactor-dimensionality reduction.


by the proposed SVM approach. Note that the optimal search criterion of MDR maximizes the balanced accuracy. In contrast with MDR, the proposed SVM approach is not only able to maximize the average accuracy, but is also able to balance the case-control accuracy. Furthermore, MDR utilizes user-defined decisions to assign partitions with no or equal case-control ratios to either cases or controls, while the SVM approach does not need any user-defined decisions for classification. As a result, the SVM approaches have more power in (function) generalization than does MDR. Thus, the best candidate models selected by the SVM approach are more reasonable and reliable than those selected by MDR.
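The distinction between balanced accuracy and average (overall) accuracy can be made concrete with a short calculation. The sketch below is illustrative only: the function names and the input percentages are assumptions chosen for the example, not values or code from this study. It assumes that balanced accuracy is the unweighted mean of the case and control accuracies, and that overall accuracy weights each class by its share of subjects.

```python
def balanced_accuracy(case_acc, control_acc):
    """Unweighted mean of the per-class accuracies (in percent)."""
    return (case_acc + control_acc) / 2.0


def overall_accuracy(case_acc, control_acc, case_control_ratio):
    """Accuracy over all subjects, weighting each class by its size.

    case_control_ratio is (number of cases) / (number of controls).
    """
    weight_cases = case_control_ratio / (case_control_ratio + 1.0)
    return weight_cases * case_acc + (1.0 - weight_cases) * control_acc


# Hypothetical classifier: 50% accuracy on cases, 70% on controls.
balanced = balanced_accuracy(50.0, 70.0)
overall_equal = overall_accuracy(50.0, 70.0, 1.0)    # equal class sizes
overall_skewed = overall_accuracy(50.0, 70.0, 1.75)  # 1.75 cases per control
```

With equal class sizes the two measures coincide. With more cases than controls, the overall accuracy moves toward the case accuracy, which is why the two measures can rank candidate models differently on unbalanced data.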

There is one limitation in our study. The way we handled the missing data requires the data to be missing completely at random. Further studies that relax this assumption may be warranted. Furthermore, two imputation approaches have recently been developed by Marchini et al. [2007] and Abecasis et al. [please see Scott et al., 2007]. These approaches can impute the missing genotype data, providing another option to avoid the missing data problem.
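The effect of removing subjects with missing genotypes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data layout (one dictionary of genotype codes per subject), the `MISSING` marker, and the function name `complete_cases` are all hypothetical. It returns the percentage of data remaining and the resulting case–control ratio, the two quantities reported in parentheses in the "SNP ID" column of the tables.

```python
MISSING = None  # hypothetical marker for a missing genotype


def complete_cases(genotypes, status, snp_ids):
    """Keep subjects with no missing genotype at the selected SNPs.

    genotypes: list of dicts mapping SNP ID -> genotype code or MISSING
    status: list of 1 (case) / 0 (control), aligned with genotypes
    Returns (kept_genotypes, kept_status, percent_remaining, case_control_ratio).
    """
    kept = [
        (g, s) for g, s in zip(genotypes, status)
        if all(g.get(snp) is not MISSING for snp in snp_ids)
    ]
    kept_genotypes = [g for g, _ in kept]
    kept_status = [s for _, s in kept]
    n_cases = sum(kept_status)
    n_controls = len(kept_status) - n_cases
    percent_remaining = 100.0 * len(kept) / len(genotypes)
    ratio = n_cases / n_controls if n_controls else float("inf")
    return kept_genotypes, kept_status, percent_remaining, ratio


# Toy data: five subjects, three SNPs, two subjects with a missing genotype.
toy_genotypes = [
    {1: 0, 36: 1, 49: 2},
    {1: 1, 36: MISSING, 49: 0},
    {1: 2, 36: 0, 49: 1},
    {1: 0, 36: 2, 49: 1},
    {1: 1, 36: 1, 49: MISSING},
]
toy_status = [1, 1, 0, 0, 1]
_, kept_status, percent_remaining, ratio = complete_cases(
    toy_genotypes, toy_status, snp_ids=[1, 36, 49]
)
```

In this toy example both removed subjects are cases, so the case–control ratio shifts after filtering. If missingness depends on disease status in real data, the missing-completely-at-random assumption is violated in the same way.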

In this study, a reasonable and applicable classifier-optimization framework was considered, and was used to tackle the problem of gene-gene interaction. Accordingly, support vector machine and combinatorial optimization techniques (local search and GA) were tailored to fit in the framework. Although the proposed approach is very computationally expensive, the computational results are quite encouraging and demonstrate several advantages, especially based on benchmarking against MDR. Our approach has less concern regarding overfitting, is better able to handle unbalanced data, and selects more stable models. It is a very promising tool for the identification and characterization of high order gene-gene and gene-environment interactions. The goals of this work are: (1) to evaluate and incorporate a different approach to optimization, complexity analysis, and algorithmic design for gene-gene interaction problems by providing a reasonable and applicable computational framework; and (2) to make the support vector machine and combinatorial optimization techniques more accessible to genetic epidemiology researchers in order to promote the use and extension of these powerful approaches. The SVM software will be available from the author when it is ready. However, in order to fully resolve the problem of gene-gene interaction, it is important to consider the model selection issues: (1) how many SNPs should be considered in "the best model"; and (2) if there are multiple combinations of SNPs (for the best model) with almost the same performance, how to distinguish the difference among those combinations and to further interpret the results for the models. Further studies are warranted.
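To give a feel for the classifier-optimization idea, the following is a generic sketch of a swap-based local search over SNP subsets of fixed size. It is not the tailored algorithm used in this study: the neighborhood definition, the function names, and the toy scoring function are assumptions made for illustration. In practice, the `score` argument would be something like a cross-validated testing accuracy of an SVM trained on the selected SNPs.

```python
import random


def local_search(n_snps, k, score, max_iters=100, seed=0):
    """Swap-based local search for a k-SNP subset maximizing score(subset)."""
    rng = random.Random(seed)
    current = frozenset(rng.sample(range(n_snps), k))
    current_score = score(current)
    for _ in range(max_iters):
        best_neighbor, best_score = None, current_score
        # Neighborhood: replace one selected SNP with one unselected SNP.
        for out_snp in current:
            for in_snp in range(n_snps):
                if in_snp in current:
                    continue
                neighbor = (current - {out_snp}) | {in_snp}
                s = score(neighbor)
                if s > best_score:
                    best_neighbor, best_score = neighbor, s
        if best_neighbor is None:  # no improving swap: local optimum
            break
        current, current_score = best_neighbor, best_score
    return sorted(current), current_score


# Toy score standing in for a classifier's accuracy (hypothetical):
# counts how many of the "true" SNPs {1, 5, 12} are in the subset.
def toy_score(subset):
    return len(subset & {1, 5, 12})


best_subset, best_value = local_search(n_snps=20, k=3, score=toy_score)
```

Each iteration evaluates k × (n_snps − k) candidate subsets, and each evaluation would involve training a classifier, which illustrates why this kind of search is computationally expensive. The model selection issues raised above also appear here: k must be chosen in advance, and several subsets may reach nearly the same score.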

ACKNOWLEDGMENTS

This study was also partially funded by an NCI grant (CA 1R01CA105055-01A1 to J. Xu).

REFERENCES

Alpaydin E. 2004. Introduction to Machine Learning. Cambridge, Massachusetts: MIT.

Chang CC, Lin CJ. 2005. LIBSVM: A library for support vector machines. Software is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cook N, Zee R, Ridker P. 2004. Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Stat Med 23:1439–1453.

Cordell HJ, Clayton DG. 2005. Genetic association studies. Lancet 366:1121–1131.

Cover TC. 1965. Geometrical and statistical properties of system of linear inequalities with applications in pattern recognition. IEEE Trans Electr Comput 14:326–334.

Cristianini N, Shawe-Taylor J. 2000. An introduction to support vector machine. New York: Cambridge University Press.

Gen M, Cheng R. 2000. Genetic algorithms and engineering optimization. New York: Wiley.

Guyon I, Weston J, Barnhill S, Vapnik V. 2002. Gene selection for cancer classification using support vector machine. Machine Learning 46:389–422.

Hornik K, Stinchcombe M, White H. 1989. Multilayer feedforward networks are universal approximators. Neural Netw 2:359–366.

Keerthi SS, Lin CJ. 2003. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput 15:1667–1689.

Kooperberg C, Ruczinski I. 2005. Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol 28:157–170.

Liang KY, Hsu FC, Beaty TH. 2003. Multipoint linkage disequilibrium mapping for complex diseases. Genet Epidemiol 25:285–292.

Listgarten J, Damaraju S, Poulin B, Cook L, Dufour J, Driga A, Mackey J, Wishart D, Greiner R, Zanke B. 2004. Predictive models for breast cancer susceptibility from multiple single nucleotide polymorphisms. Clin Cancer Res 10:2725–2737.

Lucek P, Ott J. 1997. Neural network analysis of complex traits. Genet Epidemiol 14:1101–1106.

Mange EJ, Mange AP. 1994. Basic human genetics. Sunderland, Massachusetts: Sinauer Associates Inc.

Marchini J, Howie B, Myers S, McVean G, Donnelly P. 2007. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39:906–913.

Moore JH, Williams SM. 2005. Traversing the conceptual divide between biological and statistical epistasis: systems biology and a more modern synthesis. Bioessays 27:637–646.

Moore JH, Hahn LW, Ritchie MD, Thornton TA, White BC. 2002. Application of genetic algorithms to the discovery of complex models for simulation studies in human genetics. In: Langdon WB, Cantu-Paz E, Mathias K, Roy R, Davis D, Poli R, Balakrishnan K, Honavar V, Rudolph G, Wegener J, Bull L, Potter MA, Schultz AC, Miller JF, Burke E, Jonoska N, editors. Proceedings of the Genetic and Evolutionary Computation Conference. San Francisco: Morgan Kaufmann Publishers. p 1150–1155.

Nelson M, Kardia S, Ferrell R, Sing C. 2001. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res 11:458–470.

Ritchie M, Hahn L, Roodi N, Bailey R, Dupont W, Parl F, Moore J. 2001. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69:138–147.

Ritchie M, Hahn L, Moore J. 2003a. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol 24:150–157.

Ritchie M, White B, Parker J, Hahn L, Moore J. 2003b. Optimization of neural network architecture improves the power to identify gene-gene interaction in common diseases. BMC Bioinformatics 4:28.

Ruczinski I, Kooperberg C, LeBlanc M. 2003. Logic regression. J Comput Graph Stat 12:475–511.

Schwender H, Zucknick M, Ickstadt K, Bolt H, the GENICA network. 2004. A pilot study on the application of statistical classification procedure to molecular epidemiological data. Toxicol Lett 151:291–299.

Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, Prokunina-Olsson L, Ding CJ, Swift AJ, Narisu N, Hu T, Pruim R, Xiao R, Li XY, Conneely KN, Riebow NL, Sprau AG, Tong M, White PP, Hetrick KN, Barnhart MW, Bark CW, Goldstein JL, Watkins L, Xiang F, Saramies J, Buchanan TA, Watanabe RM, Valle TT, Kinnunen L, Abecasis GR, Pugh EW, Doheny KF, Bergman RN, Tuomilehto J, Collins FS, Boehnke M. 2007. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316:1341–1345.

Thornton-Wells T, Moore J, Haines L. 2004. Genetics, statistics and human disease: analytical retooling for complexity. Trends Genet 20:640–647.

Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH. 2007. A balanced accuracy function for epistasis modeling and imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol 31:306–315.

Xu J, Lowey J, Wiklund F, Sun J, Lindmark F, Hsu FC, Dimitrov L, Chang B, Turner A, Liu W, Adami HO, Suh E, Moore JH, Zheng SL, Isaacs WB, Trent JM, Gronberg H. 2005. The interaction of four genes in the inflammation pathway significantly predicts prostate cancer risk. Cancer Epidemiol Biomarkers Prev 14:2563–2568.
