a comparison of machine learning algorithms for chemical toxicity classification using a simulated...
Post on 30-Sep-2016
Embed Size (px)
ssBioMed CentBMC Bioinformatics
Open AcceResearch articleA comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data modelRichard Judson*1, Fathi Elloumi1, R Woodrow Setzer1, Zhen Li2 and Imran Shah1
Address: 1National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA and 2Dept of Biostatistics University of North Carolina, Chapel Hill, 3126 McGavran-Greenberg Hall, CB #7420, Chapel Hill, NC 27599-7420, USA
Email: Richard Judson* - email@example.com; Fathi Elloumi - firstname.lastname@example.org; R Woodrow Setzer - email@example.com; Zhen Li - firstname.lastname@example.org; Imran Shah - email@example.com
* Corresponding author
AbstractBackground: Bioactivity profiling using high-throughput in vitro assays can reduce the cost andtime required for toxicological screening of environmental chemicals and can also reduce the needfor animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints.Supervised machine learning is a powerful approach to discover combinatorial relationships incomplex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machinelearning (ML) methods.
Results: The classification performance of Artificial Neural Networks (ANN), K-NearestNeighbors (KNN), Linear Discriminant Analysis (LDA), Nave Bayes (NB), Recursive Partitioningand Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absenceof filter-based feature selection was analyzed using K-way cross-validation testing and independentvalidation on simulated in vitro assay data sets with varying levels of model complexity, number ofirrelevant features and measurement noise. While the prediction accuracy of all ML methodsdecreased as non-causal (irrelevant) features were added, some ML methods performed betterthan others. In the limit of using a large number of features, ANN and SVM were always in the topperforming set of methods while RPART and KNN (k = 5) were always in the poorest performingset. The addition of measurement noise and irrelevant features decreased the classificationaccuracy of all ML methods, with LDA suffering the greatest performance degradation. LDAperformance is especially sensitive to the use of feature selection. Filter-based feature selectiongenerally improved performance, most strikingly for LDA.
Conclusion: We have developed a novel simulation model to evaluate machine learning methodsfor the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemicaltoxicology. From our analysis, we can recommend that several ML methods, most notably SVM andANN, are good candidates for use in real world applications in this area.
Published: 19 May 2008
BMC Bioinformatics 2008, 9:241 doi:10.1186/1471-2105-9-241
Received: 22 January 2008Accepted: 19 May 2008
This article is available from: http://www.biomedcentral.com/1471-2105/9/241
2008 Judson et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Page 1 of 16(page number not for citation purposes)
BMC Bioinformatics 2008, 9:241 http://www.biomedcentral.com/1471-2105/9/241
BackgroundA daunting challenge faced by environmental regulatorsin the U.S. and other countries is the requirement thatthey evaluate the potential toxicity of a large number ofunique chemicals that are currently in common use (inthe range of 10,00030,000) but for which little toxicol-ogy information is available. The time and cost requiredfor traditional toxicity testing approaches, coupled withthe desire to reduce animal use is driving the search fornew toxicity prediction methods [1-3]. Several efforts arestarting to address this information gap by using relativelyinexpensive, high throughput screening approaches inorder to link chemical and biological space [1,4-21]. TheU.S. EPA is carrying out one such large screening and pri-oritization experiment, called ToxCast, whose goal is todevelop predictive signatures or classifiers that can accu-rately predict whether a given chemical will or will notcause particular toxicities . This program is investigat-ing a variety of chemically-induced toxicity endpointsincluding developmental and reproductive toxicity, neu-rotoxicity and cancer. The initial training set being usedcomes from a collection of ~300 pesticide active ingredi-ents for which complete rodent toxicology profiles havebeen compiled. This set of chemicals will be tested in sev-eral hundred in vitro assays.
The goal of screening and prioritization projects is to dis-cover patterns or signatures in the set of high throughputin vitro assays (high throughput screening or HTS, highcontent screening or HCS, and genomics) that are stronglycorrelated with tissue, organ or whole animal toxicologi-cal endpoints. One begins with chemicals for which toxi-cology data is available (training chemicals) and developsand validates predictive classification tools. Supervisedmachine learning (ML) approaches can be used todevelop empirical models that accurately classify the tox-icological endpoints from large-scale in vitro assay datasets. This approach is similar to QSAR (quantitative struc-ture activity relationship), which uses inexpensive calcu-lated chemical descriptors to classify a variety of chemicalphenotypes, including toxicity. By analogy, one could usethe term QBAR (for quantitative bio-assay/activity rela-tionship) to describe the use of in vitro biological assays topredict chemical activity. The QBAR strategy we describehere is also related to biomarker discovery from large-scale -omic data that is used to predict on- or off-targetpharmacology in drug development, or to discover accu-rate surrogates for disease state or disease progression.
The QBAR in vitro toxicology prioritization approach facesa number of inter-related biological and computationalchallenges. First, there may be multiple molecular targetsand mechanisms by which a chemical can trigger a biolog-
ple techniques (including ML methods) may be requiredto discover the underlying relationships between bio-assays and endpoint activity. Second, our present under-standing of biological mechanisms of toxicity (oftenreferred to as toxicity pathways) is relatively limited, sothat one cannot a priori determine which of a set of assayswill be relevant to a given toxicity phenotype. As a conse-quence, the relevant features may be missing from thedata set and (potentially many) irrelevant features may beincluded. Here, by relevant features we mean data fromassays that measure processes causally linked to the end-point of interest. By extension, irrelevant features includedata from assays not causally linked to the endpoint. Thepresence of multiple irrelevant assays or features must beeffectively managed by ML methods. Third, due to thehigh cost of performing the required in vivo studies, thereare limited numbers of chemicals for which high qualitytoxicology data is available, and typically only a smallfraction of these will clearly demonstrate the toxic effectbeing studied. The small numbers of examples and unbal-anced distribution of positive and negative instance for atoxicological endpoint can limit the ability of ML meth-ods to accurately generalize. In order to develop effectiveQBAR models of toxicity, these issues must be consideredin the ML strategy.
Four critical issues for evaluating the performance of MLmethods on complex datasets are: (1) the data set ormodel; (2) the set of algorithms evaluated; (3) themethod that is used to assess the accuracy of the classifica-tion algorithm; and (4) the method that is used for featureselection. In order to address the first issue, it was neces-sary to develop a model of chemical toxicity that capturedthe key points of the information flow in a biological sys-tem. The mathematical model we use is based on the fol-lowing ideas.
1. There are multiple biological steps connecting the ini-tial interaction of a molecule with its principle target(s)and the emergence of a toxic phenotype. The molecularinteraction can trigger molecular pathways, which whenactivated may lead to the differential activation of morecomplex cellular processes. Once enough cells areaffected, a tissue or organ level phenotype can emerge.
2. There will often be multiple mechanisms that give riseto the same phenotype, and this multiplicity of causalmechanisms likely exists at all levels of biological organi-zation. Multiple molecular interactions can lead to a sin-gle pathway being differentially regulated. Up-regulationof multiple pathways can lead to the expression of thesame cellular phenotype. This process continues throughthe levels of tissue, organ and whole animal. One canPage 2 of 16(page number not for citation purposes)
ical response. Assuming that these alternative biologicalmechanisms of action are represented in the data, multi-
think of the chain of causation between molecular triggers
BMC Bioinformatics 2008, 9:241 http://www.biomedcentral.com/1471-2105/9/241
and endpoints as a many-branched tree, potentially withfeedback from higher to lower levels of organization.
3. The number of assays one needs to measure is large,given our relative lack of knowledge of the underlyingmechanism linking direct chemical interactions with toxicen