
Post on 19-Jan-2016






Assessing the Performance of Macromolecular Sequence Classifiers

Cornelia Caragea, Jivko Sinapov, Michael Terribilini, Drena Dobbs and Vasant Honavar

Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science

Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs.

  • Compared two variants of k-fold cross-validation: window-based and sequence-based. Results suggest that window-based cross-validation can yield overly optimistic estimates of classifier performance relative to the estimates obtained using sequence-based cross-validation. Because predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence, we believe that sequence-based cross-validation provides more realistic estimates of performance than window-based cross-validation.

Introduction

Machine learning offers some of the most cost-effective approaches to building predictive models (e.g., classifiers) for a broad range of applications in computational biology, e.g., given an amino acid sequence, identifying the amino acid residues that are likely to bind to RNA. Comparing the effectiveness of different algorithms requires reliable procedures for accurately assessing the performance (e.g., accuracy, sensitivity, and specificity) of the resulting classifiers.

Evaluating the Performance of Classifiers

K-Fold Cross-Validation

Window-based Cross-Validation: the training and test data typically correspond to disjoint sets of sequence windows. Similar or identical instances are removed from the dataset to avoid overestimation of performance measures.

Sequence-based Cross-Validation: the training and test data typically correspond to disjoint sets of sequences. All instances belonging to the same sequence end up in the same set, preserving the natural distribution of the original sequence dataset.

Results

Fig 1. Comparison of Area Under the ROC Curve (AUC) (upper plots) and Matthews Correlation Coefficient (lower plots) between window-based and sequence-based cross-validation with varying dataset size. Panels: a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface.
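The two cross-validation schemes defined in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the `(seq_id, window, label)` instance layout and all function names are assumptions, and the removal of similar or identical windows that the window-based scheme performs is omitted.

```python
import random
from collections import defaultdict


def window_based_folds(instances, k, seed=0):
    """Assign individual windows to k folds, ignoring their source sequence."""
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def sequence_based_folds(instances, k, seed=0):
    """Assign whole sequences to k folds; each window follows its sequence."""
    by_seq = defaultdict(list)
    for i, (seq_id, _window, _label) in enumerate(instances):
        by_seq[seq_id].append(i)
    seq_ids = sorted(by_seq)
    random.Random(seed).shuffle(seq_ids)
    folds = [[] for _ in range(k)]
    for j, seq_id in enumerate(seq_ids):
        folds[j % k].extend(by_seq[seq_id])
    return folds


def shared_sequences(instances, folds, test_fold):
    """Sequence ids that contribute windows to both the test and training folds."""
    test_ids = {instances[i][0] for i in folds[test_fold]}
    train_ids = {
        instances[i][0]
        for f, fold in enumerate(folds) if f != test_fold
        for i in fold
    }
    return test_ids & train_ids
```

Calling `shared_sequences` on each fold of a sequence-based split always returns an empty set, because no sequence is divided between folds. A window-based split carries no such guarantee, so windows from one sequence can land on both sides of the train/test boundary.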


Machine Learning Classifiers

Support Vector Machine: 0/1 String Kernel

Datasets

O-GlycBase dataset: contains experimentally verified glycosylation sites compiled from protein databases and the literature. (http://www.cbs.dtu.dk/databases/OGLYCBASE/)

RNA-Protein Interface dataset, RP147: consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank. (http://bindr.gdcb.iastate.edu/RNABindR/)

Protein-Protein Interface dataset: consists of protein-binding protein sequences.

Table 1. Number of positive (+) and negative (-) instances used in our experiments for O-GlycBase, RNA-Protein, and Protein-Protein Interface datasets.

Feature Extraction

Local window of length $2n+1$: $x = x_{-n} x_{-n+1} \cdots x_{-1} x_0 x_1 \cdots x_{n-1} x_n$, with each target residue $x_0$ in the middle and its $n$ neighbor residues $x_i$, $i = -n, \ldots, n$, $i \neq 0$, on each side, as input to the classifier. Here $x_i \in \Sigma$ for $i = -n, \ldots, n$ and $x \in \Sigma^*$, where $\Sigma$ represents the 20-letter amino acid alphabet.

For the glycosylation dataset: a local window is extracted for each S/T glycosylation or non-glycosylation site, $x_0 \in \{S, T\}$.

For the RNA-Protein and Protein-Protein Interface datasets: a local window is extracted for every residue in a protein sequence, $x_0 \in \Sigma$, using the sliding window approach.

Conclusion

  • Eliminating similar or identical sequence windows from the dataset perturbs the natural distribution of the data extracted from the original sequence dataset. Ideally, the performance of the classifier must be estimated using the natural data distribution.
  • Train and test sets are likely to contain some instances that originate from the same sequence. This violates the independence assumption between train and test sets.
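The local-window feature extraction described in this section can be sketched as below. The function name and arguments are hypothetical. The poster does not say how residues near the sequence termini are handled, so this sketch assumes they are skipped whenever a full window of length 2n+1 does not fit.

```python
def extract_windows(sequence, labels, n, targets=None):
    """Return (window, label) pairs for target residues with n neighbors per side.

    sequence: amino acid string; labels: one label per residue.
    targets: optional set of residue letters allowed at the center x_0,
             e.g. {"S", "T"} for glycosylation sites; None means every residue.
    Assumption: residues closer than n positions to either end are skipped.
    """
    windows = []
    for i in range(n, len(sequence) - n):
        if targets is not None and sequence[i] not in targets:
            continue
        windows.append((sequence[i - n:i + n + 1], labels[i]))
    return windows
```

Passing `targets={"S", "T"}` corresponds to the glycosylation setting, where only S/T sites are centers. Leaving `targets` as `None` corresponds to the sliding-window setting used for the interface datasets.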

Naive Bayes: Identity windows
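As a rough illustration of a Naive Bayes classifier over windows, the sketch below treats the residue identity at each window position as one categorical feature. That reading of "identity windows" is an assumption, as are the class name, the add-one (Laplace) smoothing, and the 20-letter alphabet constant. None of it is taken from the authors' implementation.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (assumed alphabet)


class WindowNaiveBayes:
    """Naive Bayes with one categorical feature per window position."""

    def fit(self, windows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.total = len(labels)
        width = len(windows[0])
        # counts[c][pos][residue] = number of class-c windows with that residue at pos
        self.counts = {c: [Counter() for _ in range(width)] for c in self.classes}
        for window, label in zip(windows, labels):
            for pos, residue in enumerate(window):
                self.counts[label][pos][residue] += 1
        return self

    def log_score(self, window, c):
        """Unnormalized log posterior: log P(c) + sum over positions of log P(x_i | c)."""
        score = math.log(self.class_counts[c] / self.total)
        for pos, residue in enumerate(window):
            num = self.counts[c][pos][residue] + 1          # add-one smoothing
            den = self.class_counts[c] + len(ALPHABET)
            score += math.log(num / den)
        return score

    def predict(self, window):
        return max(self.classes, key=lambda c: self.log_score(window, c))
```

A model fitted with `WindowNaiveBayes().fit(windows, labels)` labels a new window via `predict(window)`. Working in log space keeps the product of per-position probabilities from underflowing on long windows.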

Table 1

Dataset           Number of Sequences   Number of + Instances   Number of - Instances
O-GlycBase        216                   2168                    12147
RNA-Protein       147                   4336                    27988
Protein-Protein   42                    2350                    9204

