1 7/27/2008 center for computational intelligence, learning, and discovery bioinformatics and...


TRANSCRIPT

  • Slide 1
  • 1 7/27/2008 Center for Computational Intelligence, Learning, and Discovery Bioinformatics and Computational Biology Program ROC 2008 meeting
    A Computational Method to Identify Amino Acid Residues in RNA-protein Interactions
    Michael Terribilini & Jae-Hyung Lee
    Cornelia Caragea, Deepak Reyon, Ben Lewis, Jeffry Sander, Robert Jernigan, Vasant Honavar and Drena Dobbs
    Bioinformatics and Computational Biology Program
    Center for Computational Intelligence, Learning, and Discovery
    L.H. Baker Center for Bioinformatics & Biological Statistics
    BCB NSF IGERT
  • Slide 2
  • PROBLEM: Given the sequence of a protein (& possibly its structure), predict which amino acids participate in protein-RNA interactions
    APPROACH: Generate datasets of known complexes from the PDB to train & test machine learning algorithms (Naïve Bayes, SVM, etc.)
    GOAL: Classify each amino acid in the target protein as either an interface or a non-interface residue
    Guiding hypothesis: Principal determinants of protein binding sites are reflected in local sequence features
    Observation: Binding site residues are often clustered within the primary amino acid sequence
  • Slide 3
  • Sequence-Based Classifier:
    • RB181 non-redundant dataset: 181 protein-RNA complexes from the PDB
    • Input: window of amino acid identities centered on the target & contiguous in protein sequence
    • Classifier: Naïve Bayes
    • Leave-one-out cross validation
    • Example: target Ser 28, window QSVSTSSFRYM
    Structure-Based Classifier:
    • Calculate distance between each pair of residues in the known structure
    • Input: identities of the nearest n spatial neighbors
    • Classifier: Naïve Bayes
    • Leave-one-out cross validation
    • Example: target Ser 28, neighbors SSFRLNKSGRT
    PSSM-Based Classifier:
    • PSI-BLAST against the NCBI nr database to generate PSSMs
    • Input: PSSM vectors for residues contiguous in sequence
    • Classifier: Support Vector Machine (SVM)
    • 10-fold cross validation
    • Example: target Ser 28, window QSVSTSSFRYM [Figure: PSSM vector per window residue; surviving values: -3,7,8, 5,-4,-6, ,5,9,-1, 20]
    PROBLEM: Given the sequence of a protein (& possibly its structure), predict which amino acids participate in protein-RNA interactions
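The sequence-based input on this slide can be illustrated with a minimal sketch. It assumes an 11-residue window (the length of the QSVSTSSFRYM example) and a hypothetical padding symbol `-` for positions that run past the protein termini; the slides do not say how the termini were handled, and the function name is illustrative.

```python
from typing import List

PAD = "-"  # hypothetical padding symbol; terminus handling is not described in the slides


def sequence_windows(sequence: str, window_size: int = 11) -> List[str]:
    """Return one window of amino acid identities per residue, centered on that residue."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the target residue is centered")
    half = window_size // 2
    # Pad both ends so that every residue, including terminal ones, gets a full window.
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + window_size] for i in range(len(sequence))]


# Usage: the sixth residue (Ser) of the slide's example window is the target.
windows = sequence_windows("QSVSTSSFRYM")
print(windows[5])
```

Each window would then be paired with the interface or non-interface label of its central residue to form one training example.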
  • Slide 4
  • Dataset of RNA-protein Interface Residues
    • PDB: Extract All Protein-RNA Complexes (503 Complexes)
    • Select high resolution structures (< 3.5 Å resolution)
    • Filter using PISCES (< 30% pair-wise sequence identity): 181 Chains, 48,791 Residues
    • Identify Interface Residues using distance cutoff of 5 Å:
      7,456 Interface Residues (Positive examples)
      41,335 Non-Interface Residues (Negative examples)
    PISCES: Wang and Dunbrack, 2003, Bioinformatics 19:1589
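The distance-cutoff labelling step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes a residue counts as an interface residue when any of its atoms lies within the cutoff of any RNA atom, the toy coordinates and residue identifiers are made up, and parsing of real PDB files is left out.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def label_interface_residues(
    protein_atoms: Dict[str, List[Coord]],
    rna_atoms: Sequence[Coord],
    cutoff: float = 5.0,
) -> Dict[str, bool]:
    """Label a residue True (interface) if any of its atoms is within `cutoff` of any RNA atom."""
    labels = {}
    for residue_id, atoms in protein_atoms.items():
        labels[residue_id] = any(
            math.dist(atom, rna_atom) <= cutoff
            for atom in atoms
            for rna_atom in rna_atoms
        )
    return labels


# Toy data (hypothetical coordinates in angstroms).
protein = {
    "SER28": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "GLY90": [(20.0, 0.0, 0.0)],
}
rna = [(4.0, 0.0, 0.0)]
labels = label_interface_residues(protein, rna)
print(labels)
```

The brute-force double loop is fine for a sketch; a spatial index would be the usual choice for whole-dataset runs.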
  • Slide 5
  • Performance in predicting interface residues, using only protein sequence as input

    Complex                   Protein-Protein                                Protein-DNA                                     Protein-RNA
    Classifier                2-stage classifier (SVM + Naïve Bayes)         Naïve Bayes                                     Naïve Bayes
    Accuracy                  72 %                                           77 %                                            85 %
    Specificity               58 %                                           37 %                                            51 %
    Sensitivity               39 %                                           43 %                                            38 %
    Correlation coefficient   0.30                                           0.25                                            0.35
    Reference                 Yan et al., 2004, Bioinformatics               Yan et al., 2006, BMC Bioinformatics            Terribilini et al., 2006, RNA
    Related work              Jones & Thornton, Ofran & Rost, many others    Jones et al., Thornton et al., Ahmad & Sarai    Jeong et al., Miyano et al., Go et al.
  • Slide 6
  • A few "good" predictions mapped onto structures, using only protein sequence as input
    • Protein-Protein (2-stage classifier: SVM + Naïve Bayes): Ab FabN10, Acc = 87%, CC = 0.65
    • Protein-DNA (Naïve Bayes): Repressor, Acc = 88%, CC = 0.66
    • Protein-RNA (Naïve Bayes): dsRNA Binding Protein, Acc = 86%, CC = 0.59
    Yan, Bioinformatics 2004; Yan, BMC Bioinformatics 2006; Terribilini, RNA 2006
  • Slide 7
  • Combining Sequence, Structure & PSSM-Based Classifiers Improves Prediction of RNA-Binding Residues

    Metric                     ID-Seq   ID-Struct   ID-PSSM   Combined
    Specificity (1)            0.55     0.63        0.51      0.53
    Sensitivity (2)            0.30     0.32        0.43      0.49
    Accuracy                   0.86     0.87        0.85      0.86
    Correlation Coefficient    0.33     0.38                  0.43
    AUC of ROC (3)             0.73     0.77        0.79      0.81

    (1) Specificity (Precision for the positive, RNA-binding class)
    (2) Sensitivity (Recall for the positive, RNA-binding class)
    (3) Area Under the Curve (AUC) from a Receiver Operating Characteristic (ROC) curve

    Predictions illustrated on 3D structures: 30S ribosomal protein S17 (PDB ID 1FJG:Q)
    [Figure panels: Sequence-Based, Structure-Based, PSSM-Based, Combined. For clarity, bound RNA is not shown.]
    TP = True Positive = interface residues predicted as such
    FP = False Positive = non-interface residues predicted as interface residues
    TN = True Negative = non-interface residues predicted as such
    FN = False Negative = interface residues predicted as non-interface
    Combined Results for 1FJG:Q: Spec+ = 0.89, Sens+ = 0.96, Accuracy = 0.91, Correlation Coefficient = 0.83
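The measures on this slide can be computed from the four counts TP, FP, TN and FN, as in the sketch below. It follows the slide's footnotes in treating specificity as precision and sensitivity as recall for the RNA-binding class. It also assumes that "correlation coefficient" means the Matthews correlation coefficient, which the slides do not state explicitly. The counts in the usage example are made up.

```python
import math
from typing import Dict


def interface_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the slide's performance measures from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one residue is required")
    # Specificity as defined on the slide: precision for the positive class.
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    # Sensitivity as defined on the slide: recall for the positive class.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total
    # Assumed form of the correlation coefficient: Matthews correlation coefficient.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "accuracy": accuracy,
        "correlation_coefficient": correlation,
    }


# Usage with hypothetical counts for a 100-residue protein.
metrics = interface_metrics(tp=8, fp=2, tn=85, fn=5)
print(metrics)
```

Because non-interface residues far outnumber interface residues in this dataset, accuracy alone is uninformative, which is presumably why the slides report the correlation coefficient and AUC alongside it.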
  • Slide 8
  • Predictions for Signal Recognition Particle 19kDa protein (PDB ID 1JID_A)
    • IDSeq Predictions: Accuracy = 80%, Specificity = 56%, Sensitivity = 21%, CC = 0.25
    • Combined Predictions: Accuracy = 82%, Specificity = 55%, Sensitivity = 75%, CC = 0.52
  • Slide 9
  • RNABindR: An RNA Binding Site Prediction Server
  • Slide 10
  • Applications
    • Lentiviral Rev proteins
    • Telomerase Reverse Transcriptase (TERT) http://telomerase.asu.edu/
  • Slide 11
  • Rev - a potential target for novel HIV therapies
    • Rev is a multifunctional regulatory protein that plays an essential role in the production of infectious virus
    • A small nucleocytoplasmic shuttling protein (HIV Rev 115 aa; EIAV Rev 165 aa)
    • Recognizes a specific binding site on viral RNA: the Rev Responsive Element (RRE)
    • Contains specific domains that mediate nuclear localization, RNA binding and nuclear export
    • Rev's critical role in lentiviral replication makes it an attractive target for antiviral (AIDS) therapy
  • Slide 12
  • Problem: no high resolution Rev structure! - not even for HIV Rev, despite intense effort
    Why?
    • Rev aggregates at concentrations needed for NMR or X-ray crystallography
    • The only high resolution information available is for short peptide fragments of HIV-1 Rev: a 22 amino acid fragment of Rev bound to a 34 nucleotide RRE RNA fragment
    What about insights from sequence comparisons?
    • HIV Rev sequence has low sequence identity with proteins with known structure
    • Very little sequence similarity among different Rev family members (e.g., EIAV vs HIV < 10%)
  • Slide 13
  • HIV-1 Rev: Predictions vs Experiments
    Prediction on RNA-binding protein HIV-1 Rev:
        33         43         53
        DTRQARRNRR RRWRERQRAA AA
    [Figure: rows labelled "Actual IR" and "Predicted" mark residues with + under the sequence; surviving marks: ++++++++++ ++++++++]
    Sequence-based prediction on HIV-1 Rev (not included in the training set) identified every interface residue, plus 3 false positives
    [Figure panels: Predicted, Actual]
    NMR structure (1ETF:B): 22 aa Rev peptide bound to RNA. Battiste et al., 1996, Science 273:1547
    Interface residues = red; Non-interface residues = grey; RNA = green
  • Slide 14
  • EIAV Rev: Predictions vs Experiments
    PREDICTED: Structure; Protein binding residues; RNA binding residues
    VALIDATED: Protein binding residues; RNA binding residues
    [Figure: domain map of EIAV Rev with positions 31, 57, 125, 145, 165; motifs RRDRW, ERLE, KRRRK; NES; NLS]
    [Figure: constructs 57-165; MBP WT; 31-165; 31-145; 145-165]
    Sequence rows with + marks:
        41 51
        GP L ESDQWCRV L RQS L PEEKISSQTCI
        ++ + +++++ + +
        61 71 81 91
        ARRHLGPGPTQHTPS RRDRW IREQILQAEVLQ ERLE WRI
        +++++++++++++++ +++++++++++++ +++
        131 141 151 161
        QRGDFSAWGDYQQAQERRWGEQSSPRVLRPGDS KRRRK HL
        ++++++++++ ++ +++ ++++ ++
    Lee, J Virol 2006; Terribilini, RNA 2006
    Ihm Ho Carpenter
  • Slide 15
  • Mutations in EIAV Rev: Experimental evaluation of RNA binding sites
    [Figure: domain map of EIAV Rev with positions 31, 57, 125, 145, 165; motifs ERDE, RRDRW, ERLE, KRRRK; NES; NLS; mutated motifs AADAA, AALA, KAAAK]
    [Figure: constructs KAAAK, AADAA, AALA, ERDE, WT]
    Sequence rows with + marks:
        41 51
        GP L ESDQWCRV L RQS L PEEKISSQTCI
        ++ + +++++ + +
        61 71 81 91
        ARRHLGPGPTQHTPS RRDRW IREQILQAEVLQ ERLE WRI
        +++++++++++++++ +++++++++++++ +++
        131 141 151 161
        QRGDFSAWGDYQQAQERRWGEQSSPRVLRPGDS KRRRK HL
        ++++++++++ ++ +++ ++++ ++
    Lee, J Virol 2006; Terribilini, RNA 2006
  • Slide 16
  • 16 7/27/2008 Center for Computational Intelligence, Learning, and Discovery Bioinforma