IT og Sundhed 2009/10
Sequence based predictors. Secondary structure and surface accessibility
Bent Petersen, 7 January 2010
NetSurfP: Real Value Solvent Accessibility predictions with amino acid associated reliability
Objective
• Predict residues as being either buried or exposed (25 % threshold)
- Two states/classes, Buried/Exposed
• Predict the Relative Solvent Accessibility, RSA
- “Real” Value
What is ASA?
• Accessible Surface Area (solvent accessible), Å²
• Surface area accessible to a rolling water molecule
RSA
RSA = ACC / ASA
- RSA = Relative Solvent Accessibility
- ACC = Accessible area in protein structure
- ASA = Accessible Surface Area in Gly-X-Gly or Ala-X-Ala
Classification networks vs. “Real” value networks
• Classification: Buried = RSA < 25 %, Exposed = RSA > 25 %
• “Real” value: values 0 - 1, RSA > 1 set to 1
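The two targets can be sketched in a few lines of Python, taking RSA as the accessible area in the structure (ACC) divided by the reference area (ASA) for that residue type. The reference areas in `MAX_ASA` are placeholder numbers chosen for illustration, not the table used by NetSurfP, and the function names are hypothetical. The slides do not say which class a residue at exactly 25 % belongs to; this sketch assumes buried.

```python
# Placeholder reference areas in Å² (illustrative only, NOT the scale used
# by NetSurfP); substitute a published Gly-X-Gly or Ala-X-Ala table.
MAX_ASA = {"A": 110.0, "G": 80.0, "S": 120.0, "W": 240.0}


def relative_accessibility(acc, residue, max_asa=MAX_ASA):
    """Return RSA = ACC / ASA, with values above 1 set to 1."""
    rsa = acc / max_asa[residue]
    return min(rsa, 1.0)


def two_state(rsa, threshold=0.25):
    """Classify as exposed above the threshold, otherwise buried.

    Assumption: a residue at exactly the threshold counts as buried.
    """
    return "exposed" if rsa > threshold else "buried"
```

For example, an alanine with ACC = 11 Å² would get RSA = 0.1 under these placeholder areas and be classified as buried.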
Why predict RSA?
• Residues exposed on the surface can be:
- Involved in PTMs
- Potential epitopes
- Involved in protein-protein interactions
• Prediction of disease SNPs
How to start?
• What do we want?
- We want to be able to predict the exposure of an AA
• What do we need?
- A training dataset and an independent evaluation dataset
• What information do we need?
- True structural information the Neural Network can train on
• Where do we get that?
- PDB, DSSP
Protein Data Bank, PDB
Berman, H.M., et al., The Protein Data Bank. Nucl. Acids Res., 2000. 28(1): p. 235-242.
Define Secondary Structure of Proteins, DSSP
Kabsch, W. and C. Sander, Dictionary of Protein Secondary Structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577-2637.
==== Secondary Structure Definition by the program DSSP, updated CMBI version by ElmK / April 1,2000 ==== DATE=23-MAR-2009 .
REFERENCE W. KABSCH AND C.SANDER, BIOPOLYMERS 22 (1983) 2577-2637 .
HEADER TOXIN 12-AUG-98 3BTA .
COMPND 2 MOLECULE: PROTEIN (BOTULINUM NEUROTOXIN TYPE A); .
SOURCE 2 ORGANISM_SCIENTIFIC: CLOSTRIDIUM BOTULINUM; .
AUTHOR R.C.STEVENS,D.B.LACY .
1277 2 2 1 1 TOTAL NUMBER OF RESIDUES, NUMBER OF CHAINS, NUMBER OF SS-BRIDGES(TOTAL,INTRACHAIN,INTERCHAIN) .
55121.0 ACCESSIBLE SURFACE OF PROTEIN (ANGSTROM**2) .
815 63.8 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(J) , SAME NUMBER PER 100 RESIDUES .
24 1.9 TOTAL NUMBER OF HYDROGEN BONDS IN PARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES .
198 15.5 TOTAL NUMBER OF HYDROGEN BONDS IN ANTIPARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES .
1 0.1 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-5), SAME NUMBER PER 100 RESIDUES .
10 0.8 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-4), SAME NUMBER PER 100 RESIDUES .
125 9.8 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+2), SAME NUMBER PER 100 RESIDUES .
134 10.5 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+3), SAME NUMBER PER 100 RESIDUES .
276 21.6 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+4), SAME NUMBER PER 100 RESIDUES .
9 0.7 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+5), SAME NUMBER PER 100 RESIDUES .
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 *** HISTOGRAMS OF *** .
0 0 0 0 0 3 3 1 2 1 0 3 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 2 RESIDUES PER ALPHA HELIX .
2 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PARALLEL BRIDGES PER LADDER .
15 10 7 5 8 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ANTIPARALLEL BRIDGES PER LADDER .
3 3 0 0 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LADDERS PER SHEET .
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 1 A P 0 0 5 0, 0.0 2,-3.8 0, 0.0 3,-0.2 0.000 360.0 360.0 360.0 132.0 74.7 55.7 73.4
2 2 A F - 0 0 115 92,-0.4 93,-0.1 1,-0.1 36,-0.1 -0.206 360.0-142.1 55.7 -62.1 74.7 59.2 74.7
3 3 A V - 0 0 11 -2,-3.8 35,-0.2 91,-0.1 -1,-0.1 0.867 4.9-143.8 70.2 103.3 78.3 59.8 73.7
4 4 A N S S+ 0 0 127 33,-0.3 2,-0.5 -3,-0.2 33,-0.1 0.914 73.7 44.0 -67.5 -53.8 80.1 61.9 76.4
5 5 A K S S- 0 0 94 32,-0.1 2,-0.5 1,-0.0 -1,-0.1 -0.857 79.6-124.0-105.1 133.1 82.5 64.2 74.5
6 6 A Q - 0 0 192 -2,-0.5 2,-0.1 1,-0.1 82,-0.1 -0.568 35.9-150.4 -71.8 118.5 81.6 66.2 71.4
7 7 A F - 0 0 14 -2,-0.5 2,-0.3 80,-0.1 3,-0.1 -0.388 16.9-164.3 -91.4 166.8 84.2 65.3 68.7
8 8 A N > - 0 0 71 -2,-0.1 3,-0.9 1,-0.1 77,-0.0 -0.977 28.9-124.4-143.4 141.5 85.7 67.1 65.7
9 9 A Y T 3 S+ 0 0 17 -2,-0.3 -1,-0.1 1,-0.2 72,-0.1 0.908 109.3 50.7 -57.8 -43.3 87.5 65.3 62.9
10 10 A K T 3 S+ 0 0 141 -3,-0.1 -1,-0.2 70,-0.1 3,-0.1 0.650 77.9 122.5 -70.3 -17.2 90.7 67.4 63.3
11 11 A D S < S- 0 0 45 -3,-0.9 3,-0.1 1,-0.1 2,-0.1 -0.203 77.6 -91.4 -48.0 134.3 91.0 66.8 67.1
12 12 A P - 0 0 99 0, 0.0 -1,-0.1 0, 0.0 -2,-0.1 -0.246 38.0-108.3 -57.6 128.3 94.4 65.3 67.8
13 13 A V + 0 0 41 -3,-0.1 6,-0.2 1,-0.1 4,-0.1 -0.238 38.6 179.2 -51.8 138.5 94.8 61.5 67.8
14 14 A N - 0 0 67 4,-3.7 2,-1.4 2,-0.2 5,-0.2 -0.085 45.1-107.4-144.3 45.7 95.4 60.3 71.4
15 15 A G S S+ 0 0 0 122,-0.4 2,-0.3 3,-0.2 4,-0.2 0.248 100.3 58.5 54.3 -18.1 95.7 56.6 71.7
16 16 A V S S- 0 0 72 -2,-1.4 -2,-0.2 2,-0.5 20,-0.1 -0.996 116.3 -7.4-142.5 145.9 92.2 56.3 73.3
17 17 A D S S+ 0 0 22 -2,-0.3 19,-2.5 18,-0.1 2,-0.2 0.389 136.6 45.3 53.3 -7.2 88.7 57.3 72.3
18 18 A I E S+A 35 0A 6 17,-0.3 -4,-3.7 -11,-0.0 -2,-0.5 -0.649 85.9 128.7-161.1 96.3 90.4 59.0 69.2
Define Secondary Structure of Proteins, DSSP
• DSSP defines 8 types of secondary structure
- G = 3-turn helix (3-10 helix)
- H = 4-turn helix (α-helix)
- I = 5-turn helix (π-helix)
- T = Hydrogen bonded turn (3, 4 or 5 turn)
- E = Extended strand
- B = Residue in isolated β-bridge
- S = Bend
- Rest is C = coil
Required datasets
• Training/test
- Used for optimization of settings using 10-fold cross-validation
• Evaluation
- Used for final evaluation; less than 25 % sequence identity to the training/test dataset.
10-fold Cross Validation
- Break the dataset into 10 sets, each 1/10 of the data
- Train on 9 sets and test on the remaining 1
- Repeat 10 times, so each set is used for testing once, and take the mean accuracy
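The procedure above can be sketched as a small generator that yields the ten train/test partitions. This is a generic illustration, not the scripts used for NetSurfP; the function names, the shuffling step and the `evaluate` callback are assumptions made for the example.

```python
import random


def ten_fold_splits(items, n_folds=10, seed=0):
    """Yield (train, test) lists; each item is in exactly one test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random partitioning
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, evaluate, n_folds=10):
    """Mean of evaluate(train, test) over all folds."""
    scores = [evaluate(train, test)
              for train, test in ten_fold_splits(items, n_folds)]
    return sum(scores) / len(scores)
```

Here `evaluate` would train a network on `train` and return its accuracy on `test`; with protein data the split is normally done per sequence rather than per residue, so that windows from one protein do not end up on both sides.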
Learning / Training dataset
• Training set: Cull_1764
- Max. Seq. ID: 25 %
- Resolution: ≤ 2.0 Å
- R-Factor: ≤ 0.2
- Seq. Length 30-3000 AA
- Including X-ray entries only
PISCES
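The per-entry criteria listed above can be written as a simple filter. The sketch below is only an illustration of those thresholds: the `Entry` record and its field names are hypothetical, and the 25 % maximum sequence identity is a pairwise criterion handled by the culling step itself, so it is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical record describing one protein chain."""
    chain_id: str
    method: str        # experimental method, e.g. "X-RAY"
    resolution: float  # in Å
    r_factor: float
    length: int        # number of amino acids


def passes_cull(entry):
    """Apply the per-entry thresholds from the slide."""
    return (entry.method == "X-RAY"
            and entry.resolution <= 2.0
            and entry.r_factor <= 0.2
            and 30 <= entry.length <= 3000)
```

A list of candidate chains could then be reduced with `[e for e in entries if passes_cull(e)]` before the identity-based culling.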
Learning / Training dataset
• Homology-reduced against the evaluation set CB513 (302 sequences removed)
• Final Training set:
- 1764 sequences
- 417,978 amino acids
‣ Buried: 55.80 % (233,221 amino acids)
‣ Exposed: 44.20 % (184,757 amino acids)
Learning / Training dataset
---Sequence/residue statistics---
Number of sequences: 1764
Longest sequence: 1T3T.A (1283)
Shortest sequence: 1YTV.M (6)
Number of amino acids: 417978
---Assignment category statistics---
B 184757 (44.20%)
A 233221 (55.80%)
---Amino acid statistics---
H 10025 (2.40%)
G 31743 (7.59%)
Y 14927 (3.57%)
V 30171 (7.22%)
E 27774 (6.64%)
S 24430 (5.84%)
P 19589 (4.69%)
A 35658 (8.53%)
R 21435 (5.13%)
Q 15535 (3.72%)
C 5202 (1.24%)
K 23054 (5.52%)
L 38489 (9.21%)
N 17756 (4.25%)
T 22998 (5.50%)
F 17181 (4.11%)
D 24743 (5.92%)
I 23550 (5.63%)
W 6365 (1.52%)
M 7353 (1.76%)
Evaluation dataset
• Final Evaluation dataset: CB513
- 513 non-homologous sequences
- Seq. Length 20-754 AA
- 84,119 amino acids
- Buried: 55.81 % (46,948 amino acids)
- Exposed: 44.19 % (37,171 amino acids)
Evaluation dataset
---Sequence/residue statistics---
Number of sequences: 513
Longest sequence: 6acn.all (754)
Shortest sequence: 1atpi-1-DOMAK.all (20)
Number of amino acids: 84119
---Assignment category statistics---
B 37171 (44.19%)
A 46948 (55.81%)
---Amino acid statistics---
R 3812 (4.53%)
T 5015 (5.96%)
D 4973 (5.91%)
C 1381 (1.64%)
Y 3065 (3.64%)
G 6657 (7.91%)
N 3976 (4.73%)
V 5795 (6.89%)
I 4642 (5.52%)
A 7267 (8.64%)
S 5222 (6.21%)
K 4976 (5.92%)
P 3903 (4.64%)
E 5050 (6.00%)
L 7134 (8.48%)
Q 3108 (3.69%)
M 1710 (2.03%)
H 1865 (2.22%)
W 1236 (1.47%)
F 3268 (3.88%)
X 19 (0.02%)
B 31 (0.04%)
Z 14 (0.02%)
Amino acid distribution (% of residues per dataset)
Amino acid   Cull/Learning   CB513
A            8.53            8.64
C            1.24            1.64
D            5.92            5.91
E            6.64            6.00
F            4.11            3.88
G            7.59            7.91
H            2.40            2.22
I            5.63            5.52
K            5.52            5.92
L            9.21            8.48
M            1.76            2.03
N            4.25            4.73
P            4.69            4.64
Q            3.72            3.69
R            5.13            4.53
S            5.84            6.21
T            5.50            5.96
V            7.22            6.89
W            1.52            1.47
Y            3.57            3.64
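Percentages like those in the amino acid distribution can be obtained by counting residues over all sequences in a dataset. The helper below is a minimal sketch with a hypothetical function name; it is not the script that produced the statistics in these slides.

```python
from collections import Counter


def composition(sequences):
    """Return {amino acid: percentage of all residues} for a dataset."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)  # counts each one-letter residue code
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}
```

Running it on the training and evaluation sets separately and comparing the two dictionaries gives the side-by-side view used to check that the datasets have similar composition.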
Neural Network - Input
• Position Specific Scoring Matrices, PSSM
A R N D C Q E G H I L K M F P S T W Y V
B H 2BEM.A 1 -4 -3 -2 -4 -6 -2 -3 -5 11 -6 -5 -3 -4 -4 -5 -3 -4 -5 -1 -6
A G 2BEM.A 2 -2 -5 -3 -4 -5 -4 -5 7 -5 -7 -6 -4 -5 -6 -5 -3 -4 -5 -6 -6
A Y 2BEM.A 3 -1 1 -4 -3 -5 -4 -4 -4 1 -4 -1 -4 -1 2 -5 0 -1 4 7 -2
A V 2BEM.A 4 -1 -5 -5 -6 -4 -4 -5 -5 -5 4 1 -5 6 -3 -2 -2 0 -5 -4 4
B E 2BEM.A 5 -2 -4 -3 0 -4 -1 3 -2 -4 0 -3 -2 1 -2 -3 3 3 -5 -4 0
Four iterations of PSI-BLAST against nr70
• Secondary Structure predictions
B H 2BEM.A 1 0.003 0.003 0.966
A G 2BEM.A 2 0.018 0.086 0.868
A Y 2BEM.A 3 0.020 0.199 0.752
A V 2BEM.A 4 0.021 0.271 0.679
B E 2BEM.A 5 0.020 0.199 0.752
(secondary structure predictor by Pernille Andersen)
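As a rough illustration of how these two sources could be combined, the sketch below joins the 20 PSSM scores and the 3 secondary structure values for one residue into a single 23-number feature vector. Plain concatenation without rescaling is an assumption made for the example; the slides do not describe the actual encoding, and the function name is hypothetical.

```python
def residue_features(pssm_row, ss_probs):
    """Concatenate 20 PSSM scores and 3 secondary structure values."""
    if len(pssm_row) != 20 or len(ss_probs) != 3:
        raise ValueError("expected 20 PSSM scores and 3 structure values")
    return [float(v) for v in pssm_row] + [float(p) for p in ss_probs]


# Values for residue 1 (H) of 2BEM.A as listed on the slide.
pssm_h1 = [-4, -3, -2, -4, -6, -2, -3, -5, 11, -6,
           -5, -3, -4, -4, -5, -3, -4, -5, -1, -6]
ss_h1 = [0.003, 0.003, 0.966]
features_h1 = residue_features(pssm_h1, ss_h1)
```

The ninth PSSM column (H) carries the score 11 for this histidine, which is the kind of conservation signal the network can pick up.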
Secondary structure predictor
• Developed by Pernille Andersen, incorporated in NetSurfP
• Trained on 2,085 sequences using DSSP
- Three classes derived from DSSP: H = H; E = E; C = ., G, I, B, S and T
- H ~ 30 %, E ~ 20 %, C ~ 50 %
• Performance of ~80 %
• Maximum theoretical limit is ~88 %
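The reduction from DSSP states to three classes described above (H stays H, E stays E, everything else becomes C) can be sketched as follows. Treating any other character, such as a blank or a dash, as coil is an assumption of this example, and the function name is hypothetical.

```python
# H and E are kept; ., G, I, B, S and T all become C (coil).
THREE_STATE = {"H": "H", "E": "E"}


def reduce_states(dssp_string):
    """Map a string of DSSP states to the three classes H, E and C."""
    # Assumption: any symbol other than H or E is treated as coil.
    return "".join(THREE_STATE.get(state, "C") for state in dssp_string)
```

For instance, `reduce_states("HHGGEEB.ST")` gives `"HHCCEECCCC"`.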
Neural Network - Settings
• Window Size: 11-19
• Hidden units: 10, 20, 25, 30, 40, 50, 75, 150, (200)
• Learning rate: 0.01 / (0.005)
• Epochs (training rounds): 200
• 10-fold cross-validation
- 9/10 used for training, 1/10 for testing
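One way to read these settings is as a grid of architectures, each trained once per cross-validation fold. The sketch below enumerates such a grid under two assumptions that the slides do not state: window sizes are the odd values from 11 to 19, and the bracketed options (200 hidden units, learning rate 0.005) are left out.

```python
from itertools import product

WINDOW_SIZES = [11, 13, 15, 17, 19]                # assumption: odd sizes only
HIDDEN_UNITS = [10, 20, 25, 30, 40, 50, 75, 150]   # bracketed 200 left out
LEARNING_RATE = 0.01                               # bracketed 0.005 left out
EPOCHS = 200
N_FOLDS = 10


def architectures():
    """Enumerate every window size / hidden unit combination."""
    return [{"window": w, "hidden": h,
             "learning_rate": LEARNING_RATE, "epochs": EPOCHS}
            for w, h in product(WINDOW_SIZES, HIDDEN_UNITS)]
```

Under these assumptions there are 40 architectures, and 40 architectures times 10 folds gives 400 networks, which matches the "All Architectures" count in the results; the slides do not confirm that this is how the number arises.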
Neural network window
Sliding window of 7
170 2BEM.A mol:aa CHITIN-BINDING PROTEIN
HGYVESPASRAYQCKLQLNTQCGSVQYEPQSVEGLKGFPQAGPADGHIASADKSTFFELDQQTPTRWNKLNLKTGPNSFTWKLTARHSTTSWRYFITKPNWDASQPLTRASFDLTPFCQFNDGGAIPAAQVTHQCNIPADRSGSHVILAVWDIADTANAFYQAIDVNLSK
BAAABBAAAAAAAABBBBABBABBAABBABAABABBBAABBBABBABAAAABBBBABAAABABBBAABABBABAABABAAABABBBBAABAAAAAAABBBABABBBAAABAABBBAAAAAABBBBBABBBABABABAABBABBBAAAAAAAAABBBBBAAAAAAAABABB
Prediction on middle residue: Serine, buried
Prediction on middle residue: Proline, exposed
Prediction on middle residue: Alanine, exposed
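The sliding window can be sketched as a function that returns one window per residue, with the residue to be predicted in the middle. Padding the ends with a placeholder character is an assumption of this example; the slides do not say how the first and last residues are handled, and the function name is hypothetical.

```python
def windows(sequence, size=7, pad="X"):
    """Return one window of odd length `size` per residue, centred on it."""
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    # Assumption: positions outside the sequence are filled with `pad`.
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


# Start of the 2BEM.A sequence shown on the slide.
start = "HGYVESPASRAYQ"
serine_window = windows(start)[5]   # window centred on residue 6 (S)
```

Here `serine_window` is `"YVESPAS"`; moving one step along the sequence gives `"VESPASR"`, centred on the proline, and the next step is centred on the alanine.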
Method
• Error function:
• Z-score:
Wisdom of the crowd
• Selecting best performing network architectures based on test performance
• Better than choosing any single network
[Figure: 10-fold % correct predictions, average of sets A-J with secondary structure, plotted against the number of top architectures averaged]
Average of   % correct
top 1        79.55
top 2        79.66
top 3        79.69
top 4        79.72
top 5        79.75
top 6        79.75
top 7        79.75
top 8        79.74
top 9        79.75
top 10       79.76
top 11       79.77
top 12       79.77
top 13       79.76
top 14       79.75
top 15       79.76
top 16       79.75
top 17       79.75
top 18       79.76
top 19       79.77
top 20       79.77
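The idea of averaging the best networks can be sketched as below: rank the networks by test performance, keep the top N, and average their outputs residue by residue. Taking a plain unweighted mean of the outputs is an assumption of this example; the slides do not specify how the networks are combined, and the function names are hypothetical.

```python
def ensemble_average(predictions):
    """Average several networks' outputs position by position."""
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]


def top_n_average(networks, n):
    """networks: list of (test_performance, outputs) pairs.

    Keeps the n networks with the highest test performance and
    returns the unweighted mean of their outputs.
    """
    ranked = sorted(networks, key=lambda item: item[0], reverse=True)
    return ensemble_average([outputs for _, outputs in ranked[:n]])
```

With three toy networks scoring 0.79, 0.80 and 0.70, `top_n_average(nets, 2)` averages only the first two and ignores the weakest one.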
Results - Classification networks
• Training:
                           % Correct   MCC     # Networks
Best Single Architecture   79.5        0.587   10
All Architectures          79.7        0.592   400
Top 20 Architectures       79.8        0.593   200
• Evaluation:
                           % Correct   MCC
Dor and Zhou               78.8        Not published
NetSurfP CB500/CB513       79.000      0.577
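The two measures reported here, % correct and the Matthews correlation coefficient (MCC), can be computed from the true and predicted classes as sketched below. Encoding the two classes as 1 and 0 and returning 0 when the MCC denominator is zero are choices made for this example; the function names are hypothetical.

```python
import math


def percent_correct(true, pred):
    """Percentage of positions where prediction and truth agree."""
    hits = sum(1 for t, p in zip(true, pred) if t == p)
    return 100.0 * hits / len(true)


def mcc(true, pred):
    """Matthews correlation coefficient for labels encoded as 1 and 0."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike % correct, MCC stays informative when the two classes are unevenly represented, which is why both numbers are usually reported together.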
Results • Evaluation
NetSurfP
/usr/cbs/bio/src/NetSurfP/NetSurfP -h
NetSurfP
NetDiseaseSNP
• Disease-SNP prediction (Morten Bo Johansen)
• Without NetSurfP:
- Cross-validation: MCC = 0.569
- Cross-evaluation: MCC = 0.560
• With NetSurfP:
- Cross-validation: MCC = 0.583
- Cross-evaluation: MCC = 0.572
Paper is out... What then?
Statistics
• Submissions to the webserver from CBS website
33161 sequences submitted
First citation: 24 October 2009 :-)