Post on 21-Dec-2015
Peter Smooker, Heiko Schröder, Margaret Hamilton, Aditya, Mannan, Sundara, Saravanan, Rajalingam Aravinthan, Gad Abraham, Abdullah Al Amin, Nalinda, Prashant
A new approach to protein structure prediction
What’s on today?
• Predicting protein structures
• Fast implementation
• Special purpose HPC
• Searching for structural similarity
• Visualisation of proteins
Lots of speculation, some results!
Aim: Prediction of protein structures
Common methods:
• Homology modelling: a sequence match above 30% suggests a similar fold
• Molecular modelling: only for small molecules
• Crystallography: very expensive, very slow and not always possible
Only a few structures are known and we are falling behind (<1%). Major efforts are being made, e.g. Blue Gene, the fastest supercomputer (IBM).
Linear time method?
• Genetic sequence databases are growing exponentially (maybe not?)
• Growth rate will continue, since multiple concurrent genome projects have begun, with more to come
Motivation
[Figure: growth chart; surviving labels: 120%, 45%, 15%]
Full Genome Comparison
• Related organisms, but tuberculosis causes a disease: find the common and the different parts
• 16 × 10^6 pair-wise sequence comparisons (3918 × 4289)
• More clever ways? – I guess!
• Many Genome-Genome Comparisons will be required in the near future
3918 protein sequences, 1,329,298 amino acids
4289 protein sequences, 1,359,008 amino acids
Homology Modeling
• Discovered sequences are analyzed by comparison with databases
• Complexity of sequence comparison is proportional to the product of query size times database size
→ Analysis too slow on sequential computers
• Two possible approaches
– Heuristics, e.g. BLAST, FastA, but the more efficient the heuristics, the worse the quality of the results
– Parallel processing: get high-quality results in reasonable time
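To see why the product rule makes a full genome comparison expensive, the cell count can be worked out directly. This is a minimal sketch, assuming the two amino-acid totals quoted for the full genome comparison and a purely hypothetical throughput of 10^8 matrix cells per second (`CELLS_PER_SECOND` is not from the slides).

```python
# Dynamic-programming cost of comparing every protein of one genome with
# every protein of the other. Summing len(a) * len(b) over all pairs equals
# the product of the two amino-acid totals.
RESIDUES_GENOME_1 = 1_329_298  # amino acids in the 3918 sequences
RESIDUES_GENOME_2 = 1_359_008  # amino acids in the 4289 sequences
CELLS_PER_SECOND = 1e8         # hypothetical sequential throughput (assumption)

pairs = 3918 * 4289
cells = RESIDUES_GENOME_1 * RESIDUES_GENOME_2
hours = cells / CELLS_PER_SECOND / 3600

print(f"{pairs:,} sequence pairs")
print(f"{cells:.3e} matrix cells")
print(f"about {hours:.1f} hours at the assumed rate")
```

The cell count depends only on the two totals, not on how the amino acids are split into sequences, which is why the cost grows with the product of query size and database size.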
Protein Sequence Alignment
• BLAST, FastA, Smith-Waterman
GGHSRLILSQLGEEG.RLLAIDRDPQAIAVAKT....IDDPRFSII
|||::::| : |::| ||:::||||:|:|||:: ::| |::::
GGHAERFL.E.GLPGLRLIGLDRDPTALDVARSRLVRFAD.RLTLV
[Figure: search speed against data quality. BLAST: faster, lower quality; FastA: in between; Smith-Waterman: slower, higher quality]
T=O(|S|)
Smith-Waterman Algorithm
Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC

$$\mathrm{Sbt}(x, y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases} \qquad \alpha = 1,\ \beta = 1$$

$$H(i, j) = \max \begin{cases} 0 \\ H(i-1, j-1) + \mathrm{Sbt}(S1_i, S2_j) \\ H(i-1, j) - 1 \\ H(i, j-1) - 1 \end{cases}$$

[Figure: the filled matrix H, with S1 = A T C T C G T A T G A T G along the columns and S2 = G T C T A T C A C along the rows, and the resulting local alignment]
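The matrix fill can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: it assumes a substitution score of +2 for a match and -1 otherwise, and a linear gap cost of 1 per gap position. The function names `sbt` and `smith_waterman` are chosen here for illustration.

```python
def sbt(x, y):
    """Substitution score: +2 for a match, -1 otherwise."""
    return 2 if x == y else -1


def smith_waterman(s1, s2, gap=1):
    """Return (best score, matrix H) for a local alignment of s1 and s2.

    Assumption: a linear gap cost, every gap position costs `gap`.
    Rows of H follow s2, columns follow s1.
    """
    rows, cols = len(s2) + 1, len(s1) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(
                0,                                            # start a new alignment
                h[i - 1][j - 1] + sbt(s2[i - 1], s1[j - 1]),  # match or mismatch
                h[i - 1][j] - gap,                            # gap in s1
                h[i][j - 1] - gap,                            # gap in s2
            )
            best = max(best, h[i][j])
    return best, h


score, matrix = smith_waterman("ATCTCGTATGATG", "GTCTATCAC")
print("best local score:", score)
```

Every cell depends only on its upper, left and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time. That is the property a parallel implementation can exploit.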
Protein folding
Our approach:
• Linear method: we do not compute electromagnetic fields, nature has done it for us!
• Physical forces have short range (decreasing quadratically with the distance)
→ Context sensitivity: find the same protein with the same context in the database and copy that structure.
Dihedral Angles
• The 6 atoms in each peptide unit lie in the same plane
• φ and ψ are free to rotate
• The structure of a protein is almost totally determined if all angles φ and ψ are known
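A dihedral angle is the angle between two planes that share an edge, so it can be computed from four consecutive atom positions. For φ of residue i these atoms are C(i-1), N(i), Cα(i), C(i); for ψ they are N(i), Cα(i), C(i), N(i+1). The following is a minimal sketch using plain tuples for coordinates. The helper names are chosen here for illustration, and the sign of the result depends on the handedness convention.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees around the bond p1-p2."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / length, b1[1] / length, b1[2] / length)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    s0, s2 = _dot(b0, b1), _dot(b2, b1)
    v = (b0[0] - s0 * b1[0], b0[1] - s0 * b1[1], b0[2] - s0 * b1[2])
    w = (b2[0] - s2 * b1[0], b2[1] - s2 * b1[1], b2[2] - s2 * b1[2])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Four points in one plane with both end atoms on the same side (cis).
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
```

Using `atan2` instead of `acos` keeps the full range from -180 to 180 degrees, which matters when angles are later compared or plotted in a Ramachandran plot.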
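Ramachandran Plots: # choices
ALA ARG ASN GLN CYS
GLY HIS ASP GLU LYS
PRO ILE LEU PHE MET
SER THR TRP VAL TYR
[Figure: one Ramachandran plot per amino acid, annotated with the number of choices (peaks); surviving annotations: 3, 4?, 3, 5, 2, 2, 2, 2]
Abdullah Al Amin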
Complexity: reducing the size of the search space
Reducing the number of peaks.
2^X: size of the search space
2^(X−Y): assuming we have predicted Y angles with high confidence
Our aim: large Y (Y = X is not possible)
Method: increase the context
Problem: the longer the context, the fewer the matches
Example: there are 20^k different sequences of length k, so E_k = |PDB| / 20^k.
k = 3: E_3 = 1000. k = 5: E_5 = 3. k = 9: E_9 = 1/50000.
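The trade-off between context length and number of matches is easy to tabulate. This is a minimal sketch: the database size `PDB_RESIDUES` is an assumption, chosen as 8 × 10^6 only because that value reproduces E_3 = 1000 from the example, and the function names are illustrative.

```python
PDB_RESIDUES = 8_000_000  # assumption: gives E_3 = 1000 as in the example


def expected_matches(k, database_size=PDB_RESIDUES):
    """Expected number of database hits for a context of k amino acids."""
    return database_size / 20 ** k


def search_space(x, y):
    """Candidates left when y of x angles are already predicted with confidence.

    Follows the 2^(X-Y) estimate, i.e. two choices per remaining angle.
    """
    return 2 ** (x - y)


for k in (3, 5, 9):
    print(f"k={k}: E_k = {expected_matches(k):.6g}")
print("search space for X=10, Y=4:", search_space(10, 4))
```

Each extra amino acid in the context divides the expected number of matches by 20, so an exact context much longer than five residues rarely occurs in the database at all.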
Reduce the size of the search space!
Hydrophobic (O), hydrophilic (I)
• I I O ALA LYS SER O O I (E = 20): reduce number of peaks
• Different lists for different groups of proteins (inside cells, outside cells)? (Saravanan): reduce number of peaks
• Short and perfect to longer and less perfect? (Rajalingam Aravinthan, Gad Abraham): reduce number of peaks
Which context??
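A context such as "I I O ALA LYS SER O O I" keeps three amino acids exactly and reduces three neighbours on each side to their O/I class. The following minimal sketch builds such a key. The membership of `HYDROPHOBIC` is an assumption made here for illustration (the slides do not list the classes), and `context_key` and `number_of_keys` are hypothetical names.

```python
# Assumption: one possible hydrophobic set; the slides do not list the classes.
HYDROPHOBIC = frozenset(
    {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO", "CYS"}
)


def residue_class(residue):
    """Map a three-letter residue name to O (hydrophobic) or I (hydrophilic)."""
    return "O" if residue in HYDROPHOBIC else "I"


def context_key(residues, start, exact=3, flank=3):
    """Key for the `exact` residues at `start` plus O/I classes of the flanks.

    Returns None when the window does not fit inside the chain.
    """
    lo, hi = start - flank, start + exact + flank
    if lo < 0 or hi > len(residues):
        return None
    left = [residue_class(r) for r in residues[lo:start]]
    middle = list(residues[start:start + exact])
    right = [residue_class(r) for r in residues[start + exact:hi]]
    return " ".join(left + middle + right)


def number_of_keys(exact=3, flank=3):
    """Number of distinct keys: 2 classes per flank position, 20 per exact one."""
    return 2 ** (2 * flank) * 20 ** exact


chain = ["SER", "ASP", "LEU", "ALA", "LYS", "SER", "VAL", "ILE", "GLN"]
print(context_key(chain, 3))
print(number_of_keys())
```

With six two-way positions and three exact ones there are 2^6 × 20^3 keys, far fewer than the 20^9 keys of a fully exact context of the same length, which is what keeps the expected number of matches usable.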
Suffix trie for abcacbcabacb (all suffixes up to length 4).
[Figure: the suffix trie, with edges labelled a, b and c]
Find all strings that are similar to aacb (tolerance 1).
Breadth first search!
[Figure: the trie nodes annotated with the number of mismatches (0 or 1) accumulated along each path]
Suffix trie and suffix tree – fast search!
Prashant
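The trie search can be sketched as follows. This is a minimal illustration that assumes "similar" means at most `tolerance` mismatching positions (Hamming distance). The trie is a plain nested dictionary, and the function names are chosen here for illustration.

```python
from collections import deque


def build_suffix_trie(text, depth):
    """Insert every suffix of `text`, truncated to `depth` characters."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + depth]:
            node = node.setdefault(ch, {})
    return root


def similar_strings(trie, query, tolerance):
    """Breadth-first search for paths of len(query) within `tolerance` mismatches."""
    matches = []
    queue = deque([(trie, "", 0)])
    while queue:
        node, prefix, mismatches = queue.popleft()
        if len(prefix) == len(query):
            matches.append(prefix)
            continue
        for ch, child in node.items():
            cost = mismatches + (ch != query[len(prefix)])
            if cost <= tolerance:  # prune branches that already exceed the tolerance
                queue.append((child, prefix + ch, cost))
    return sorted(matches)


trie = build_suffix_trie("abcacbcabacb", 4)
print(similar_strings(trie, "aacb", 1))  # ['bacb', 'cacb']
```

The search visits each trie branch at most once and drops it as soon as the mismatch count exceeds the tolerance, so its cost depends on the size of the trie near the query rather than on the length of the text.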
Parallel Architectures for Bioinformatics
• Embedded Massively Parallel Accelerators
– Systola 1024: PC add-on board with 1024 processors (ISATEC, Germany)
– Fuzion 150: 1536 processors on a single chip (Clearspeed Technology, UK)
– FPGA?
Parallel Architectures for Bioinformatics
[Figure: 16 Systola 1024 units connected by a high speed Myrinet switch]
– Supercomputer performance at low cost
– Combines the SIMD and MIMD paradigms within one parallel architecture: a hybrid computer
Speculation: finding similar structures based on sequences of φs and ψs.
We could search for a structure that has a high degree of similarity with a predicted structure (instead of similarity of the sequence, particularly in hydrophobic parts).
Modify Smith-Waterman: what should be the penalty for gaps (do gaps make any sense?), and how do we treat confidence information?
Smith-Waterman Algorithm
Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC

$$\mathrm{Sbt}(x, y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases} \qquad \alpha = 1,\ \beta = 1$$

$$H(i, j) = \max \begin{cases} 0 \\ H(i-1, j-1) + \mathrm{Sbt}(S1_i, S2_j) \\ H(i-1, j) - 1 \\ H(i, j-1) - 1 \end{cases}$$

[Figure: the filled matrix H, with S1 = A T C T C G T A T G A T G along the columns and S2 = G T C T A T C A C along the rows]
H function: ???
[Figure: score (from -10 to 20) against degrees difference (from 0 to 60), with one curve each for the score of 2nd power, 3rd power and 4th power]
Nalinda
$$\text{Score} = \frac{1500}{50 + (|a_i - a_j| \times 0.9)^2} - 10$$

α = β = −10
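The score formula can be evaluated directly to see how quickly it falls off. This is a minimal sketch of the formula with the squared difference. The function name `angle_score` is chosen here for illustration, and the angles are taken in degrees.

```python
def angle_score(a_i, a_j):
    """Score two dihedral angles (degrees): 20 for identical, towards -10 for distant."""
    diff = abs(a_i - a_j)
    return 1500.0 / (50.0 + (diff * 0.9) ** 2) - 10.0


for diff in (0, 5, 10, 20, 30, 60):
    print(f"{diff:2d} degrees difference: {angle_score(0, diff):7.3f}")
```

Identical angles score 1500/50 − 10 = 20. The score reaches zero when (0.9 × difference)² = 100, at a difference of about 11.1 degrees, and approaches −10 for large differences.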
[Figure: comparison of three series on an axis from 0 to 25: score from CE @ PDB, sequence similarity (scaled down by a factor of 5), and normalised score]
Nalinda
Visualisation tool
• Sequence of dihedral angles
• Structure of protein
• Visualise structure
• Indicate confidence
• Translate change of dihedral angle into change of 3D structure
• Emphasise physical collisions
• Show positions for potential S-S bonds and hydrogen bonds
• Show fields?
Speculation: simulation of the folding process
• Predict the structure of the following hydrophobic subsequence (it needs to be tested whether hydrophobicity is highly correlated with being “inside a protein”).
• Mark all positions of cysteines
• Mark all positions of potential hydrogen bonds
• Simulate the bending process
• Look for similar structures “up to here similar”
• Compare structures of identical O/I sequences
• Compare surfaces (cut the protein at a hydrophilic position and look at the set of exposed hydrophobic amino acids)
• Develop an algorithm to determine structural similarity, either based on dihedral angles or on Euclidean positions, using dynamic programming.
• With such an algorithm similar “surroundings” can be found.
• Do new parts deform old parts significantly?
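One way such a dynamic-programming comparison on dihedral angles could look is a Smith-Waterman recurrence in which the substitution score is replaced by an angle score. This is a speculative sketch under several assumptions that are not settled in the slides: each residue is a (φ, ψ) pair in degrees, the pair score is the average of the two angle scores, angle differences wrap around at 360 degrees, and a gap costs 10. All function names are hypothetical.

```python
def angle_score(a, b):
    """Angle score with wrap-around (assumption): 20 for identical angles."""
    diff = abs(a - b) % 360.0
    diff = min(diff, 360.0 - diff)
    return 1500.0 / (50.0 + (diff * 0.9) ** 2) - 10.0


def pair_score(p, q):
    """Score two (phi, psi) pairs as the average of both angle scores (assumption)."""
    return (angle_score(p[0], q[0]) + angle_score(p[1], q[1])) / 2.0


def local_structure_score(xs, ys, gap=10.0):
    """Best local alignment score of two (phi, psi) sequences.

    Smith-Waterman recurrence with a linear gap cost (assumption).
    """
    previous = [0.0] * (len(ys) + 1)
    best = 0.0
    for x in xs:
        current = [0.0]
        for j, y in enumerate(ys, start=1):
            cell = max(
                0.0,
                previous[j - 1] + pair_score(x, y),  # align x with y
                previous[j] - gap,                   # skip x
                current[j - 1] - gap,                # skip y
            )
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


helix = [(-60.0, -45.0)] * 5
print(local_structure_score(helix, helix))
```

Only two rows of the matrix are kept because the score alone is needed. Recovering the aligned residues would require the full matrix and a traceback, and confidence information could enter as a weight on `pair_score`.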