
Peter Smooker, Heiko Schröder, Margaret Hamilton, Aditya, Mannan, Sundara, Saravanan, Rajalingam Aravinthan, Gad Abraham, Abdullah Al Amin, Nalinda, Prashant

A new approach to protein structure prediction


What’s on today?

• Predicting protein structures

• Fast implementation

• Special purpose HPC

• Searching for structural similarity

• Visualisation of proteins

Lots of speculation, some results!

Aim: Prediction of protein structures

Common methods:

• Homology modelling: a sequence match above 30% usually means a similar fold

• Molecular modelling: only feasible for small molecules

• Crystallography: very expensive, very slow and not always possible

Only a few structures are known and we are falling behind (<1%).

Major efforts are being made, e.g. Blue Gene (fastest supercomputer, IBM).

Linear time method?

• Genetic sequence databases are growing exponentially (maybe not?)

• Growth rate will continue, since multiple concurrent genome projects have begun, with more to come

Motivation

[Figure: chart with the values 120%, 45% and 15%]

Full Genome Comparison

• Related organisms, but Tuberculosis causes a disease → find common and different parts

• 16 × 10^6 pair-wise sequence comparisons

• More clever ways? – I guess!

• Many Genome-Genome Comparisons will be required in the near future

3918 protein sequences, 1,329,298 amino acids

4289 protein sequences, 1,359,008 amino acids
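As a rough check of the numbers on this slide, the sketch below multiplies the two protein counts to get the number of pair-wise comparisons. It also estimates the total work, assuming each comparison fills a dynamic-programming matrix whose size is the product of the two sequence lengths. The variable names are illustrative only.

```python
# Sizes read off the slide; the slide does not say which genome is which.
proteins_a, residues_a = 3918, 1_329_298
proteins_b, residues_b = 4289, 1_359_008

# Every protein of one genome is compared with every protein of the other.
pairwise_comparisons = proteins_a * proteins_b

# Assumption: each comparison fills a len(a) x len(b) matrix, so the total
# number of cells is the product of the two amino-acid totals.
dp_cells = residues_a * residues_b

print(f"{pairwise_comparisons:,} pair-wise comparisons")
print(f"{dp_cells:.2e} matrix cells in total")
```

The first figure comes out at about 16.8 million, which matches the 16 × 10^6 order of magnitude quoted on the slide.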

Homology Modeling

• Discovered sequences are analyzed by comparison with databases

• Complexity of sequence comparison is proportional to the product of query size times database size

→ Analysis too slow on sequential computers

• Two possible approaches

– Heuristics, e.g. BLAST, FastA; but the more efficient the heuristics, the worse the quality of the results

– Parallel processing: get high-quality results in reasonable time

Protein Sequence Alignment

• BLAST, FastA, Smith-Waterman

GGHSRLILSQLGEEG.RLLAIDRDPQAIAVAKT....IDDPRFSII
|||::::| : |::| ||:::||||:|:|||::    ::| |::::
GGHAERFL.E.GLPGLRLIGLDRDPTALDVARSRLVRFAD.RLTLV

[Figure: trade-off between search speed and data quality. BLAST is faster with lower quality, Smith-Waterman is slower with higher quality, FastA lies in between.]

T = O(|S|)

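Smith-Waterman Algorithm

Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC

Substitution function:

\[ Sbt(x, y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases} \]

Gap penalties: both set to 1.

Recurrence:

\[ H(i,j) = \max \{\, 0,\; H(i-1,j) - 1,\; H(i,j-1) - 1,\; H(i-1,j-1) + Sbt(S1_i, S2_j) \,\} \]

Matrix H (S1 across, S2 down):

       A  T  C  T  C  G  T  A  T  G  A  T  G
    0  0  0  0  0  0  0  0  0  0  0  0  0  0
 G  0  0  0  0  0  0  2  1  0  0  2  1  0  2
 T  0  0  2  1  2  1  1  4  3  2  1  1  3  2
 C  0  0  1  4  3  4  3  3  3  2  1  0  2  2
 T  0  0  2  3  6  5  4  5  4  5  4  3  2  1
 A  0  2  1  2  5  5  4  4  7  6  5  6  5  4
 T  0  1  4  3  4  4  4  6  6  9  8  7  8  7
 C  0  0  3  6  5  6  5  5  5  8  8  7  7  7
 A  0  2  2  5  5  5  5  4  7  7  7 10  9  8
 C  0  1  1  4  4  7  6  5  6  6  6  9  9  8

Traceback from the highest score: 10, 8, 9, 7, 5, 3, 4, 2, 0

A T C T C G T A T G A T G
    G T C - T A T C A C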
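The recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming a linear gap penalty of 1 and the +2 / -1 substitution scores from the slide. The function name, parameter names and traceback tie-breaking (diagonal first) are choices made here, not part of the original.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (best score, aligned part of s1, aligned part of s2)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix with the local-alignment recurrence.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align s1[i-1] with s2[j-1]
                          H[i - 1][j] - gap,      # gap in s2
                          H[i][j - 1] - gap)      # gap in s1
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest score until a zero cell is reached.
    i, j = best_pos
    a1, a2 = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


result = smith_waterman("ATCTCGTATGATG", "GTCTATCAC")
print(result)
```

For the two sequences of the slide this yields a best score of 10 with TCGTATGA aligned against TC-TATCA, one gap and one mismatch.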

Context sensitivity!

Protein folding

Our approach:

• Linear method: we do not compute electromagnetic fields, nature has done it for us!

• Physical forces have short range (decreasing quadratically with the distance)

→ Context sensitivity: find the same protein with the same context in the database and copy that structure.
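The "look it up and copy" idea can be pictured as an index from short residue contexts to the angle pairs observed at their centre. The sketch below is only an illustration under stated assumptions: the function name, the input layout and the odd context length are invented here, and the angle values in the toy input are placeholders, not measurements.

```python
from collections import defaultdict


def build_context_index(chains, k=3):
    """Map each length-k residue context (k odd) to the (phi, psi) pairs
    observed at its central residue."""
    half = k // 2
    index = defaultdict(list)
    for residues, angles in chains:
        for centre in range(half, len(residues) - half):
            context = tuple(residues[centre - half:centre + half + 1])
            index[context].append(angles[centre])
    return index


# Toy input: one chain with placeholder angles (NOT measured values).
toy_chains = [
    (["GLU", "CYS", "SER", "GLU", "CYS", "ALA"],
     [(0.0, 0.0), (-60.0, -45.0), (0.0, 0.0),
      (0.0, 0.0), (-120.0, 130.0), (0.0, 0.0)]),
]

index = build_context_index(toy_chains, k=3)
print(index[("GLU", "CYS", "SER")])
```

A prediction step would then look up the context of each residue of a new sequence and reuse the angles stored for it, if any.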

Dihedral Angles

• The 6 atoms in each peptide unit lie in the same plane

• φ and ψ are free to rotate

• The structure of a protein is almost totally determined if all angles φ and ψ are known

Abdullah Al Amin
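For readers who want to see how such an angle is obtained from coordinates, here is a minimal sketch of the standard four-point dihedral formula. For φ the four points would be the backbone atoms C(i-1), N, Cα, C, and for ψ they would be N, Cα, C, N(i+1). The helper names are made up for this example and the points in the usage lines are arbitrary test coordinates, not protein data.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed angle between the planes (p0, p1, p2) and (p1, p2, p3)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


cis = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, quarter, trans)
```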

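Ramachandran Plots, # choices

ALA ARG ASN GLN CYS
GLY HIS ASP GLU LYS
PRO ILE LEU PHE MET
SER THR TRP VAL TYR

[Figure: one Ramachandran plot per amino acid; some plots are annotated with the number of choices: 3, 4?, 3, 5, 2, 2, 2, 2]

Abdullah Al Amin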

[Figure: Ramachandran plots for val-’val-ile, val-’val-val and val-’val-asn, and the sum over all third residues, Σ val-’val-xxx]

Which φ?

Abdullah Al Amin

φ

val-val-ala

φ → same AA

φ → neighbour

Abdullah Al Amin

GLU-CYS’-SER GLU-’CYS-SER φ

GLU-CYS’-ALA GLU-’CYS-ALA φ

confidence

# peaks?

Abdullah Al Amin

Complexity – Reducing the size of the search space

Reducing the number of peaks.

$2^X$: size of the search space

$2^{X-Y}$: assuming we have predicted Y angles with high confidence

Our aim: large Y (Y = X is not possible)

Method: increase the context

Problem: the longer the context, the fewer the matches

Example: there are $20^k$ different sequences of length k, so $E_k = |PDB| / 20^k$. For k=3, $E_3 = 1000$. For k=5, $E_5 = 3$. For k=9, $E_9 = 1/50000$.
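The arithmetic of this slide can be replayed with the short sketch below. The database size of 8 million residues is an assumption, chosen because it is what E_3 = 1000 implies. The function names are illustrative only.

```python
def expected_matches(db_residues, k, alphabet_size=20):
    """Expected number of database hits for one context of length k,
    assuming all alphabet_size**k contexts are equally likely."""
    return db_residues / alphabet_size ** k


def search_space(x, y=0, choices=2):
    """Size of the search space for x angles with `choices` peaks each,
    after y of them have been fixed with high confidence."""
    return choices ** (x - y)


PDB_RESIDUES = 8_000_000  # assumption: implied by E_3 = 1000 on the slide

for k in (3, 5, 9):
    print(k, expected_matches(PDB_RESIDUES, k))

print(search_space(10), search_space(10, 4))
```

With this assumed database size the values for k=5 and k=9 come out at 2.5 and roughly 1/64000, the same order of magnitude as the rounded figures on the slide.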

Reduce the size of the search space!

Hydrophobic (O), hydrophilic (I)

• I I O ALA LYS SER O O I (E=20) → reduce number of peaks

• Different lists for different groups of proteins (inside cells, outside cells)? (Saravanan) → reduce number of peaks

• Short and perfect to longer and less perfect? (Rajalingam Aravinthan, Gad Abraham) → reduce number of peaks

Which context??

[Figure: labels 13, 7, 9, 3]

Rajalingam Aravinthan, Gad Abraham

Prediction based on length 3

φ -Helix

Abdullah Al Amin

Why 9?

Suffix trie for abcacbcabacb (all suffixes up to length 4).

[Figure: the suffix trie, with edges labelled a, b and c, truncated at depth 4]

Find all strings that are similar to aacb (tolerance 1).

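Breadth first search!

[Figure: the trie is searched level by level for a, a, c, b; each visited node is annotated with its mismatch count so far, 0 or 1]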

Suffix trie and suffix tree – fast search!


Prashant
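The trie example from these slides can be reproduced with the sketch below. It builds a depth-limited suffix trie as nested dictionaries and walks it breadth first, dropping any branch that exceeds the tolerance. Reading "tolerance 1" as "at most one mismatching position" (Hamming distance) is an assumption, and the function names are invented for this example.

```python
from collections import deque


def build_suffix_trie(text, depth):
    """Insert every suffix of `text`, truncated to `depth` characters."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + depth]:
            node = node.setdefault(ch, {})
    return root


def search_with_tolerance(trie, query, tolerance):
    """Breadth-first search for stored strings of len(query) that differ
    from `query` in at most `tolerance` positions."""
    hits = []
    queue = deque([(trie, "", 0)])
    while queue:
        node, prefix, mismatches = queue.popleft()
        if len(prefix) == len(query):
            hits.append(prefix)
            continue
        wanted = query[len(prefix)]
        for ch, child in node.items():
            cost = mismatches + (ch != wanted)
            if cost <= tolerance:          # prune branches that are too far off
                queue.append((child, prefix + ch, cost))
    return hits


trie = build_suffix_trie("abcacbcabacb", depth=4)
matches = sorted(search_with_tolerance(trie, "aacb", tolerance=1))
print(matches)
```

For the string on the slide the search returns bacb and cacb. An exact search (tolerance 0) returns nothing, because aacb itself does not occur.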

Parallel Architectures for Bioinformatics

• Embedded Massively Parallel Accelerators

– Systola 1024: PC add-on board with 1024 processors (ISATEC, Germany)

– Fuzion 150: 1536 processors on a single chip (Clearspeed Technology, UK)

– FPGA?

Parallel Architectures for Bioinformatics

Hybrid Computer

[Figure: 16 Systola 1024 boards connected by a high speed Myrinet switch]

– Supercomputer performance at low cost

– Combines the SIMD and MIMD paradigms within one parallel architecture

Speculation: finding similar structures based on sequences of φs and ψs.

We could search for a structure that has a high degree of similarity with a predicted structure (instead of similarity of the sequence), particularly in hydrophobic parts.

Modify Smith-Waterman: What should be the penalty for gaps (do gaps make any sense)? How do we treat confidence information?

Smith-Waterman Algorithm

Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC

[Figure: the same H matrix, substitution function Sbt and recurrence for H(i,j) as in the earlier Smith-Waterman example, annotated "function ???"]

[Figure: score (-10 to 20) against degrees difference (0 to 60) for the score of 2nd power, the score of 3rd power and the score of 4th power]

Nalinda

\[ \text{Score} = \frac{1500}{50 + (|a_i - a_j| \times 0.9)^2} - 10 \]

Gap penalties: both -10
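The scoring formula above is easy to try out. The sketch below evaluates it for a pair of angles given in degrees. The function name is illustrative, and wrap-around at 360 degrees is deliberately not handled, since the slide does not say how it should be treated.

```python
def angle_score(a_i, a_j):
    """Substitution score for two dihedral angles in degrees, using the
    2nd-power formula from the slide (no wrap-around handling)."""
    diff = abs(a_i - a_j)
    return 1500.0 / (50.0 + (diff * 0.9) ** 2) - 10.0


for diff in (0, 5, 10, 20, 60):
    print(diff, round(angle_score(0.0, float(diff)), 2))
```

Identical angles score 20, the score turns negative between 11 and 12 degrees of difference, and it approaches -10 for large differences.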

[Figure: normalised score (0 to 25) comparing the score from CE @ PDB with sequence similarity (scaled down by a factor of 5)]

Nalinda

Look ahead

Visualisation tool

• Sequence of dihedral angles

• Structure of protein

• Visualise structure

• Indicate confidence

• Translate change of dihedral angle into change of 3D-structure

• Emphasise physical collisions

• Show positions for potential S-S bonds and hydrogen bonds

• Show fields?

Speculation: simulation of the folding process

• Predict the structure of the following hydrophobic subsequence (needs to be tested whether hydrophobicity is highly correlated with being “inside a protein”)

• Mark all positions of cysteines

• Mark all positions of potential hydrogen bonds

• Simulate the bending process

• Look for similar structures (“up to here similar”)

• Compare structures of identical O/I sequences

• Compare surfaces (cut the protein at a hydrophilic position and look at the set of exposed hydrophobic amino acids)

• Develop an algorithm to determine structural similarity, based either on dihedral angles or on Euclidean positions, using dynamic programming

• With such an algorithm similar “surroundings” can be found

• Do new parts deform old parts significantly?


Thank you !
