neural networks for protein structure prediction brown, jmb 1999 cs 466 saurabh sinha


Neural Networks for Protein Structure Prediction

Brown, JMB 1999

CS 466

Saurabh Sinha


Outline

• Goal is to predict “secondary structure” of a protein from its sequence

• Artificial Neural Network used for this task

• Evaluation of prediction accuracy


What is Protein Structure?

http://academic.brooklyn.cuny.edu/biology/bio4fv/page/3d_prot.htm

http://matcmadison.edu/biotech/resources/proteins/labManual/images/220_04_114.png


Protein Structure

• An amino acid sequence “folds” into a complex 3-D structure

• Finding out this 3-D structure is a crucial and challenging task

• Experimental methods (e.g., X-ray crystallography) are very tedious

• Computational predictions are a possibility, but very difficult


What is “secondary structure”?


http://www.wiley.com/college/pratt/0471393878/student/structure/secondary_structure/secondary_structure.gif

“Strand” “Helix”


http://www.npaci.edu/features/00/Mar/protein.jpg

“Strand”

“Helix”


Secondary structure prediction

• Well, the whole 3-D “tertiary” protein structure may be hard to predict from sequence

• But can we at least predict the secondary structural elements such as “strand”, “helix” or “coil”?

• This is what this paper does

• ... and so do many other papers (it is a hard problem!)

A survey of structure prediction

• The most reliable technique is “comparative modeling”

  – Find a protein P whose amino acid sequence is very similar to your “target” protein T

  – Hope that this other protein P has a known structure

  – Predict a structure for T similar to that of P, after carefully considering how the sequences of P and T differ

A survey of structure prediction

• Comparative modeling fails if we don’t have a suitable homologous “template” protein P for our protein T

• “Ab initio” tertiary methods attempt to predict the structure without using a known protein structure as a template

  – Incorporate basic physical and chemical principles into the structure calculation

  – Gets very hairy, and is highly computationally intensive

• The other option is prediction of secondary structure only (i.e., making the goal more modest)

  – These predictions may be used to provide constraints for tertiary structure prediction


Secondary structure prediction

• Early methods were based on stereochemical principles

• Later methods realized that we can do better if we use not only the one sequence T (our sequence), but also a family of “related sequences”

• Search for sequences similar to T, build a multiple alignment of these, and predict secondary structure from the multiple alignment of sequences

What’s multiple alignment doing here?

• Most conserved regions of a protein sequence are either functionally important or buried in the protein “core”

• More variable regions are usually on the surface of the protein

  – there are few constraints on what type of amino acids have to be here (apart from a bias towards hydrophilic residues)

• Multiple alignment tells us which portions are conserved and which are not
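
As a rough illustration of how a multiple alignment exposes conserved and variable positions, the sketch below scores each column by the fraction of sequences that share its most common residue. The toy alignment and the function name `column_conservation` are made up for this example; real methods use more refined conservation scores.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    scores = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / n_seqs)
    return scores


# Hypothetical toy alignment of four equal-length sequences (no gaps).
toy_alignment = ["MKVLA", "MKILS", "MRVLT", "MKVLG"]
conservation = column_conservation(toy_alignment)
```

Columns with a score near 1 are conserved (candidates for the core or for functional sites); columns with a low score are variable.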


http://bio.nagaokaut.ac.jp/~mbp-lab/img/hpc.png

hydrophobic core

What’s multiple alignment doing here?

• Therefore, by looking at multiple alignment, we could predict which residues are in the core of the protein and which are on the surface (“solvent accessibility”)

• Secondary structure is then predicted by comparing the accessibility patterns associated with helices, strands, etc.

• This approach (Benner & Gerloff) is mostly manual

• Today’s paper suggests an automated method

The PSIPRED algorithm

• Given an amino-acid sequence, predict secondary structure elements in the protein

• Three stages:

  1. Generation of a sequence profile (the “multiple alignment” step)

  2. Prediction of an initial secondary structure (the neural network step)

  3. Filtering of the predicted structure (another neural network step)

Generation of sequence profile

• A BLAST-like program called “PSI-BLAST” is used for this step

• We saw BLAST earlier: it is a fast way to find high-scoring local alignments

• PSI-BLAST is an iterative approach

  – an initial scan of a protein database using the target sequence T

  – align all matching sequences to construct a “sequence profile”

  – scan the database using this new profile

• Can also pick out and align distantly related protein sequences for our target sequence T


The sequence profile looks like this

• Has 20 x M numbers

• The numbers are the log likelihood of each residue at each position
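
To make the shape of such a profile concrete, here is a minimal sketch that turns a toy alignment into M rows of 20 log-likelihoods. It is not PSI-BLAST's actual scoring scheme: the add-one pseudocount, the residue ordering in `AMINO_ACIDS`, and the name `build_profile` are assumptions made only for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is arbitrary)


def build_profile(alignment, pseudocount=1.0):
    """Return M rows (one per alignment column) of 20 log-likelihoods each."""
    n_seqs = len(alignment)
    profile = []
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        row = []
        for aa in AMINO_ACIDS:
            # Smoothed frequency of this residue in this column.
            freq = (column.count(aa) + pseudocount) / (n_seqs + 20 * pseudocount)
            row.append(math.log(freq))
        profile.append(row)
    return profile


# Hypothetical gap-free toy alignment.
toy_alignment = ["MKVLA", "MKILS", "MRVLT", "MKVLG"]
profile = build_profile(toy_alignment)
```

Each row has one number per amino acid, so a sequence of length M yields 20 x M numbers in total.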


Preparing for the second step

• Feed the sequence profile to an artificial neural network

• But before feeding, do a simple “scaling” to bring the numbers to a 0-1 scale

$$x \to \frac{1}{1 + e^{-x}}$$
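
The scaling step can be sketched in a few lines of Python. The function names and the sample row are hypothetical; the formula is the logistic function shown above.

```python
import math


def logistic(x):
    """Map a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def scale_profile(profile):
    """Apply the logistic scaling to every number in a profile (list of rows)."""
    return [[logistic(v) for v in row] for row in profile]


# Hypothetical raw profile values; real profile scores are small in magnitude,
# so math.exp does not overflow here.
raw_rows = [[-3.0, 0.0, 4.0]]
scaled = scale_profile(raw_rows)
```

A raw score of 0 maps to 0.5, negative scores map below 0.5, and positive scores map above it.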

Intro to Neural nets (the second and third steps of PSIPRED)

Artificial Neural Network

• Supervised learning algorithm

• Training examples. Each example has a label

  – “class” of the example, e.g., “positive” or “negative”

  – “helix”, “strand”, or “coil”

• Learns how to predict the class of an example


Artificial Neural Network

• Directed graph

• Nodes or “units” or “neurons”

• Edges between units

• Each edge has a weight (not known a priori)

Layered Architecture

Input here is a four-dimensional vector. Each dimension goes into one input unit

http://www.akri.org/cognition/images/annet2.gif

Layered Architecture

http://www.geocomputation.org/2000/GC016/GC016_01.GIF

(units)

What a unit (neuron) does

• Unit i receives a total input x_i from the units connected to it, and produces an output y_i = f_i(x_i), where f_i() is the “transfer function” of unit i

$$x_i = \sum_{j \in N - \{i\}} w_{ij} y_j + w_i$$

$$y_i = f_i(x_i) = f_i\left( \sum_{j \in N - \{i\}} w_{ij} y_j + w_i \right)$$

w_i is called the “bias” of the unit
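
A single unit's computation can be written directly from the formula above. This is only a sketch: the sigmoid is used as a default transfer function for illustration, and the input values and weights are made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def unit_output(inputs, weights, bias, transfer=sigmoid):
    """Compute y_i = f_i(sum_j w_ij * y_j + w_i) for one unit."""
    total_input = sum(w * y for w, y in zip(weights, inputs)) + bias
    return transfer(total_input)


# Hypothetical unit with two incoming edges.
y = unit_output(inputs=[1.0, 0.5], weights=[0.2, -0.4], bias=0.0)

# The same unit with a linear transfer function returns the total input itself.
y_linear = unit_output([1.0, 2.0], [3.0, 4.0], 5.0, transfer=lambda x: x)
```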

Weights, bias and transfer function

Unit takes n inputs

Each input edge has weight w_i

Bias b

Output a

Transfer function f(): linear, sigmoidal, or other

Weights, bias and transfer function

• Weights w_ij and bias w_i of each unit are “parameters” of the ANN.

  – Parameter values are learned from input data

• Transfer function is usually the same for every unit in the same layer

• Graphical architecture (connectivity) is decided by you.

  – Could use fully connected architecture: all units in one layer connect to all units in “next” layer
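
A fully connected, layered network is just this unit computation repeated layer by layer. The sketch below wires up a small 4-input, 2-hidden-unit, 3-output network; all weights and biases are arbitrary values chosen for illustration, not learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases, transfer=sigmoid):
    """Fully connected layer: weights[i][j] is the weight from input j to unit i."""
    return [
        transfer(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical parameters for a 4 -> 2 -> 3 network.
hidden_weights = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.5, -0.5, -0.5]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
output_biases = [0.0, 0.0, -0.5]

x = [1.0, 0.0, 0.5, 0.25]  # a four-dimensional input vector
hidden = layer_forward(x, hidden_weights, hidden_biases)
outputs = layer_forward(hidden, output_weights, output_biases)
```

The same transfer function is applied to every unit in a layer, matching the usual convention described above.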

Where’s the algorithm?

• It’s in the training of parameters!

• Given several examples and their labels: the training data

• Search for parameter values such that output units make correct predictions on the training examples

• “Back-propagation” algorithm

  – Read up more on neural nets if you are interested
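
To give a feel for what "searching for parameter values" means, here is a minimal sketch that trains one sigmoid unit by gradient descent on a toy problem (logical OR). It is not full back-propagation, which additionally passes error signals backwards through hidden layers; the cross-entropy error, learning rate, and number of epochs are assumptions chosen for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_unit(examples, labels, learning_rate=0.5, epochs=2000):
    """Fit the weights and bias of a single sigmoid unit by gradient descent."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(examples, labels):
            y = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Gradient of the cross-entropy error with respect to the total input.
            delta = y - target
            weights = [w - learning_rate * delta * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * delta
    return weights, bias


# Toy training data: the label is 1 if either input is 1 (logical OR).
examples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0.0, 1.0, 1.0, 1.0]

weights, bias = train_unit(examples, labels)
predictions = [
    1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) > 0.5 else 0
    for x in examples
]
```

After training, the unit's thresholded outputs reproduce the labels of the training examples.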


Back to PSIPRED …

Step 2

• Feed the sequence profile to the input layer of an ANN

• Not the whole profile, only a window of 15 consecutive positions

• For each position, there are 20 numbers in the profile (one for each amino acid)

• Therefore, ~ 15 x 20 = 300 numbers fed

• Therefore, ~ 300 “input units” in ANN

• 3 output units, for “strand”, “helix”, “coil”

  – each number is confidence in that secondary structure for the central position in the window of 15
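
The windowing can be sketched as follows. The function name `profile_window` and the zero-padding at the sequence ends are assumptions for illustration; the slides do not say how positions near the ends are handled.

```python
def profile_window(profile, center, window_size=15):
    """Flatten the window of profile rows centered on `center` into one vector."""
    half = window_size // 2
    n_features = len(profile[0])
    vector = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            vector.extend(profile[pos])
        else:
            # Assumption: positions beyond either end of the sequence are zeros.
            vector.extend([0.0] * n_features)
    return vector


# Hypothetical scaled profile: 30 positions, 20 numbers per position.
toy_profile = [[0.5] * 20 for _ in range(30)]
inputs_at_start = profile_window(toy_profile, center=0)
inputs_in_middle = profile_window(toy_profile, center=15)
```

Each call produces 15 x 20 = 300 numbers, one per input unit, and the network's three outputs refer to the central position of that window.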

[Figure: a window of 15 profile positions feeds the input layer, followed by a hidden layer and three output units labeled helix, strand, and coil; example output values shown: 0.18, 0.09, 0.67]

Step 3

• Feed the output of 1st ANN to the 2nd ANN

• Each window of 15 positions gave 3 numbers from the 1st ANN

• Take 15 successive windows’ outputs and feed them to 2nd ANN

• Therefore, ~ 15 x 3 = 45 input units in ANN

• 3 output units, for “strand”, “helix”, “coil”
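
The way the two networks chain together can be sketched as below. The networks themselves are passed in as plain functions; `dummy_first` and `dummy_second` are hypothetical stand-ins, not trained models. The zero-padding at the ends, the output order (helix, strand, coil), and picking the largest output as the final label are all assumptions made for this example.

```python
def windows(rows, window_size=15):
    """Yield, for every position, the flattened zero-padded window centered on it."""
    half = window_size // 2
    width = len(rows[0])
    for center in range(len(rows)):
        flat = []
        for pos in range(center - half, center + half + 1):
            flat.extend(rows[pos] if 0 <= pos < len(rows) else [0.0] * width)
        yield flat


def two_stage_predict(profile, first_net, second_net):
    """Run the first network on profile windows, then the second on its outputs."""
    first_outputs = [first_net(w) for w in windows(profile)]           # 300 inputs each
    second_outputs = [second_net(w) for w in windows(first_outputs)]   # 45 inputs each
    labels = "HSC"  # assumed output order: helix, strand, coil
    return "".join(labels[out.index(max(out))] for out in second_outputs)


# Hypothetical stand-ins for the two trained networks.
def dummy_first(window):
    assert len(window) == 15 * 20
    return [0.2, 0.1, 0.7]


def dummy_second(window):
    assert len(window) == 15 * 3
    return [sum(window[0::3]), sum(window[1::3]), sum(window[2::3])]


toy_profile = [[0.5] * 20 for _ in range(20)]
prediction = two_stage_predict(toy_profile, dummy_first, dummy_second)
```

With these stand-ins every position comes out as coil; with trained networks the second stage would smooth the first stage's per-position predictions.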


Test of performance

Cross-validation

• Partition the training data into “training set” (two thirds of the examples) and “test set” (remaining one third)

• Train PSIPRED on training set, test predictions and compare with known answers on test set.

• What is an answer?

  – For each position of sequence, a prediction of what secondary structure that position is involved in

  – That is, a sequence over “H/S/C” (helix/strand/coil)

• How to compare answer with known answer?

  – Number of positions that match
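
The evaluation can be sketched in a few lines. The function names and the two example strings are made up for illustration; the split simply takes the first two thirds of the examples in the order given, which is an assumption.

```python
def split_two_thirds(examples):
    """Split examples into a training set (first two thirds) and a test set (rest)."""
    cut = (2 * len(examples)) // 3
    return examples[:cut], examples[cut:]


def fraction_matching(predicted, known):
    """Fraction of positions where the predicted H/S/C label equals the known one."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known structures must have the same length")
    matches = sum(p == k for p, k in zip(predicted, known))
    return matches / len(known)


training_set, test_set = split_two_thirds(list(range(9)))

# Hypothetical predicted and known secondary structures for one test protein.
predicted = "CCHHHHCCSSSC"
known = "CHHHHHCCSSCC"
accuracy = fraction_matching(predicted, known)
```

Here 10 of the 12 positions match, so the accuracy for this protein is 10/12.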