bio chap notes
8/8/2019 Bio Chap Notes
1/27
NOTES
Protein Docking
Definition:
Docking is the process of predicting the structure of complexes formed by protein-protein and
protein-ligand interactions.
Introductory Details
IO
Given two biological molecules determine:
Whether the two molecules interact
- is there an energetically favorable orientation of the
two molecules such that one may modify the other's
function
- do the two molecules fit together in any energetically
favorable way
If so, what is the orientation that maximizes the
interaction while minimizing the total energy of the
complex
Importance
DEBAAP
Design of novel pharmaceuticals, drugs, agricultural and biological products.
Essential Roles played by proteins
Bio-Molecular interactions are the core of all Biological Processes
Altering behavior of Protein
Act as Switches
Protein Engineering
Difficulty
HT
Both molecules are flexible and may alter each other's structure as they interact:
Hundreds to thousands of degrees of freedom
Total possible conformations are astronomical
Types of Docking
RF
Rigid Docking
Both the protein and ligand are considered as rigid molecules
Six Degrees of Freedom
Software: GRAMM
Flexible Docking
Either the protein or the ligand or both are considered partially or fully flexible
Search space vast
NEED FOR FLEXIBLE DOCKING
MSLT
It has been observed that when two molecules dock, the movement of atoms may involve
main chains
side chains
large changes in the interface upon complex formation
The atom movements are also observed in the exposed non-interface residues caused
by flexibility and disorder.
Global Optimization of Potential Energy Function
Need and Details
Final conformation of the docked complex is the one with the minimum energy
The multidimensional energy surface suffers from the multiple-minima problem: it has many local minima.
Need for new specialized global search strategies
Examples
SD
S:SG
D:A
Statistical / Stochastic
Simulated Annealing
Genetic Algorithms
Deterministic
Alpha Branch and Bound (αBB)
Docking Methods
Types of Flexible docking models
RPH
Rigid Protein and Flexible Ligand
Partially Flexible Protein and Flexible Ligand
Hinge Bending in Protein and Ligand
Rigid Protein Flexible Ligand
Method 1:
Details:
Finds optimal binding site in substrate by exploring various binding sites and
ligand conformations using
Simulated Annealing,
Genetic Algorithms
and Local Search.
Software:
AutoDock
Drawbacks:
Only considers ligand flexibility and hence cannot predict the conformational
changes in the receptor Protein
The selection of flexible torsional angles in the ligand has to be done by the user
Method 2: DSD D:UPU S:F - D:AG
Details:
Use incremental construction algorithm
Place an anchor fragment of ligand in the binding site
Uses greedy algorithm to add fragments and complete the ligand structure
Software
FlexX
Drawbacks:
Anchor Fragment selection is difficult
Greedy algorithm propagates errors resulting from initial bad choices
Partially Flexible Protein - Flexible Ligand
Details:
Induced Fit Model: BF
Both Protein and Ligand are flexible and undergo a conformational change upon
binding to form a minimum energy perfect-fit
Full Protein flexibility is computationally intractable
Method 1: DSD D:RPC S:I D:AI
Details:
Internal Coordinate Mechanics (ICM) RPC
Partial protein flexibility in the side chains of the active site
Randomly changes the position of the ligand, followed by energy
minimization
A comparative assessment against AutoDock and FlexX showed that ICM
provided the highest accuracy
Software:
ICM (Internal Coordinate Mechanics)
Drawbacks AI
Allows partial protein flexibility in the side chains of the active site only
Ignores any changes in the backbone atoms or atoms in other regions
Protein Folding
Definition
Protein folding is the physical process by which a one dimensional chain of amino acids folds
into its characteristic and functional three-dimensional structure.
1D to 3D
Motivation for Protein 3D structure Prediction
Main Motivation:
The structure of a protein is directly related to the protein's function.
The reasons for research of 3D structure are: MAID
Medicine:
Understanding biological functions. Binding and unbinding of proteins
constitute much of the cellular activity of living organisms.
Agriculture:
Genetic engineering of better and richer crops.
Industry:
Synthesis of enzymes.
Drug Design:
Finding targets for docking drugs.
What determines Fold?
In general, the amino-acid sequence of a protein determines its 3D shape,
but there are some exceptions: AS
all proteins can be denatured
some molecules have multiple conformations
Difficulty: Is Problem Hard?
Yes. HD
Huge Search Space:
Difficult to decide criterion for correct fold
Solvable?
Yes, nature does it all the time (long live Nature!)
Thus
1. Nature does not sample all conformations
2. Nature knows the correct criteria
Folding Method Criterion
The method of choice for folding a given protein depends on
existence of a similar protein whose structure is already known, and
on the extent of such similarity
Methods of Protein Prediction
1. Comparative Modeling or Homology Modeling
2. Ab initio prediction (from scratch)
3. Fold Recognition or threading (fitting a sequence to existing models)
Homology or Comparative Modeling
Introduction
Comparative modeling allows us to build a 3-D model for a protein of known amino acid (aa)
sequence but un-known structure, using another protein of known sequence and
structure as a template
Steps: IADIBBRE
Identify Homologous/Parent(s) Sequences: SD
Search the target sequence against the sequences in the protein data bank
using
FASTA
or BLAST.
Distant homologs can be found using PSI-BLAST.
Align the target sequence with Parent(s) 3IA
If sequence identity between target and parent is >70% the alignment is
generally trivial
If identity is below 40% then alignment becomes very difficult
Alignment is done using Dynamic Programming as discussed earlier.
If more than one parent is identified, it is better to first align the parents
structurally, then calculate an MSA, and then align the target sequence to that multiple alignment.
Determine SCRs (structurally conserved regions) and SVRs (structurally variable regions)
Multiple Parents
If multiple parents are present, the regions that are the same in all
parents are termed SCRs or core regions, and the remaining variable regions
are termed SVRs
Single Parent
If only one parent is present, we initially assume that all
regions are SCRs except where indels occur. If even one indel occurs in a
loop region, the whole loop is considered an SVR.
Inherit the SCRs from the parent(s)
The SCRs are copied from the parent(s) for use in the model.
Build the SVRs
SVRs are almost always loop regions
When these vary in length from the loops present in the parent
structure(s), they will be built with lower accuracy.
Even where loop regions are conserved in length they can adopt quite
different conformations.
Build the sidechains
Rotamer Library: (Developing Bioinformatics Computer Skills by Cynthia Gibas,
Ref: pg261)
These contain information about allowed rotations of the remote amino acid
side chain atoms
Refine Model
Usually through (EM) Energy Minimization
Standard mathematical minimization algorithms
EM is only able to move atoms such that a local minimum on the energy surface
is found.
Evaluate Errors in the model
RMSD (Root Mean Square Deviation) measures how similar one structure is to
another.
$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2}$
d_i: distance between the i-th pair of corresponding atoms; N: number of atom pairs
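A minimal sketch of the RMSD calculation, assuming the two structures are already superimposed and that atom i of one structure corresponds to atom i of the other. The function name `rmsd` and the toy coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) atom coordinates.

    Assumption: the structures are already superimposed and atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for a, b in zip(coords_a, coords_b):
        # d_i squared: squared Euclidean distance between the i-th atom pair
        squared_sum += sum((p - q) ** 2 for p, q in zip(a, b))
    return math.sqrt(squared_sum / len(coords_a))


# Toy example: every atom of the model is shifted by 1 unit along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model, reference))
```

Since every atom is displaced by the same distance, the RMSD here equals that distance; an RMSD of 0 means the two coordinate sets coincide.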
Exact String Matching
Applications:
it has applications in many fields, as:
Text searching
Molecular biology
Data compression
and so on
Motivation
Recognizing DNA Contamination
Given a string S1 (the newly isolated and sequenced string of DNA) and a known string S2 (the
combined sources of possible contamination), find all substrings of S2 that occur in S1 and are
longer than some given length.
These substrings are candidates of unwanted pieces of S2 that have contaminated the desired
DNA string.
Z-Algorithm
Given string S = {aabadaabcaaba}
Step 1: number the characters
a a b a d a a b c a a b a
1 2 3 4 5 6 7 8 9 10 11 12 13
Step 2: Calculate Zs
Z2=1{a..a} , Z3= 0, Z4=1{a..a}, Z5=0, Z6=3{aab..aab}, Z7=1{a..a}, Z8=0, Z9=0, Z10=4, Z11=1, Z12=0, Z13=1
Step 3: Draw Z boxes (the box starting at i covers positions i to i+Zi-1)
a a b a d a a b c a a b a
1 2 3 4 5 6 7 8 9 10 11 12 13
Boxes: [2,2], [4,4], [6,8], [7,7], [10,13], [11,11], [13,13]
Step 4: obtain li and ri values:
i 1 2 3 4 5 6 7 8 9 10 11 12 13
Li - 2 2 4 4 6 6 6 6 10 10 10 10
Ri - 2 2 4 4 8 8 8 8 13 13 13 13
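The worked example can be checked with a short sketch. It computes each Zi directly from the definition (quadratic time, not the linear-time algorithm) and then derives li and ri as the Z-box reaching furthest right among those starting at or before i. The function names are illustrative, and positions are 1-based to match the notes.

```python
def z_values_naive(s):
    """Return {i: Zi} for 1-based positions i = 2..len(s), straight from the definition."""
    n = len(s)
    z = {}
    for i in range(2, n + 1):
        length = 0
        # Compare the suffix starting at position i with the prefix of s.
        while i - 1 + length < n and s[length] == s[i - 1 + length]:
            length += 1
        z[i] = length
    return z


def box_bounds(z, n):
    """Return {i: (li, ri)}: the Z-box ending furthest right that starts at or before i."""
    bounds = {}
    l = r = None
    for i in range(2, n + 1):
        if z[i] > 0 and (r is None or i + z[i] - 1 > r):
            l, r = i, i + z[i] - 1
        bounds[i] = (l, r)
    return bounds


s = "aabadaabcaaba"
z = z_values_naive(s)
bounds = box_bounds(z, len(s))
for i in range(2, len(s) + 1):
    print(i, z[i], bounds[i])
```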
How to find small pattern P in T
Let $ be a character found neither in P nor in T.
Build the string S = P$T.
String S has length m+n+1, where m = |P|, n = |T| and m ≤ n.
Run the Z algorithm on S.
The resulting values Zi satisfy Zi ≤ m, because the character $ occurs only once in S.
Every position i > m+1 with Zi = m marks an occurrence of P in T, starting at position i-(m+1) of T.
CASES
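Case 1: k > r (k is outside every Z-box found so far)
Compute Zk by explicitly comparing S[k..] with the prefix S[1..] until a mismatch is found
If Zk > 0, set r = k+Zk-1 and l = k
Case 2: k ≤ r (k is inside a Z-box)
Position k corresponds to position k' = k-l+1 in the prefix; let B = S[k..r], so |B| = r-k+1
Case 2a: Zk' < |B|
The Z-box at k is the same as the Z-box at k'
Hence, Zk is set to Zk'
Values of r and l remain unchanged
Case 2b: Zk' >= |B|
The algorithm searches for the first mismatch by comparing from position r+1 onward (against position |B|+1 of the prefix)
Let q be the position of that mismatch
Zk is set to (q-1)-k+1 = q-k
r is set to q-1 and l is set to k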
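A sketch of the linear-time Z algorithm and of the P$T search, assuming 0-based Python indexing and a half-open box `s[l:r]` instead of the 1-based inclusive [l, r] of the notes. The mirrored position of k inside the current box is `k - l`. The names `z_array` and `find_occurrences` are illustrative.

```python
def z_array(s):
    """Z values of s (0-based); z[0] is left as 0 by convention."""
    n = len(s)
    z = [0] * n
    l, r = 0, 0  # rightmost Z-box found so far is s[l:r]
    for k in range(1, n):
        if k < r:
            # Case 2: k lies inside the box, so reuse the value at the
            # mirrored position k - l, capped by the remaining box length.
            z[k] = min(r - k, z[k - l])
        # Case 1 and Case 2b: extend by explicit comparison.
        # In Case 2a the first comparison fails and nothing changes.
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:
            l, r = k, k + z[k]
    return z


def find_occurrences(pattern, text, sep="$"):
    """0-based start positions of pattern in text, via the Z values of pattern + sep + text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - (m + 1) for i in range(m + 1, len(z)) if z[i] == m]


print(z_array("aabadaabcaaba"))
print(find_occurrences("aab", "aabadaabcaaba"))
```

Each character comparison that succeeds moves r to the right, and r never moves back, which is why the total work is linear in the length of S.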
Hidden Markov Model
Definitions
A Markov chain is a random process with the property that the next state depends only on the
current state.
A hidden Markov model (HMM) is a statistical Markov model in which the system being
modeled is assumed to be a Markov process with unobserved (hidden) states.
Each state has a probability distribution over the possible output tokens. Therefore the
sequence of tokens generated by an HMM gives some information about the sequence of
states.
Note that the adjective 'hidden' refers to the state sequence through which the model passes,
not to the parameters of the model; even if the model parameters are known exactly, the model
is still 'hidden'.
Points:
In a regular Markov model, the state is directly visible to the observer, and therefore the state
transition probabilities are the only parameters.
In a hidden Markov model, the state is not directly visible, but output, dependent on the state, is
visible.
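Calculating the Viterbi values
$V_l(i) = e_l(x_i) \cdot \max_k \big( V_k(i-1) \cdot a_{kl} \big)$
where e_l(x_i) is the probability that state l emits the i-th output x_i, V_k(i-1) is the value of state k in the previous column, and a_kl is the transition probability from k to l.
Note: the direction (back-pointer) is taken from the cell that supplies the maximum value.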
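A minimal sketch of the Viterbi calculation: each cell is the emission probability times the best product of a previous-column value and a transition probability, and a back-pointer records which previous state gave the maximum. The two-state coin model (F = fair, L = loaded) and all its probabilities are made-up illustration values, not taken from the notes.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability)."""
    # V[i][l]: probability of the best path ending in state l after output i
    V = [{l: start_p[l] * emit_p[l][observations[0]] for l in states}]
    back = [{}]
    for i in range(1, len(observations)):
        V.append({})
        back.append({})
        for l in states:
            # Previous state k that maximizes V_k(i-1) * a_kl
            best_k = max(states, key=lambda k: V[i - 1][k] * trans_p[k][l])
            V[i][l] = emit_p[l][observations[i]] * V[i - 1][best_k] * trans_p[best_k][l]
            back[i][l] = best_k  # remember the direction of the maximum
    # Trace back from the best final state.
    last = max(states, key=lambda l: V[-1][l])
    path = [last]
    for i in range(len(observations) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, V[-1][last]


# Hypothetical model: a fair coin (F) and a coin loaded towards heads (L).
states = ("F", "L")
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}

print(viterbi("HHHH", states, start_p, trans_p, emit_p))
print(viterbi("TTTT", states, start_p, trans_p, emit_p))
```

For long sequences the products underflow, so practical implementations usually work with sums of log probabilities instead.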
Viterbi Algorithm
Regular Expression to HMM
State HMM
Profile HMM
A profile HMM can be obtained from a multiple alignment and can be used for searching a
database for other members of the family in the alignment very much like standard profiles
Parts
The bottom line of states are called the main states, because they model the columns of the alignment.
In these states the probability distribution is just the frequency of the amino acids or
nucleotides as in the previous model of the DNA motif.
The second row of diamond shaped states are called insert states and are used to model
highly variable regions in the alignment.
They function exactly like the top state in the previous example.
The top line of circular states are called delete states.
These are a different type of state, called a silent or null state.
They do not match any residues,
they make it possible to jump over one or more columns in the alignment, to model the
situation when just a few of the sequences have a gap ('-') in the multiple alignment at a
position.
Genetic Algorithm
Gene: A specific sequence of nucleotide bases that carries the information required for constructing proteins.
Proteins: Provide structural components of cells, tissues and enzymes for essential biochemical reactions.
Overview
- A class of probabilistic optimization and search algorithms based on the mechanics of natural selection
and natural genetics.
- Inspired by the biological evolution process. Dinosaurs are dead; cockroaches are still surviving.
- Based on survival of the fittest among string structures, combined with a structured yet randomized
exchange, to form a search algorithm.
What is Genetic Algorithm
A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it
evolve by iteratively applying a set of stochastic operators.
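A minimal sketch of such a loop, under assumptions that are not in the notes: bit-string individuals, the toy "OneMax" fitness (count of 1 bits), tournament selection of size 2, one-point crossover, bit-flip mutation, and elitism (the best individual is always carried over). All names and parameter values are illustrative.

```python
import random


def one_max(bits):
    """Toy fitness function: the number of 1 bits."""
    return sum(bits)


def tournament(population, fitness, rng):
    """Pick two random individuals and return the fitter one."""
    a, b = rng.choice(population), rng.choice(population)
    return a if fitness(a) >= fitness(b) else b


def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=60,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        children = [best[:]]  # elitism: keep the best solution found so far
        while len(children) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


best, history = genetic_algorithm(one_max)
print(one_max(best), history[0], history[-1])
```

Because of elitism the best fitness per generation never decreases; without it, crossover and mutation can lose the best individual.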
GA vs. Normal Optimization
- GAs work with a coding of the parameter set, not the parameters themselves.
- GAs search from a population of points, not a single point.
- GAs use payoff (objective function) information, not derivatives or other auxiliary knowledge.
- GAs use probabilistic transition rules, not deterministic rules.
Nature | Genetic Algorithms
Environment | Optimization problem
Individuals living in that environment | Feasible solutions
An individual's degree of adaptation to its surrounding environment | Solution quality (fitness function)
A population of organisms (species) | A set of feasible solutions
Selection, recombination and mutation in nature's evolutionary process | Stochastic operators
Evolution of populations to suit their environment | Iteratively applying a set of stochastic operators on a set of feasible solutions
Limitations and weakness
- The idea behind GA is extremely appealing, but has the following limitations:
- It is quite unnatural to model most applications in terms of genetic operators like mutation and
crossover on bit strings.
- The pseudo-biology adds another level of complexity between you and your problem.
- Their weakness is that the process of selection alone is too systematic and predictable, not like
creativity as we know it.
- Binary representations are limited in their operations, and for certain problems alternative
operators and representations must be used.
- Crossover and mutation make no use of real problem structure, so a large fraction of
transitions lead to inferior solutions, and convergence is slow.
- GAs take a very long time on non-trivial problems, as they generally require more objective
function evaluations than classical optimization techniques. This is a major
practical limitation.
- The analogy with evolution is appropriate, but evolution took millions of years to achieve significant
improvement. Can we afford to wait that long?
Iterative search
1. Select current solution
2. Create new solution
3. Check whether the solution meets the desired criterion
4. If not, repeat (1) and (2)
5. Else stop
Neighborhood Methods
- Same as iterative search
- Steps (descent method):
o Start from a feasible solution i
o N(i) is the set of all neighbors of the solution i
o Search N(i) for a solution j with f(j) < f(i) and move to it; when f(j) ≥ f(i) for every j in N(i), the search stops
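A minimal sketch of the descent method, assuming we move to the best improving neighbor at each step (a "steepest descent" choice; the notes do not say which improving neighbor to take). The toy landscape `values` and the neighborhood of adjacent indices are made up for illustration.

```python
def descent(f, start, neighbours):
    """Move to an improving neighbour until none exists; returns a local minimum."""
    i = start
    while True:
        better = [j for j in neighbours(i) if f(j) < f(i)]
        if not better:
            return i  # f(j) >= f(i) for every neighbour j: stop
        i = min(better, key=f)


# Toy landscape: index 1 is a local minimum, index 5 is the global minimum.
values = [5, 3, 4, 6, 2, 0, 1]


def f(i):
    return values[i]


def neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(values)]


print(descent(f, 0, neighbours))  # stops in the local minimum near the start
print(descent(f, 3, neighbours))  # this start happens to reach the global minimum
```

The result depends entirely on the starting point, since the method never accepts a move that makes f worse.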
Simulated Annealing
Intro to Annealing
- Steps:
o Heat to the desired temperature
o Hold at that temperature
o Cool to room temperature
- Sequences of time and temperature
o Annealing schedules
o Cooling schedules
- These schedules are critical in two ways:
o A difference between the outside and inside temperatures
causes temperature gradients and internal stress,
and hence cracks
o The actual annealing time should be long enough for the transformation to take place
Intro to Simulated Annealing
- Same as the neighborhood method
- Problem with the descent method when a global optimum is sought:
- It gets stuck in local minima, because every move with f(j) > f(i) is rejected
- Generating distribution
o Generates possible valleys or states to be explored
- Acceptance distribution
o Depends on the difference between the function value of the newly generated valley to be explored and
the last saved lowest valley
- For certain NP-hard problems, where a local extremum is usually reached and the better global extremum
is missed
o Simulated annealing outperforms straightforward iterative improvement
o Tradeoff: longer running times
Simulated Annealing
- Boltzmann probability factor:
o $p = \exp(-\delta f / T)$
where $\delta f$ is the increase in f and
T is a control parameter.
- Requirements for SA:
o A representation of possible solutions
o A generator of random changes in solutions
o A means of evaluating the problem functions
o Annealing schedule
- Method:
1. Input and assess Initial solution
2. Estimate initial temperature
A suitable T0 is one that results in an average acceptance probability
p0 of about 0.8 for increases in f.
The value of T0 will clearly depend on f and is, hence, problem-specific.
T0 is estimated by conducting an initial search in which all increases are accepted
and calculating the average observed increase in f, $\overline{\delta f^{+}}$:
$T_0 = -\overline{\delta f^{+}} / \ln(p_0)$
3. Generate new solution
4. Assess new solution
5. Check whether to accept the new solution
6. If yes, then update stores
7. If no, skip (6)
8. Adjust temperature
9. Check whether to terminate the search
10. If (9) yes, STOP
11. Else start again from step (3)
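Step 2 can be sketched as follows: run a short random walk in which every move is accepted, average the observed increases in f, and choose the temperature at which an increase of that average size would be accepted with probability p0. The function names, the number of trials and the toy objective are assumptions made for illustration.

```python
import math
import random


def initial_temperature(mean_increase, p0=0.8):
    """Temperature at which an increase of size mean_increase is accepted with probability p0."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    return -mean_increase / math.log(p0)


def mean_observed_increase(f, start, random_neighbour, trials=200, seed=0):
    """Average increase in f over a random walk in which every move is accepted."""
    rng = random.Random(seed)
    x, fx = start, f(start)
    increases = []
    for _ in range(trials):
        y = random_neighbour(x, rng)
        fy = f(y)
        if fy > fx:
            increases.append(fy - fx)
        x, fx = y, fy  # accept every move during this initial search
    if not increases:
        raise ValueError("no increase in f was observed")
    return sum(increases) / len(increases)


# Toy objective: f(x) = x^2 with uniform random steps of size up to 1.
avg = mean_observed_increase(lambda x: x * x, 5.0,
                             lambda x, rng: x + rng.uniform(-1.0, 1.0))
t0 = initial_temperature(avg, p0=0.8)
print(avg, t0)
```

Since ln(p0) is negative for p0 < 1, the resulting temperature is positive.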
- Acceptance of search steps (Metropolis criterion):
o Assume the performance change in the search direction is Δ
o Always accept a descending step (Δ ≤ 0)
o Accept an ascending step only if it passes a random test:
$\exp(-\Delta / T) > \mathrm{random}[0, 1]$
- Cooling schedule
o T, the annealing temperature, is the parameter that controls the frequency of acceptance of
ascending steps
o We gradually reduce the temperature T(k)
o At each temperature the search is allowed to run for a certain number of steps, L(k)
o The choice of the parameters {T(k), L(k)} is called the cooling schedule
o Example:
Set L = n, the number of variables in the problem.
Set T(0) such that $\exp(-\Delta / T(0)) \approx 1$, where Δ is the performance change of a step.
Set T(k+1) = α·T(k), where α is a constant smaller than but close to 1.
- Algorithm
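A minimal sketch of the algorithm for a minimization problem, assuming a geometric cooling schedule (temperature multiplied by a constant `alpha` after a fixed number of steps) and the acceptance test "accept a worse move if exp(-delta / T) exceeds a uniform random number". Parameter names, default values and the toy landscape are illustrative, not prescribed by the notes.

```python
import math
import random


def simulated_annealing(f, start, random_neighbour, t0, alpha=0.95,
                        steps_per_temperature=20, t_min=1e-3, seed=0):
    """Return (best solution found, its f value)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    rng = random.Random(seed)
    current, f_current = start, f(start)
    best, f_best = current, f_current
    t = t0
    while t > t_min:
        for _ in range(steps_per_temperature):
            candidate = random_neighbour(current, rng)
            f_candidate = f(candidate)
            delta = f_candidate - f_current
            # Descending steps are always accepted; ascending steps only
            # if they pass the random test.
            if delta <= 0 or math.exp(-delta / t) > rng.random():
                current, f_current = candidate, f_candidate
                if f_current < f_best:
                    best, f_best = current, f_current  # update stores
        t *= alpha  # adjust (lower) the temperature
    return best, f_best


# Toy landscape with a local minimum at index 1 and the global minimum at index 5.
values = [5, 3, 4, 6, 2, 0, 1]


def step(i, rng):
    j = i + rng.choice((-1, 1))
    return min(max(j, 0), len(values) - 1)


best, f_best = simulated_annealing(lambda i: values[i], 0, step, t0=10.0)
print(best, f_best)
```

At high temperature almost every move is accepted, so the walk can leave the local minimum; as T falls the test rejects more ascending steps and the search behaves more and more like plain descent.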
- Strengths of SA
o Simulated annealing can deal with highly nonlinear models, noisy data and many
constraints.
o It is a robust and general technique.
o Its main advantages over other local search methods are its flexibility and its ability to
approach global optimality.
o The algorithm is quite versatile since it does not rely on any restrictive properties of the
model.
Heuristic Algorithm
Intro
- Dynamic Programming
o Loses its attraction when
the database size reaches about 10^9
o It takes too much time
- Alternatives:
o Hardware
Very fast
Expensive
o Distributed computing
Slower than the hardware version but still fast
Expensive
o Heuristics
Much faster than dynamic programming
Heuristic Method
- Definition:
o A heuristic method is an algorithm that gives only an approximate solution to a given problem.
- Important Points:
o Sometimes we are not able to formally prove that this solution actually solves the
problem,
o but heuristic methods are commonly used because they are much faster than exact
algorithms.
o In addition, this is a software-based strategy, which is therefore relatively cheap and
available to any researcher.
- Commonly used heuristics are based on the following observations:
o Even linear time complexity will be problematic when the database size is huge (over 10^9).
o Preprocessing of the database is desirable, since numerous queries are run on an
infrequently updated database.
o Substitutions are much more likely than indels.
o We expect homologous sequences to contain a lot of segments with matches or
substitutions, but without indels and gaps.
o These segments can be used as starting points for further searching.
- Heuristic methods:
o FASTA
o BLAST
o BLAST2
FASTA
Intro
The FASTA algorithm is a heuristic method for string comparison.
It compares a query string against a single text string.
When searching the whole database for matches to a given query, we compare the query using
the FASTA algorithm to every string in the database.
Good local alignment is likely to have exact matching subsequences.
The algorithm uses this property and focuses on segments in which there will be an absolute
identity between the two compared strings.
We can use the alignment Dot-Plot matrix for finding these identical regions.
Method (Steps):
1. Finding Hot Spots (Regions with exact match (ktup))
2. Finding 10 Best Diagonal Runs (No indels)
3. Evaluate Diagonal Runs (using substitution matrices, find Init1)
4. Combine Good Diagonal Runs (Init N (Max. Weight Path in the Graph))
5. Finding alternative local alignment (Dynamic Programming in a Band (opt score))
6. Ranking
Small Definitions:
- Hot Spots:
o The first step of the algorithm is to determine all exact matches of length k (word size)
between the two sequences, called hot spots
- Ktup:
o (short for k respective tuples) - an integer parameter, which specifies the length of the
matching substrings
- Diagonal Runs:
o A diagonal run is a set of hot spots that lie in a consecutive sequence on the same
diagonal (not necessarily adjacent along the diagonal, i.e., spaces between these hot
spots are allowed).
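A minimal sketch of the first step (finding hot spots) and of grouping them by diagonal, assuming a lookup table of all ktup-length words of the text. Scoring and selecting the best diagonal runs is not shown. The function names and the toy sequences are illustrative.

```python
from collections import defaultdict


def hot_spots(query, text, ktup):
    """All (i, j) with query[i:i+ktup] == text[j:j+ktup], 0-based."""
    index = defaultdict(list)  # word -> start positions in text
    for j in range(len(text) - ktup + 1):
        index[text[j:j + ktup]].append(j)
    spots = []
    for i in range(len(query) - ktup + 1):
        for j in index.get(query[i:i + ktup], []):
            spots.append((i, j))
    return spots


def group_by_diagonal(spots):
    """Group hot spots by diagonal j - i; spots on one diagonal can form a diagonal run."""
    diagonals = defaultdict(list)
    for i, j in spots:
        diagonals[j - i].append((i, j))
    return dict(diagonals)


spots = hot_spots("ACGT", "TACGTACG", ktup=2)
print(spots)
print(group_by_diagonal(spots))
```

Hot spots on the same diagonal correspond to an alignment without indels, which is why diagonal runs are scored first and gaps are only introduced in the later steps.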
Comments
A larger ktup increases speed, since fewer hits are found, but it also decreases sensitivity for
finding similar but not identical sequences, since exact matches of this length are required.
BLAST
intro
BLAST (Basic Local Alignment Search Tool)
The BLAST algorithm was developed in 1990 (FASTA in 1985)
The motivation for the development of BLAST was the need to increase the speed of FASTA by
finding fewer and better hot spots.
The idea was to integrate the substitution matrix in the first stage of finding the hot spots.
Method (Steps):
1. k-letter word list of the query sequence
2. List the possible matching words
3. Scan Database (For Seeding)
4. Extend exact matches to High Scoring Segment Pairs
5. List all HSP above cutoff score
6. Evaluate Statistical Significance of HSP
7. Combine HSP
8. Show gapped local alignment
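Steps 1 and 2 can be sketched as follows: list the k-letter words of the query, then for each word list every word over the alphabet whose score against it reaches a threshold T. This is a toy version under stated assumptions: a DNA alphabet and a simple match/mismatch score (+5 / -4) stand in for a real substitution matrix, and the neighborhood is found by brute-force enumeration. All names and values are illustrative.

```python
from itertools import product


def word_score(a, b, match=5, mismatch=-4):
    """Toy ungapped score of two equal-length words (stand-in for a substitution matrix)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def query_words(query, k):
    """Step 1: all overlapping k-letter words of the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def neighbourhood(word, threshold, alphabet="ACGT"):
    """Step 2: all words that score at least `threshold` against `word`."""
    candidates = ("".join(letters) for letters in product(alphabet, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) >= threshold]


print(query_words("ACGTA", 3))
print(neighbourhood("ACG", threshold=6))
```

With these toy scores, a threshold of 6 admits the word itself and every word that differs from it in exactly one position. Lowering the threshold enlarges the word list and raises sensitivity at the cost of more hits to extend.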
Small Definitions:
- Segment pair:
o Given two strings S1 and S2, a segment pair is a pair of equal-length substrings of S1 and
S2, aligned without gaps.
- A locally maximal segment is a segment whose alignment score (without gaps) cannot be
improved by extending it or shortening it.
- A maximum segment pair (MSP) in S1 and S2 is a segment pair with the maximum score over all
segment pairs in S1, S2.
- When comparing all the sequences in the database against the query, BLAST attempts to find all
the database sequences that, when paired with the query, contain an MSP above some cutoff
score S. We call such pairs high-scoring segment pairs (HSPs).
Types of BLAST
- Two hit
o the extension step typically accounts for 90% of BLAST's execution time
o key idea: do the extension only when there are two hits on the same diagonal within
distance A of each other
o to maintain sensitivity, lower the T parameter
more single hits are found
o but only a small fraction have an associated 2nd hit
- Gapped
o trigger gapped alignment if the two-hit extension has a sufficiently high score
o run the DP process both forward and backward from the seed
o prune cells when the local alignment score falls a certain distance below the best score yet
- PSI-BLAST
o use results from a BLAST query to construct a profile matrix
o search the database with the profile instead of the query sequence
o iterate
- Profile creation
o The program initially operates on a single query sequence by performing a gapped
BLAST search
o Then, the program takes the significant local alignments (hits) found, constructs a multiple
alignment and abstracts a position-specific scoring matrix (PSSM) from this alignment.
o Steps:
Take significant BLAST hits
Make an alignment
Construct profile
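The last step, constructing a profile from an alignment, can be sketched as a log-odds matrix with one column per alignment position. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a DNA alphabet, a uniform background distribution, add-one pseudocounts and no sequence weighting. All names are illustrative.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Return a list of {letter: log-odds score}, one dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Gap characters (anything outside the alphabet) are ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pssm


def score_against_profile(pssm, segment):
    """Ungapped score of a segment placed on the profile from its first column."""
    return sum(column[ch] for column, ch in zip(pssm, segment))


hits = ["ACGT", "ACGA", "ACGT"]
pssm = build_pssm(hits)
print(score_against_profile(pssm, "ACGT"))
print(score_against_profile(pssm, "TTTT"))
```

Letters that are frequent in a column get positive scores and rare ones negative scores, so a segment resembling the aligned family scores higher than an unrelated one.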