bio chap notes

8/8/2019 Bio Chap Notes


NOTES

Protein Docking

Definition:

The process of predicting the structure of complexes formed by protein-protein and protein-ligand

interaction is called docking

Introductory Details

IO

Given two biological molecules determine:

Whether the two molecules interact

- is there an energetically favorable orientation of the two molecules such that one may modify the other's function

- do the two molecules fit together in any energetically

favorable way

If so, what is the orientation that maximizes the

interaction while minimizing the total energy of the

complex

Importance

DEBAAP

Design of novel pharmaceuticals, drugs, agricultural and biological products.

Essential Roles played by proteins

Bio-Molecular interactions are the core of all Biological Processes

Altering behavior of Protein

Act as Switches



Protein Engineering

Difficulty

HT

Both molecules are flexible and may alter each other's structure as they interact:

Hundreds to thousands of degrees of freedom

Total possible conformations are astronomical

Types of Docking

RF

Rigid Docking

Both the protein and ligand are considered as rigid molecules

Six Degrees of Freedom

Software: GRAMM

Flexible Docking

Either the protein or the ligand or both are considered partially or fully flexible

Search space vast

NEED FOR FLEXIBLE DOCKING

MSLT

It has been observed that when two molecules dock, the movement of atoms may involve

main chains

side chains

large changes in the interface upon complex formation

The atom movements are also observed in the exposed non-interface residues caused

by flexibility and disorder.



Global Optimization of Potential Energy Function

Need and Details

Final conformation of the docked complex is the one with the minimum energy

The multidimensional energy surface has many local minima (the multiple-minima problem).

Need for new specialized global search strategies

Examples

SD

S:SG

D:A

Statistical / Stochastic

Simulated Annealing

Genetic Algorithms

Deterministic

Alpha Branch and Bound (αBB)

Docking Methods

Types of Flexible docking models

RPH

Rigid Protein and Flexible Ligand

Partially Flexible Protein and Flexible Ligand

Hinge Bending in Protein and Ligand



Rigid Protein Flexible Ligand

Method 1:

Details:

Finds optimal binding site in substrate by exploring various binding sites and

ligand conformations using

Simulated Annealing,

Genetic Algorithms

and Local Search.

Software:

AutoDock

Drawbacks:

Only considers ligand flexibility and hence cannot predict the conformational

changes in the receptor Protein

The selection of flexible torsional angles in the ligand has to be done by the user

Method 2: DSD D:UPU S:F - D:AG

Details:

Use incremental construction algorithm

Place an anchor fragment of ligand in the binding site

Uses greedy algorithm to add fragments and complete the ligand structure

Software

FlexX

Drawbacks:

Anchor Fragment selection is difficult

Greedy algorithm propagates errors resulting from initial bad choices



Partial Flexible Protein - Flexible Ligand

Details:

Induced Fit Model: BF

Both Protein and Ligand are flexible and undergo a conformational change upon

binding to form a minimum energy perfect-fit

Full Protein flexibility is computationally intractable

Method 1: DSD D:RPC S:I D:AI

Details:

Internal Coordinate Mechanics (ICM) RPC

Partial protein flexibility in the side chains of the active site

Randomly changes the position of ligand followed by energy

minimization

Comparative assessment with AutoDock and FlexX showed that ICM provided the highest accuracy

Software:

ICM (Internal Coordinate Mechanics)

Drawbacks AI

Allows partial protein flexibility in the side chains of the active site only

Ignores any changes in the backbone atoms or atoms in other regions



Protein Folding

Definition

Protein folding is the physical process by which a one dimensional chain of amino acids folds

into its characteristic and functional three-dimensional structure.

1D to 3D

Motivation for Protein 3D structure Prediction

Main Motivation:

The structure of a protein is directly related to the protein's function.

The reasons for research of 3D structure are: MAID

Medicine:

Understanding biological functions. Binding and unbinding of proteins

constitute much of the cellular activity of living organisms.

Agriculture:

Genetic engineering of better and richer crops.

Industry:

Synthesis of enzymes.

Drug Design:

Finding targets for docking drugs.

What determines Fold?

In general, the amino-acid sequence of a protein determines its 3D shape

but there are some exceptions AS

all proteins can be denatured

some molecules have multiple conformations

Difficulty: Is Problem Hard?

Yes. HD

Huge Search Space:

Difficult to decide criterion for correct fold

Solvable?

Yes, long live Nature (nature folds proteins all the time)

Thus

1. Nature must not sample all conformations

2. Nature knows the correct criteria

Folding Method Criterion

The method of choice for folding a given protein depends on

existence of a similar protein whose structure is already known, and

on the extent of such similarity

Methods of Protein Prediction

1. Comparative Modeling or Homology Modeling

2. Ab initio prediction (from scratch)

3. Fold Recognition or threading (fitting a sequence to existing models)



Homology or Comparative Modeling

Introduction

Comparative modeling allows us to build a 3-D model for a protein of known amino-acid (aa) sequence but unknown structure, using another protein of known sequence and structure as a template

Steps: IADIBBRE

Identify Homologous/Parent(s) Sequences: SD

Search the target sequence against the sequences in the protein data bank

using

FASTA

or BLAST.

Distant homologs can be searched for using PSI-BLAST.

Align the target sequence with Parent(s) 3IA

If sequence identity between target and parent is >70% the alignment is

generally trivial

If identity is below 40% then alignment becomes very difficult

Alignment is done using Dynamic Programming as discussed earlier.

If more than one parent is identified, it is better to first align the parents structurally, then calculate an MSA, and then align the target sequence to that multiple alignment.



Determine SCRs and SVRs

Multiple Parents

If multiple parents are present then those regions that are same in all

parents are termed as SCRs or core regions and other variable regions

are termed as SVRs

Single Parent

Whereas if one parent is present then initially we assume that all

regions are SCRs except where indels occur. If even one indel occurs in a

loop region then whole of the loop is considered as SVR.

Inherit the SCRs from the parent(s)

The SCRs are copied from the parent(s) for use in the model.

Build the SVRs

SVRs are almost always loop regions

When these vary in length from the loops present in the parent

structure(s), they will be built with lower accuracy.

Even where loop regions are conserved in length they can adopt quite

different conformations.

Build the sidechains

Rotamer Library: (Developing Bioinformatics Computer Skills by Cynthia Gibas,

Ref: pg261)



These contain information about allowed rotations of the remote amino acid

side chain atoms

Refine Model

Usually through (EM) Energy Minimization

Standard mathematical minimization algorithms

EM is only able to move atoms such that a local minimum on the energy surface

is found.

Evaluate Errors in the model

RMSD (Root Mean Square Deviation) measures how similar one structure is to another.

R = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2}

d_i: distance between the i-th pair of corresponding atoms; N: number of atom pairs
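The RMSD formula can be sketched in a few lines of Python. This is a minimal illustration, assuming the two structures are already superimposed and given as equally long lists of (x, y, z) tuples in matching atom order; the function name `rmsd` and the sample coordinates are made up for the example.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for p, q in zip(coords_a, coords_b):
        # squared distance d_i^2 between the i-th pair of atoms
        total += sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total / len(coords_a))

# Hypothetical two-atom example: the second atom is displaced by 2 units.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
result = rmsd(model, reference)
```

Here the distances are 0 and 2, so the result is the square root of (0 + 4) / 2.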

Exact String Matching

Applications:

It has applications in many fields, such as:

Text searching

Molecular biology

Data compression

and so on



Motivation

Recognizing DNA Contamination

Given a string S1 (the newly isolated and sequenced string of DNA) and a known string S2 (the combined sources of possible contamination), find all substrings of S2 that occur in S1 and are longer than some given length.

These substrings are candidates of unwanted pieces of S2 that have contaminated the desired

DNA string.

Z-Algorithm

Given string S = {aabadaabcaaba}

Step 1: number the characters

a a b a d a a b c a a b a

1 2 3 4 5 6 7 8 9 10 11 12 13

Step 2: Calculate Zs

Z2=1{a..a} , Z3= 0, Z4=1{a..a}, Z5=0, Z6=3{aab..aab}, Z7=1{a..a}, Z8=0, Z9=0, Z10=4, Z11=1, Z12=0, Z13=1

Step 3: Draw Z boxes

Z-boxes: [2..2], [4..4], [6..8], [10..13] (the length-1 boxes at 7, 11 and 13 lie inside the larger ones)

a a b a d a a b c a a b a

1 2 3 4 5 6 7 8 9 10 11 12 13

Step 4: obtain li and ri values:

i 1 2 3 4 5 6 7 8 9 10 11 12 13

Li - 2 2 4 4 6 6 6 6 10 10 10 10

Ri - 2 2 4 4 8 8 8 8 13 13 13 13



How to find small pattern P in T

Let $ be a character found neither in P nor in T.

Build the string S = P$T.

String S has length m+n+1, where m = |P|, n = |T| and m ≤ n.

Run the Z algorithm on S.

The resulting values Zi satisfy Zi ≤ m, because the character $ occurs in S only at position m+1. Every position i > m+1 with Zi = m marks an occurrence of P in T, starting at position i-(m+1) of T.
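The search for P in T can be sketched as follows. This is an illustrative version only: the helper `z_values` computes the Z values straight from their definition (quadratic time, chosen for brevity rather than speed), positions are 0-based as is usual in Python, and both function names are made up for the example.

```python
def z_values(s):
    """Z values from the definition: z[k] = length of the longest
    substring starting at k that matches a prefix of s (z[0] unused)."""
    n = len(s)
    z = [0] * n
    for k in range(1, n):
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
    return z

def find_occurrences(pattern, text, sep="$"):
    """Return the 0-based start positions of pattern in text using S = P$T."""
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    m = len(pattern)
    z = z_values(pattern + sep + text)
    # an occurrence is any position after the separator whose Z value equals m
    return [k - (m + 1) for k in range(m + 1, len(z)) if z[k] == m]

hits = find_occurrences("aab", "aabadaabcaaba")
```

For the example string from the notes, `hits` lists the three places where "aab" starts.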

CASES

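Case 1: k > r (k is outside every Z-box found so far)

Zk is computed by explicit comparison of S[k..] with the prefix of S; if Zk > 0, r is set to k+Zk-1 and l is set to k

Case 2: k ≤ r (k is inside a Z-box)

Let k' = k-l+1 be the position in the prefix that corresponds to k, and let B = S[k..r], so |B| = r-k+1

Case 2a: Zk' < |B|

The Z-box at k is the same as the Z-box at k'

Hence, Zk is set to Zk'

Values of r and l remain unchanged

Case 2b: Zk' >= |B|

The algorithm starts searching for the first mismatch from position r+1

Let q be the position of that mismatch

Zk is set to (q-1)-k+1 = q-k

r is set to q-1 and l is set to k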
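The case analysis can be sketched as a linear-time routine. In this sketch, k' = k-l+1 is the prefix position that corresponds to k, and |B| = r-k+1 is the length of the part of the current Z-box from k to r. The code uses 1-based positions to match the worked example, and it records l and r after each position; the function name `z_algorithm` and the returned lists are choices made for this illustration.

```python
def z_algorithm(s):
    """Return (Z, L, R) as 1-indexed lists (index 0 and 1 unused) for string s."""
    n = len(s)
    t = " " + s                      # shift so that t[1] is the first character
    z = [0] * (n + 1)
    ls = [0] * (n + 1)
    rs = [0] * (n + 1)
    l = r = 0
    for k in range(2, n + 1):
        if k > r:
            # Case 1: k is outside every Z-box, compare explicitly
            q = k
            while q <= n and t[q] == t[q - k + 1]:
                q += 1
            z[k] = q - k
            if z[k] > 0:
                l, r = k, q - 1
        else:
            k_prime = k - l + 1      # position in the prefix matching k
            beta_len = r - k + 1     # |B|
            if z[k_prime] < beta_len:
                z[k] = z[k_prime]    # Case 2a: l and r stay unchanged
            else:
                q = r + 1            # Case 2b: look for the first mismatch after r
                while q <= n and t[q] == t[q - k + 1]:
                    q += 1
                z[k] = q - k
                l, r = k, q - 1
        ls[k], rs[k] = l, r
    return z, ls, rs

z, ls, rs = z_algorithm("aabadaabcaaba")
```

On the example string the Z and r values agree with the worked table. At position 13 this update rule sets l to 13, whereas the table lists 10; both are left ends of Z-boxes that end at r = 13, so either choice is consistent with the definition.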



Hidden Markov Model

Definitions

A Markov chain is a random process with the property that the next state depends only on the

current state.

A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states.

Each state has a probability distribution over the possible output tokens. Therefore the

sequence of tokens generated by an HMM gives some information about the sequence of

states.

Note that the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model; even if the model parameters are known exactly, the model is still 'hidden'.

Points:

In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters.

In a hidden Markov model, the state is not directly visible, but output, dependent on the state, is

visible.

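Calculate value of Viterbi

V_l(i) = e_l(x_i) \cdot \max_k \bigl( V_k(i-1) \, a_{kl} \bigr)

where e_l(x_i) is the emission probability of the output x_i in state l, V_k(i-1) is the value for state k in the previous column, and a_{kl} is the transition probability from k to l

Note: the traceback direction is taken from the cell that supplied the maximum value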
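The Viterbi recurrence and its traceback can be sketched as below. Everything about the example model is hypothetical: the two states `s1` and `s2`, the output symbols `x`, `y`, `z` and all probabilities are invented for illustration, and the function signature is just one reasonable way to pass an HMM around as dictionaries.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most probable hidden state path."""
    # V[i][l]: probability of the best path that ends in state l at position i
    V = [{l: start_p[l] * emit_p[l][observations[0]] for l in states}]
    back = [{}]
    for i in range(1, len(observations)):
        V.append({})
        back.append({})
        for l in states:
            # predecessor k that maximises V_k(i-1) * a_kl
            best_k = max(states, key=lambda k: V[i - 1][k] * trans_p[k][l])
            V[i][l] = emit_p[l][observations[i]] * V[i - 1][best_k] * trans_p[best_k][l]
            back[i][l] = best_k      # remember where the maximum came from
    last = max(states, key=lambda l: V[-1][l])
    path = [last]
    for i in range(len(observations) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return V[-1][last], path

# Hypothetical two-state model
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"x": 0.5, "y": 0.4, "z": 0.1},
          "s2": {"x": 0.1, "y": 0.3, "z": 0.6}}
prob, path = viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p)
```

The `back` table plays the role of the arrows drawn in the hand-filled matrix: each cell stores the predecessor that supplied the maximum.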



Viterbi Algorithm

Regular Expression to HMM

State HMM



Profile HMM

A profile HMM can be obtained from a multiple alignment and can be used for searching a

database for other members of the family in the alignment very much like standard profiles

Parts

The bottom line of states are called the main states, because they model the columns of the alignment.

In these states the probability distribution is just the frequency of the amino acids or

nucleotides as in the previous model of the DNA motif.

The second row of diamond shaped states are called insert states and are used to model

highly variable regions in the alignment.

They function exactly like the top state in the previous example.

The top line of circular states are called delete states.

These are a different type of state, called a silent or null state.

They do not match any residues; they make it possible to jump over one or more columns in the alignment, to model the situation when just a few of the sequences have a gap ('-') in the multiple alignment at a position.



Genetic Algorithm

Gene: A specific sequence of nucleotide bases that carries the information required for constructing proteins.

Proteins: Provide structural components of cells, tissues and enzymes for essential biochemical reactions.

Overview

- A class of probabilistic optimized search algorithms based on the mechanics of natural selection and natural genetics.

- Inspired by the biological evolution process. Dinosaurs are dead, cockroaches still surviving.

- Based on the survival of the fittest among string structures, with a structured yet randomized exchange, to form a search algorithm.

What is Genetic Algorithm

A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it

evolve by iteratively applying a set of stochastic operators.
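As a rough sketch of that loop, the following code evolves bit strings with tournament selection, one-point crossover and bit-flip mutation. The operator choices, the parameter defaults and the toy fitness function (number of 1 bits) are all assumptions made for illustration; the notes do not prescribe any of them.

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=20, generations=30,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximise fitness over bit strings; returns (best individual, best-fitness history)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        new_pop = [ranked[0][:]]                  # elitism: the best survives unchanged
        while len(new_pop) < pop_size:
            # selection: the fitter of two random individuals becomes a parent
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:     # one-point crossover
                cut = rng.randint(1, n_bits - 1)
                child = p1[:cut] + p2[cut:]
            # mutation: flip each bit with a small probability
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            new_pop.append(child)
        pop = new_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

# Toy problem: maximise the number of 1 bits in a 16-bit string.
best, history = genetic_algorithm(sum, n_bits=16)
```

Because the best individual is copied into every new generation, the recorded best fitness can never decrease from one generation to the next.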

GA vs. Normal Optimization

- GAs work with a coding of the parameter set, not the parameters themselves.

- GAs search from a population of points, not a single point.

- GAs use payoff (objective function) information, not derivatives or other auxiliary knowledge.

- GAs use probabilistic transition rules, not deterministic rules.

Nature | Genetic Algorithms

Environment | Optimization problem

Individuals living in that environment | Feasible solutions

Individual's degree of adaptation to its surrounding environment | Solution's quality (fitness function)

A population of organisms (species) | A set of feasible solutions

Selection, recombination and mutation in nature's evolutionary process | Stochastic operators

Evolution of populations to suit their environment | Iteratively applying a set of stochastic operators on a set of feasible solutions

Limitations and weakness

- The idea behind GA is extremely appealing, but has the following limitations:

- It is quite unnatural to model most applications in terms of genetic operators like mutation and crossover on bit strings.

- The pseudo-biology adds another level of complexity between you and your problem.

- Their weakness is that the process of selection alone is too systematic and predictable, not like creativity as we know it.

- Binary representations are limited in their operations, and for certain problems alternative operators and representations must be used.

- Crossover and mutation make no use of real problem structure, so large fractions of transitions lead to inferior solutions, and convergence is slow.

- GAs take a very long time on non-trivial problems, as they generally require more objective function evaluations than classical optimization techniques. This is a major practical limitation.

- The analogy with evolution is appropriate, but evolution took millions of years to achieve significant improvement. Can we afford to wait that long?



Iterative search

1. Select Current solution

2. Create new solution

3. Check whether the solution meets the desired criterion

4. If not, repeat (1) and (2) again

5. Else stop

Neighborhood Methods

- Same as iterative search

- Steps (Descent method):

o Start from a feasible solution i

o N(i) is the set of all neighbors of the solution i

o A solution j is sought in N(i) such that f(j) < f(i); if no such j exists, the search stops
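The descent steps can be sketched as a short loop. The objective used here is a made-up list of values indexed by position, and a solution's neighbors are simply the adjacent indices; both are assumptions chosen so the behavior is easy to follow by hand.

```python
def descent(f, start, neighbors):
    """Move to a strictly better neighbor until none exists (a local minimum)."""
    i = start
    while True:
        better = [j for j in neighbors(i) if f(j) < f(i)]
        if not better:
            return i                 # no neighbor improves f: the search stops
        i = min(better, key=f)       # take the best improving neighbor

# Hypothetical bumpy objective: global minimum at index 5, local minima at 1 and 3.
values = [5, 3, 4, 2, 6, 1, 7]

def f(i):
    return values[i]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(values)]

from_left = descent(f, 0, neighbors)
from_middle = descent(f, 4, neighbors)
```

Starting at index 0 the search halts at index 1, a local minimum, even though index 5 holds a lower value; starting at index 4 it reaches index 5. The outcome depends on the starting point, which is the weakness simulated annealing is meant to address.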

Simulated Annealing

Intro to Annealing

- Steps:

o Heat to the desired temperature

o Hold at that temperature

o Cool to room temperature

- Sequences of time and temperature

o Annealing schedules

o Cooling schedules

- These schedules are critical in two ways:

o A difference between the outside and inside temperatures causes temperature gradients and internal stress, and hence cracks

o The actual annealing time should be long enough for the transformation to take place

Intro to Simulated Annealing

- Same as neighborhood method

- Problem with the descent method when a global optimum is wanted:

- It gets stuck in a local minimum, where every neighbor has f(j) > f(i)

- Generating distribution

o Generates possible valleys or states to be explored

- Acceptance distribution

o Depends on the difference between the function value of the present generated valley to be explored and the last saved lowest valley

- For certain NP-hard problems, where a local extremum is mostly reached and a better global extremum is missed

o Simulated annealing outperforms straightforward iterative improvement

o Tradeoff: longer running times

Simulated Annealing

- Boltzmann Probability Factor:

o p = \exp(-\delta f / T)

where \delta f is the increase in f and

T is a control parameter.



- Requirements for SA:

o A representation of possible solutions

o A generator of random changes in solutions

o A means of evaluating the problem functions

o Annealing schedule

- Method:

1. Input and assess Initial solution

2. Estimate initial temperature

A suitable T_0 is one that results in an average increase acceptance probability p_0 of about 0.8.

The value of T_0 will clearly depend on f and is, hence, problem-specific.

T_0 is estimated by conducting an initial search in which all increases are accepted and calculating the average increase in f observed, \overline{\delta f^{+}}:

T_0 = -\overline{\delta f^{+}} / \ln(p_0)

3. Generate new solution

4. Assess new solution

5. Check whether to Accept new solution

6. If yes, then update stores

7. If no, skip (6)

8. Adjust temperature

9. Check whether to Terminate search

10. If (9) yes STOP

11. Else start again from step (3)
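Step 2 of the method, estimating the initial temperature, can be sketched as follows: run a short trial search in which every move is accepted, average the observed increases in f, and choose the temperature at which such an average increase would be accepted with probability p0. The function name, the `random_neighbor(x, rng)` callback convention, the number of trial moves and the toy objective are all assumptions for this illustration.

```python
import math
import random

def estimate_initial_temperature(f, x0, random_neighbor, p0=0.8, trials=100, seed=0):
    """Return (T0, mean increase) from a trial search in which all moves are accepted."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    increases = []
    for _ in range(trials):
        y = random_neighbor(x, rng)
        fy = f(y)
        if fy > fx:
            increases.append(fy - fx)   # record only the increases in f
        x, fx = y, fy                   # every move is accepted in this phase
    if not increases:
        raise ValueError("no increase observed; cannot estimate T0")
    mean_increase = sum(increases) / len(increases)
    # solve exp(-mean_increase / T0) = p0 for T0
    return -mean_increase / math.log(p0), mean_increase

# Toy objective on the integers: f(x) = x^2, neighbors are x - 1 and x + 1.
t0, mean_increase = estimate_initial_temperature(
    lambda x: x * x, 0, lambda x, rng: x + rng.choice([-1, 1]))
```

Since ln(p0) is negative for p0 below 1, the estimate is always positive.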

- Acceptance of search steps (Metropolis criterion):

o Assume the performance change in the search direction is \Delta

o Always accept a descending step (\Delta \le 0)

o Accept an ascending step only if it passes a random test:

\exp(-\Delta / T) > \mathrm{random}[0, 1]

- Cooling Schedule

o T, the annealing temperature, is the parameter that controls the frequency of acceptance of ascending steps

o We gradually reduce the temperature T(k)

o At each temperature, search is allowed for a certain number of steps, L(k)

o The choice of parameters {T(k), L(k)} is called the cooling schedule



o EX:

Set L = n, the number of variables in the problem.

Set T(0) such that \exp(-\Delta / T(0)) \approx 1.

Set T(k+1) = \alpha \cdot T(k), where \alpha is a constant smaller than but close to 1.

- Algorithm
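A minimal sketch of the overall algorithm, under stated assumptions: an ascending step of size delta is accepted when exp(-delta / T) exceeds a uniform random number, the temperature is multiplied by a constant alpha slightly below 1 after a fixed number of steps, and the search ends when T drops below a small threshold. The parameter names, the default values, the stopping rule and the toy objective are illustrative choices, not taken from the notes.

```python
import math
import random

def simulated_annealing(f, x0, random_neighbor, t0, alpha=0.95,
                        steps_per_temp=20, t_min=1e-3, seed=0):
    """Minimise f; returns (best solution found, its f value)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t0
    while t > t_min:
        for _ in range(steps_per_temp):
            y = random_neighbor(x, rng)
            delta = f(y) - fx
            # Metropolis criterion: always accept descents,
            # accept ascents only if they pass the random test
            if delta <= 0 or math.exp(-delta / t) > rng.random():
                x, fx = y, fx + delta
                if fx < fbest:
                    best, fbest = x, fx   # update the stored best solution
        t *= alpha                         # geometric cooling schedule
    return best, fbest

# Hypothetical bumpy objective with several local minima.
values = [5, 3, 4, 2, 6, 1, 7]

def step(i, rng):
    j = i + rng.choice([-1, 1])
    return min(max(j, 0), len(values) - 1)

best, fbest = simulated_annealing(lambda i: values[i], 0, step, t0=5.0)
```

The routine keeps the best solution seen so far separately from the current one, so the returned value is never worse than the starting point even when uphill moves are accepted along the way.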

- Strengths of SA

o Simulated annealing can deal with highly nonlinear models and noisy data and many

constraints.

o It is a robust and general technique.

o Its main advantages over other local search methods are its flexibility and its ability to

approach global optimality.

o The algorithm is quite versatile since it does not rely on any restrictive properties of the

model.



Heuristic Algorithm

Intro

- Dynamic Programming

o Loses its appeal when

the database size is about 10^9

o It takes too much time

- Alternatives:

o Hardware

Very fast

Expensive

o Distributed computing

Slower than the hardware version but still fast

Expensive

o Heuristics

Much faster than dynamic programming

Heuristic Method

- Definition:

o A heuristic method is an algorithm that gives only an approximate solution to a given problem.

- Important Points:

o Sometimes we are not able to formally prove that this solution actually solves the problem,

o but heuristic methods are commonly used because they are much faster than exact algorithms.

o In addition, this is a software-based strategy, which is therefore relatively cheap and available to any researcher.



- Commonly used heuristics are based on the following observations:

o Even linear time complexity will be problematic when the database size is huge (over 10^9).

o Preprocessing of the database is desirable, since numerous queries are run on an infrequently updated database.

o Substitutions are much more likely than indels.

o We expect homologous sequences to contain a lot of segments with matches or substitutions, but without indels and gaps.

o These segments can be used as starting points for further searching.

- Heuristic methods:

o FASTA

o BLAST

o BLAST2

FASTA

Intro

The FASTA algorithm is a heuristic method for string comparison.

It compares a query string against a single text string.

When searching the whole database for matches to a given query, we compare the query using

the FASTA algorithm to every string in the database.

Good local alignment is likely to have exact matching subsequences.

The algorithm uses this property and focuses on segments in which there will be an absolute

identity between the two compared strings.

We can use the alignment Dot-Plot matrix for finding these identical regions.



Method (Steps):

1. Finding Hot Spots (Regions with exact match (ktup))

2. Finding 10 Best Diagonal Runs (No indels)

3. Evaluate Diagonal Runs (using substitution matrices, find Init1)

4. Combine Good Diagonal Runs (Init N (Max. Weight Path in the Graph))

5. Finding alternative local alignment (Dynamic Programming in a Band (opt score))

6. Ranking

Small Definitions:

- Hot Spots:

o The first step of the algorithm is to determine all exact matches of length k (word size) between the two sequences, called hot spots

- Ktup:

o (short for k respective tuples) - an integer parameter, which specifies the length of the matching substrings

- Diagonal Runs:

o A diagonal run is a set of hot spots that lie in a consecutive sequence on the same diagonal (not necessarily adjacent along the diagonal, i.e., spaces between these hot spots are allowed).
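Hot spots and their diagonals can be sketched with a lookup table of k-length words. This is only an illustration of the first FASTA step, not the FASTA program itself: the function names are invented, positions are 0-based, and a hot spot (i, j) means that the word starting at query position i equals the word starting at text position j, so it lies on diagonal j - i.

```python
from collections import defaultdict

def hot_spots(query, text, ktup):
    """Return all (i, j) with query[i:i+ktup] == text[j:j+ktup]."""
    index = defaultdict(list)
    for j in range(len(text) - ktup + 1):
        index[text[j:j + ktup]].append(j)       # word -> positions in the text
    spots = []
    for i in range(len(query) - ktup + 1):
        for j in index.get(query[i:i + ktup], []):
            spots.append((i, j))
    return spots

def spots_per_diagonal(spots):
    """Group hot spots by diagonal (j - i); spots sharing a diagonal can form a run."""
    diagonals = defaultdict(list)
    for i, j in spots:
        diagonals[j - i].append((i, j))
    return dict(diagonals)

spots = hot_spots("ACGTAC", "TTACGTAC", 3)
diagonals = spots_per_diagonal(spots)
```

In this toy case four hot spots share diagonal 2, which reflects the query matching the text at an offset of two characters, while a single stray hot spot sits on diagonal -2.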

Comments

Larger ktuple increases speed since fewer hits are found but it also decreases sensitivity for

finding similar but not identical sequences since exact matches of this length are required



BLAST

Intro

BLAST (Basic Local Alignment Search Tool)

The BLAST algorithm was developed in 1990 (FASTA in 1985)

The motivation for the development of BLAST was the need to increase the speed of FASTA by

finding fewer and better hot spots.

The idea was to integrate the substitution matrix in the first stage of finding the hot spots.

Method (Steps):

1. k-letter word list of the query sequence

2. List the possible matching words

3. Scan Database (For Seeding)

4. Extend exact matches to High Scoring Segment Pairs

5. List all HSP above cutoff score

6. Evaluate Statistical Significance of HSP

7. Combine HSP

8. Show gapped local alignment
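Steps 1 to 3 can be sketched as follows: list the k-letter words of the query, expand each into the words that score at least a threshold T against it, and scan a database sequence for any of those words. The scoring function here is a toy match/mismatch scheme over a DNA alphabet, used only to keep the example small; real searches use a substitution matrix, and all names in the code are hypothetical.

```python
from itertools import product

def toy_score(a, b):
    """Assumed toy scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

def neighborhood_words(word, alphabet, score, threshold):
    """All words of the same length whose ungapped score against word is >= threshold."""
    return ["".join(c) for c in product(alphabet, repeat=len(word))
            if sum(score(a, b) for a, b in zip(word, c)) >= threshold]

def seed_hits(query, database_seq, k, alphabet, score, threshold):
    """Return (query_pos, db_pos) pairs where a database word matches a neighborhood word."""
    lookup = {}
    for i in range(len(query) - k + 1):
        for w in neighborhood_words(query[i:i + k], alphabet, score, threshold):
            lookup.setdefault(w, []).append(i)
    return [(i, j) for j in range(len(database_seq) - k + 1)
            for i in lookup.get(database_seq[j:j + k], [])]

words = neighborhood_words("ACG", "ACGT", toy_score, 3)
hits = seed_hits("ACG", "TTACTT", 3, "ACGT", toy_score, 3)
```

With this scoring and T = 3, the neighborhood of "ACG" is the word itself plus the nine words that differ from it in exactly one position, so the database word "ACT" produces a seed even though it is not an exact match.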

Small Definitions:

- Segment pair:

o Given two strings S1 and S2, a segment pair is a pair of equal-length substrings of S1 and S2, aligned without gaps.

- A locally maximal segment is a segment whose alignment score (without gaps) cannot be improved by extending it or shortening it.

- A maximum segment pair (MSP) in S1 and S2 is a segment pair with the maximum score over all segment pairs in S1, S2.

- When comparing all the sequences in the database against the query, BLAST attempts to find all the database sequences that, when paired with the query, contain an MSP above some cutoff score S. We call such pairs high-scoring segment pairs (HSPs).
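Extending a seed into a locally maximal segment pair can be sketched as an ungapped walk in both directions that remembers the best score seen. The stopping rule used here, giving up once the running score falls more than `x_drop` below the best, is an assumption for the sketch, as are the +1/-1 scoring and the example sequences.

```python
def extend_seed(q, d, qi, di, k, score, x_drop):
    """Ungapped extension of a length-k seed at q[qi], d[di].
    Returns (q_start, q_end, best_score) with q_end exclusive."""
    best = cur = sum(score(q[qi + t], d[di + t]) for t in range(k))
    # extend to the right of the seed
    end = qi + k
    i, j = qi + k, di + k
    while i < len(q) and j < len(d) and cur >= best - x_drop:
        cur += score(q[i], d[j])
        i += 1
        j += 1
        if cur > best:
            best, end = cur, i
    # extend to the left of the seed, starting again from the best score
    cur = best
    start = qi
    i, j = qi - 1, di - 1
    while i >= 0 and j >= 0 and cur >= best - x_drop:
        cur += score(q[i], d[j])
        if cur > best:
            best, start = cur, i
        i -= 1
        j -= 1
    return start, end, best

def unit_score(a, b):
    return 1 if a == b else -1

query = "GGACGTACCC"
subject = "TTACGTAGTT"
start, end, best = extend_seed(query, subject, 2, 2, 3, unit_score, x_drop=2)
```

The seed "ACG" grows to "ACGTA"; extending further in either direction only lowers the score, so the segment is locally maximal. It would be reported as an HSP if its score exceeded the cutoff S.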

Types of BLAST

- Two hit

o extension step typically accounts for 90% of BLAST's execution time

o key idea: do extension only when there are two hits on the same diagonal within distance A of each other

o to maintain sensitivity, lower the T parameter

more single hits found

o but only a small fraction have an associated 2nd hit

- Gapped

o trigger gapped alignment if two-hit extension has a sufficiently high score

o run DP process both forward & backward from seed

o prune cells when local alignment score falls a certain distance below best score yet

- PSI BLAST

o use results from BLAST query to construct a profile matrix

o search database with profile instead of query sequence

o iterate

o search database with profile instead of query sequence

o iterate

- Profile creation

o The program initially operates on a single query sequence by performing a gapped

BLAST search

o Then, the program takes significant local alignments (hits) found, constructs a multiple

alignment and abstracts a position-specific scoring matrix (PSSM) from this alignment.

o Steps:

Take significant BLAST hits

Make an alignment

Construct profile
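The last step, constructing the profile, can be sketched as a position-specific scoring matrix built from already aligned hits. The assumptions here are substantial and only for illustration: a DNA alphabet, a uniform background frequency, a pseudocount of 1 per letter, log-odds scores in bits, and gap characters simply ignored. PSI-BLAST's actual weighting scheme is more involved.

```python
import math
from collections import Counter

def build_pssm(aligned_hits, alphabet="ACGT", pseudocount=1.0):
    """One dict per alignment column mapping each letter to a log-odds score."""
    length = len(aligned_hits[0])
    if any(len(s) != length for s in aligned_hits):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)          # assumed uniform background
    pssm = []
    for col in range(length):
        # count letters in this column, skipping gaps and unknown symbols
        counts = Counter(s[col] for s in aligned_hits if s[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({a: math.log2(((counts[a] + pseudocount) / total) / background)
                     for a in alphabet})
    return pssm

def score_window(pssm, window):
    """Score a sequence window of the same length as the profile."""
    return sum(column[ch] for column, ch in zip(pssm, window))

hits = ["ACG", "ACG", "ATG"]
pssm = build_pssm(hits)
```

In the second column, C is seen twice, T once and G never, so their scores are ordered accordingly. A window that matches the consensus scores higher than one that does not, which is what lets the profile replace the single query sequence in the next search round.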


