bio chap notes

Upload: alone-enola

Post on 09-Apr-2018


  • 8/8/2019 Bio Chap Notes

    1/27

    NOTES

    Protein Docking

    Definition:

    The process of predicting the structure of complexes formed by protein-protein and protein-ligand

    interaction is called docking

    Introductory Details

    IO

    Given two biological molecules determine:

    Whether the two molecules interact

    - is there an energetically favorable orientation of the

    two molecules such that one may modify the other's

    function

    - do the two molecules fit together in any energetically

    favorable way

    If so, what is the orientation that maximizes the

    interaction while minimizing the total energy of the

    complex

    Importance

    DEBAAP

    Design of novel pharmaceuticals, drugs, agricultural and biological products.

    Essential Roles played by proteins

    Bio-Molecular interactions are the core of all Biological Processes

    Altering behavior of Protein

    Act as Switches


    Protein Engineering

    Difficulty

    HT

    Both molecules are flexible and may alter each other's structure as they interact:

    Hundreds to thousands of degrees of freedom

    Total possible conformations are astronomical

    Types of Docking

    RF

    Rigid Docking

    Both the protein and ligand are considered as rigid molecules

    Six Degrees of Freedom

    Software: GRAMM

    Flexible Docking

    Either the protein or the ligand or both are considered partially or fully flexible

    Search space vast

    NEED FOR FLEXIBLE DOCKING

    MSLT

    It has been observed that when two molecules dock, the movement of atoms may involve

    main chains

    side chains

    large changes in the interface upon complex formation

    The atom movements are also observed in the exposed non-interface residues caused

    by flexibility and disorder.


    Global Optimization of Potential Energy Function

    Need and Details

    Final conformation of the docked complex is the one with the minimum energy

    The multidimensional energy surface has many local minima (the multiple-minima problem).

    Need for new specialized global search strategies

    Examples

    SD

    S:SG

    D:A

    Statistical / Stochastic

    Simulated Annealing

    Genetic Algorithms

    Deterministic

    Alpha Branch and Bound (αBB)

    Docking Methods

    Types of Flexible docking models

    RPH

    Rigid Protein and Flexible Ligand

    Partially Flexible Protein and Flexible Ligand

    Hinge Bending in Protein and Ligand


    Rigid Protein Flexible Ligand

    Method 1:

    Details:

    Finds optimal binding site in substrate by exploring various binding sites and

    ligand conformations using

    Simulated Annealing,

    Genetic Algorithms

    and Local Search.

    Software:

    AutoDock

    Drawbacks:

    Only considers ligand flexibility and hence cannot predict the conformational

    changes in the receptor Protein

    The selection of flexible torsional angles in the ligand has to be done by the user

    Method 2: DSD D:UPU S:F - D:AG

    Details:

    Use incremental construction algorithm

    Place an anchor fragment of ligand in the binding site

    Uses greedy algorithm to add fragments and complete the ligand structure

    Software

    FlexX

    Drawbacks:

    Anchor Fragment selection is difficult

    Greedy algorithm propagates errors resulting from initial bad choices


    Partial Flexible Protein - Flexible Ligand

    Details:

    Induced Fit Model: BF

    Both Protein and Ligand are flexible and undergo a conformational change upon

    binding to form a minimum energy perfect-fit

    Full Protein flexibility is computationally intractable

    Method 1: DSD D:RPC S:I D:AI

    Details:

    Internal Coordinate Mechanics (ICM) RPC

    Partial protein flexibility in the side chains of the active site

    Randomly changes the position of ligand followed by energy

    minimization

    Comparative assessment with AutoDock and FlexX showed that ICM

    provided the highest accuracy

    Software:

    ICM (Internal Coordinate Mechanics)

    Drawbacks AI

    Allows partial protein flexibility in the side chains of the active site only

    Ignores any changes in the backbone atoms or atoms in other regions


    Protein Folding

    Definition

    Protein folding is the physical process by which a one dimensional chain of amino acids folds

    into its characteristic and functional three-dimensional structure.

    1D to 3D

    Motivation for Protein 3D structure Prediction

    Main Motivation:

    The structure of a protein is directly related to the protein's function.

    The reasons for research of 3D structure are: MAID

    Medicine:

    Understanding biological functions. Binding and unbinding of proteins

    constitute much of the cellular activity of living organisms.

    Agriculture:

    Genetic engineering of better and richer crops.

    Industry:

    Synthesis of enzymes.

    Drug Design:

    Finding targets for docking drugs.

    What determines Fold?

    In general, the amino-acid sequence of a protein determines its 3D shape,

    but there are some exceptions: AS

    all proteins can be denatured

    some molecules have multiple conformations

    Difficulty: Is Problem Hard?

    Yes. HD

    Huge Search Space:

    Difficult to decide criterion for correct fold

    Solvable?

    Yes, long live Nature (proteins do fold in nature)

    Thus

    1. Nature must not sample all conformations

    2. Nature knows the correct criteria

    Folding Method Criterion

    The method of choice for folding a given protein depends on

    existence of a similar protein whose structure is already known, and

    on the extent of such similarity

    Methods of Protein Prediction

    1. Comparative Modeling or Homology Modeling

    2. Ab initio prediction (from scratch)

    3. Fold Recognition or threading (fitting a sequence to existing models)


    Homology or Comparative Modeling

    Introduction

    Comparative modeling allows us to build a 3-D model for a protein of known amino-acid (aa)

    sequence but unknown structure, using another protein of known sequence and

    structure as a template

    Steps: IADIBBRE

    Identify Homologous/Parent(s) Sequences: SD

    Search the target sequence against the sequences in the protein data bank

    using

    FASTA

    or BLAST.

    Distant homologs can be searched for using PSI-BLAST.

    Align the target sequence with Parent(s) 3IA

    If sequence identity between target and parent is >70% the alignment is

    generally trivial

    If identity is below 40% then alignment becomes very difficult

    Alignment is done using Dynamic Programming as discussed earlier.

    If more than one parent is identified, it is better to first align the parents structurally,

    then calculate a multiple sequence alignment (MSA), and then align the target sequence to that multiple alignment.


    Determine SCRs (structurally conserved regions) and SVRs (structurally variable regions)

    Multiple Parents

    If multiple parents are present then those regions that are same in all

    parents are termed as SCRs or core regions and other variable regions

    are termed as SVRs

    Single Parent

    Whereas if one parent is present then initially we assume that all

    regions are SCRs except where indels occur. If even one indel occurs in a

    loop region then whole of the loop is considered as SVR.

    Inherit the SCRs from the parent(s)

    The SCR is copied from the parent (s) to use in the model.

    Build the SVRs

    SVRs are almost always loop regions

    When these vary in length from the loops present in the parent

    structure(s), they will be built with lower accuracy.

    Even where loop regions are conserved in length they can adopt quite

    different conformations.

    Build the sidechains

    Rotamer library (ref: Developing Bioinformatics Computer Skills by Cynthia Gibas, p. 261):

    A rotamer library contains information about the allowed rotations of the remote amino acid

    side chain atoms

    Refine Model

    Usually through (EM) Energy Minimization

    Standard mathematical minimization algorithms

    EM is only able to move atoms such that a local minimum on the energy surface

    is found.

    Evaluate Errors in the model

    RMSD (Root Mean Square Deviation) measures how similar one structure is to

    another.

    $\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2}$

    $d_i$: distance between the two atoms of the i-th pair; N: number of atom pairs
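The RMSD measure above can be sketched in a few lines. This is a minimal illustration, assuming the two structures are already superimposed and that atoms are paired by list index; the function name `rmsd` and the toy coordinates are made up for the example.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points, paired by index."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared distance d_i^2 between the i-th pair of atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy data: the reference is the model shifted by 1 along z, so every d_i is 1.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 2.0, 1.0)]
print(rmsd(model, reference))
```

In practice the structures are first optimally superimposed; that step is left out here.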

    Exact String Matching

    Applications:

    Exact string matching has applications in many fields, such as:

    Text searching

    Molecular biology

    Data compression

    and so on


    Motivation

    Recognizing DNA Contamination

    Given a string S1 (the newly isolated and sequenced string of DNA) and a known string S2 (the

    combined sources of possible contamination), find all substrings of S2 that occur in S1 and are

    longer than some given length.

    These substrings are candidates of unwanted pieces of S2 that have contaminated the desired

    DNA string.
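A small sketch of the contamination check described above. It is an assumption-laden illustration, not the method from the notes: it uses a simple quadratic-time dynamic program over common suffix lengths, whereas suffix-tree based solutions are faster on long sequences. The function name `common_substrings` and the toy strings are hypothetical.

```python
def common_substrings(s1, s2, min_len):
    """Return the maximal substrings shared by s1 and s2 that are longer than min_len."""
    found = set()
    prev = [0] * (len(s2) + 1)  # prev[j]: common suffix length of s1[:i-1] and s2[:j]
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
                # the match is maximal if it cannot be extended by one more character
                at_end = i == len(s1) or j == len(s2)
                if (at_end or s1[i] != s2[j]) and curr[j] > min_len:
                    found.add(s1[i - curr[j]:i])
        prev = curr
    return found

# s1: newly sequenced DNA, s2: possible contamination source (toy data)
print(common_substrings("ACGTTTGACCA", "GGTTTGAGG", 3))
```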

    Z-Algorithm

    Given string S = {aabadaabcaaba}

    Step 1: number the characters

    a a b a d a a b c a a b a

    1 2 3 4 5 6 7 8 9 10 11 12 13

    Step 2: Calculate Zs

    Z2=1{a..a} , Z3= 0, Z4=1{a..a}, Z5=0, Z6=3{aab..aab}, Z7=1{a..a}, Z8=0, Z9=0, Z10=4, Z11=1, Z12=0, Z13=1

    Step 3: Draw Z boxes

    | | |---------| |--------------|

    a a b a d a a b c a a b a

    1 2 3 4 5 6 7 8 9 10 11 12 13

    Step 4: obtain li and ri values:

    i 1 2 3 4 5 6 7 8 9 10 11 12 13

    Li - 2 2 4 4 6 6 6 6 10 10 10 10

    Ri - 2 2 4 4 8 8 8 8 13 13 13 13


    How to find small pattern P in T

    Let $ be a character found neither in P nor in T.

    Build the string S = P$T.

    String S has length m+n+1, where m = |P|, n = |T| and m ≤ n.

    Run the Z algorithm on S.

    The resulting values Zi satisfy Zi ≤ m. This is due to the presence of the character $ in

    S. Every position i > m+1 with Zi = m marks an occurrence of P in T starting at position i-(m+1) of T.

    CASES

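    Case 1: k > r : (k is outside every Z-box found so far)

    Zk is found by explicit comparison of characters starting at position k

    If Zk > 0, r is set to k+Zk-1 and l is set to k

    Case 2: k ≤ r: (k is inside a Z-box; let k' = k-l+1 and B = S[k..r], so |B| = r-k+1)

    Case 2a: Zk' < |B|

    The Z-box at k is the same as the one at k'

    Hence, Zk is set to Zk'

    Values of r and l remain unchanged

    Case 2b: Zk' >= |B|

    The algorithm starts searching for the first mismatch from position r+1

    Let q be the position of that mismatch

    Zk is set to (q-1) - k+1 = q-k

    r is set to q-1 and l is set to k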
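The cases above, together with the P$T construction, can be sketched as follows. This is a minimal illustration using 0-based indices (the notes count from 1); the names `z_array` and `find_pattern` are chosen for the example, and storing the string length in `z[0]` is only a convention.

```python
def z_array(s):
    """Z values of s with 0-based indices; z[0] is set to len(s) by convention."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    l = r = 0  # s[l..r] is the Z-box reaching furthest right; r == 0 means none yet
    for k in range(1, n):
        if k > r:
            # Case 1: k is outside every Z-box, compare explicitly from k
            q = k
            while q < n and s[q] == s[q - k]:
                q += 1
            z[k] = q - k
            if z[k] > 0:
                l, r = k, q - 1
        else:
            k_prime = k - l   # position matching k inside the prefix
            beta = r - k + 1  # |B|, the length of s[k..r]
            if z[k_prime] < beta:
                z[k] = z[k_prime]  # Case 2a: l and r stay unchanged
            else:
                # Case 2b: look for the first mismatch starting at r + 1
                q = r + 1
                while q < n and s[q] == s[q - k]:
                    q += 1
                z[k] = q - k
                l, r = k, q - 1
    return z

def find_pattern(p, t, sep="$"):
    """0-based start positions of p in t, found via the Z values of p + sep + t."""
    if not p or sep in p or sep in t:
        raise ValueError("pattern must be non-empty and sep must not occur in p or t")
    m = len(p)
    z = z_array(p + sep + t)
    return [i - (m + 1) for i in range(m + 1, len(z)) if z[i] == m]

print(z_array("aabadaabcaaba")[1:])  # Z2..Z13 of the worked example
print(find_pattern("aab", "aabadaabcaaba"))
```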


    Hidden Markov Model

    Definitions

    A Markov chain is a random process with the property that the next state depends only on the

    current state.

    A hidden Markov model (HMM) is a statistical Markov model in which the system being

    modeled is assumed to be a Markov process with unobserved (hidden) states.

    Each state has a probability distribution over the possible output tokens. Therefore the

    sequence of tokens generated by an HMM gives some information about the sequence of

    states.

    Note that the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model; even if the model parameters are known exactly, the model

    is still 'hidden'.

    Points:

    In a regular Markov model, the state is directly visible to the observer, and therefore the state

    transition probabilities are the only parameters.

    In a hidden Markov model, the state is not directly visible, but output, dependent on the state, is

    visible.

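    Calculating the Viterbi value

    $V_l(i) = e_l(x_i) \cdot \max_k \left( V_k(i-1) \cdot a_{kl} \right)$

    where $e_l(x_i)$ is the emission probability of the output $x_i$ in state l, $V_k(i-1)$ is the value of state k in the previous column, and $a_{kl}$ is the transition probability from k to l.

    Note: the traceback direction is taken from the cell that gives the maximum value.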
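The recurrence and the traceback note can be sketched as below. The two-state model (states A and B, outputs x and y) is invented purely for illustration, and plain probabilities are used for readability; for long sequences one would normally work with logarithms to avoid underflow.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability) for the observations."""
    # first column: start probability times emission of the first output
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        V.append({})
        back.append({})
        for l in states:
            # previous state k that maximizes V_k(i-1) * a_kl
            best_k = max(states, key=lambda k: V[i - 1][k] * trans_p[k][l])
            V[i][l] = emit_p[l][obs[i]] * V[i - 1][best_k] * trans_p[best_k][l]
            back[i][l] = best_k  # remember where the maximum came from
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])  # follow the stored pointers backwards
    path.reverse()
    return path, V[-1][last]

# Hypothetical model: each state prefers one output and tends to persist.
states = ["A", "B"]
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
path, prob = viterbi("xxyy", states, start_p, trans_p, emit_p)
print(path, prob)
```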


    Viterbi Algorithm

    Regular Expression to HMM

    State HMM


    Profile HMM

    A profile HMM can be obtained from a multiple alignment and can be used for searching a

    database for other members of the family in the alignment very much like standard profiles

    Parts

    The bottom line of states are called the main states, because they model the columns of the alignment.

    In these states the probability distribution is just the frequency of the amino acids or

    nucleotides as in the previous model of the DNA motif.

    The second row of diamond shaped states are called insert states and are used to model

    highly variable regions in the alignment.

    They function exactly like the top state in the previous example.

    The top line of circular states are called delete states.

    These are a different type of state, called a silent or null state.

    They do not match any residues,

    they make it possible to jump over one or more columns in the alignment, to model the

    situation when just a few of the sequences have a gap ('-') in the multiple alignment at a

    position.


    Genetic Algorithm

    Gene: A specific sequence of nucleotide bases that carries the information required for constructing proteins.

    Proteins: Provide the structural components of cells and tissues, and the enzymes for essential biochemical reactions.

    Overview

    - A class of probabilistic search and optimization algorithms based on the mechanics of natural selection and natural genetics.

    - Inspired by the biological evolution process. Dinosaurs are dead; cockroaches still survive.

    - Based on the survival of the fittest among string structures, with a structured yet randomized exchange, to form a search algorithm.

    What is Genetic Algorithm

    A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it

    evolve by iteratively applying a set of stochastic operators.
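A minimal sketch of such a population loop, assuming bit-string individuals and the toy "OneMax" fitness (count of 1 bits). The operator choices here (tournament selection, one-point crossover, bit-flip mutation, one elite copy) and all parameter values are illustrative assumptions, not something the notes prescribe.

```python
import random

def one_max(bits):
    """Toy fitness: the number of 1 bits."""
    return sum(bits)

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=40,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Evolve bit strings; return (best individual, best fitness per generation)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        new_pop = [elite[:]]  # elitism: the best individual survives unchanged
        while len(new_pop) < pop_size:
            # selection: best of three randomly drawn individuals, twice
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # mutation: flip each bit with a small probability
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            new_pop.append(child)
        pop = new_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = genetic_algorithm(one_max)
print(one_max(best), history[0], history[-1])
```

Because one elite copy is kept, the best fitness never decreases from one generation to the next; whether the optimum is reached depends on the parameters and the random seed.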

    GA vs. Normal Optimization

    - GAs work with a coding of the parameter set, not the parameters themselves.

    - GAs search from a population of points, not a single point.

    - GAs use payoff (objective function) information, not derivatives or other auxiliary knowledge.

    - GAs use probabilistic transition rules, not deterministic rules.

    Nature                                    | Genetic Algorithms
    ------------------------------------------+------------------------------------------
    Environment                               | Optimization problem
    Individuals living in that environment    | Feasible solutions
    Individual's degree of adaptation to its  | Solution's quality (fitness function)
    surrounding environment                   |
    A population of organisms (species)       | A set of feasible solutions
    Selection, recombination and mutation in  | Stochastic operators
    nature's evolutionary process             |
    Evolution of populations to suit their    | Iteratively applying a set of stochastic
    environment                               | operators on a set of feasible solutions

    Limitations and weakness

    - The idea behind GA is extremely appealing, but it has the following limitations:

    - It is quite unnatural to model most applications in terms of genetic operators like mutation and crossover on bit strings.

    - The pseudo-biology adds another level of complexity between you and your problem.

    - Their weakness is that the process of selection alone is too systematic and predictable, not like creativity as we know it.

    - Binary representations are limited in their operations, and for certain problems alternative operators and representations must be used.

    - Crossover and mutation make no use of the real problem structure, so large fractions of transitions lead to inferior solutions, and convergence is slow.

    - GAs take a very long time on non-trivial problems, as they generally require more objective function evaluations than classical optimization techniques. This is a major practical limitation.

    - The analogy with evolution is appropriate, but evolution took millions of years to achieve significant improvement. Can we afford to wait that long?


    Iterative search

    1. Select the current solution

    2. Create a new solution

    3. Check whether the solution meets the desired criterion

    4. If not, repeat (1) and (2)

    5. Else stop

    Neighborhood Methods

    - Same as iterative search

    - Steps (descent method):

      - Start from a feasible solution i

      - N(i) is the set of all neighbors of the solution i

      - A solution j is sought in N(i) such that f(j) < f(i); if none is found, the search stops
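The descent method can be sketched as follows. The notes only require some neighbor j with f(j) < f(i); moving to the best such neighbor is an assumption made here, and the toy landscape is invented to show the search stopping in a local minimum.

```python
def descent(f, neighbours, start):
    """Move to a better neighbour until none exists; returns a local minimum."""
    i = start
    while True:
        better = [j for j in neighbours(i) if f(j) < f(i)]
        if not better:
            return i  # no improving neighbour: the search stops
        i = min(better, key=f)  # assumption: take the steepest improving move

# Toy landscape over indices 0..5; the global minimum is at index 4.
values = [5, 3, 4, 2, 1, 6]

def f(i):
    return values[i]

def neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(values)]

print(descent(f, neighbours, 0))  # stops at index 1, a local minimum
print(descent(f, neighbours, 5))  # reaches index 4, the global minimum
```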

    Simulated Annealing

    Intro to Annealing

    - Steps:

      - Heat to the desired temperature

      - Hold at that temperature

      - Cool to room temperature

    - Sequences of time and temperature

      - Annealing schedules

      - Cooling schedules

    - These schedules are critical in two ways:

      - A difference between the outside and inside temperatures causes temperature gradients and internal stress, and hence cracks

      - The actual annealing time should be long enough for the transformation to take place

    Intro to Simulated Annealing

    - Same as the neighborhood method

    - Problem with the descent method when looking for the global optimum:

      - It gets stuck in local minima, because a move to a neighbor j with f(j) > f(i) is never accepted

    - Generating distribution

      - Generates possible valleys or states to be explored

    - Accepting distribution

      - Depends on the difference between the function value of the newly generated valley to be explored and the last saved lowest valley

    - For certain NP-hard problems, where a local extremum is mostly reached and a better global extremum is missed:

      - Simulated annealing outperforms straightforward iterative improvement

      - Trade-off: longer running times

    Simulated Annealing

    - Boltzmann probability factor:

      - $p = \exp(-\Delta f / T)$

        where $\Delta f$ is the increase in f and

        T is a control parameter.


    - Requirements for SA:

      - A representation of possible solutions

      - A generator of random changes in solutions

      - A means of evaluating the problem functions

      - An annealing schedule

    - Method:

    1. Input and assess the initial solution

    2. Estimate the initial temperature

       A suitable $T_0$ is one that results in an average acceptance probability $p_0$ of about 0.8 for moves that increase f.

       The value of $T_0$ clearly depends on f and is hence problem-specific.

       $T_0$ is estimated by conducting an initial search in which all increases are accepted and calculating the average observed increase in f, $\overline{\Delta f^{+}}$:

       $T_0 = -\overline{\Delta f^{+}} / \ln(p_0)$

    3. Generate a new solution

    4. Assess the new solution

    5. Check whether to accept the new solution

    6. If yes, then update stores

    7. If no, skip (6)

    8. Adjust the temperature

    9. Check whether to terminate the search

    10. If (9) yes, STOP

    11. Else start again from step (3)
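The method steps can be sketched as below, under several assumptions that the notes do not fix: a geometric temperature reduction with factor `alpha`, a fixed number of moves per temperature, a stop once the temperature falls below `t_stop`, and a fallback initial temperature of 1.0 if no increase is observed. All function and parameter names are hypothetical.

```python
import math
import random

def estimate_t0(f, neighbour, x0, rng, p0=0.8, trials=200):
    """T0 = -(average increase in f) / ln(p0), from a walk that accepts every move."""
    x, increases = x0, []
    for _ in range(trials):
        y = neighbour(x, rng)
        delta = f(y) - f(x)
        if delta > 0:
            increases.append(delta)
        x = y  # every move is accepted during the estimate
    if not increases:
        return 1.0  # assumption: arbitrary fallback when no increase was seen
    return -(sum(increases) / len(increases)) / math.log(p0)

def metropolis_accept(delta, temperature, rng):
    """Always accept a decrease; accept an increase with probability exp(-delta/T)."""
    return delta <= 0 or math.exp(-delta / temperature) > rng.random()

def simulated_annealing(f, neighbour, x0, alpha=0.95, steps_per_t=20,
                        t_stop=1e-3, seed=0):
    rng = random.Random(seed)
    temperature = estimate_t0(f, neighbour, x0, rng)          # step 2
    current = best = x0                                       # step 1
    while temperature > t_stop:                               # step 9
        for _ in range(steps_per_t):
            candidate = neighbour(current, rng)               # step 3
            delta = f(candidate) - f(current)                 # step 4
            if metropolis_accept(delta, temperature, rng):    # step 5
                current = candidate                           # step 6
                if f(current) < f(best):
                    best = current
        temperature *= alpha                                  # step 8
    return best

# Toy landscape with a local minimum at index 1 and the global minimum at index 4.
values = [5, 3, 4, 2, 1, 6]

def f(i):
    return values[i]

def neighbour(i, rng):
    return max(0, min(len(values) - 1, i + rng.choice((-1, 1))))

result = simulated_annealing(f, neighbour, x0=0)
print(result, f(result))
```

Keeping the best solution seen so far ("stores") means the returned value is never worse than the starting point, even though the current solution is allowed to get worse along the way.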

    - Acceptance of search steps (Metropolis criterion):

      - Let $\Delta f$ be the performance change in the search direction

      - Always accept a descending step

      - Accept an ascending step only if it passes a random test:

        $\exp(-\Delta f / T) > \mathrm{random}(0, 1)$

    - Cooling schedule

      - T, the annealing temperature, is the parameter that controls the frequency of acceptance of ascending steps

      - We gradually reduce the temperature T(k)

      - At each temperature, the search is allowed to run for a certain number of steps, L(k)

      - The choice of the parameters {T(k), L(k)} is called the cooling schedule


      - Example:

        Set L = n, the number of variables in the problem.

        Set T(0) such that $\exp(-\Delta f / T(0)) \approx 1$.

        Set $T(k+1) = \alpha \cdot T(k)$, where $\alpha$ is a constant smaller than but close to 1.
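To make the example schedule concrete, here is a small sketch that lists the temperatures of a geometric schedule and shows how the acceptance probability of a fixed increase in f shrinks as T falls. The starting temperature, the factor and the increase of 1.0 are arbitrary illustration values.

```python
import math

def geometric_schedule(t0, alpha, n_levels):
    """Temperatures T(k) = alpha**k * T(0) for k = 0 .. n_levels - 1."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return [t0 * alpha ** k for k in range(n_levels)]

schedule = geometric_schedule(10.0, 0.5, 4)
for temperature in schedule:
    # probability of accepting an ascending step with increase 1.0 at this T
    print(temperature, math.exp(-1.0 / temperature))
```

A factor of 0.5 is used only to keep the printout short; the notes call for a constant close to 1, which cools much more slowly.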

    - Algorithm

    - Strengths of SA

      - Simulated annealing can deal with highly nonlinear models, noisy data and many constraints.

      - It is a robust and general technique.

      - Its main advantages over other local search methods are its flexibility and its ability to approach global optimality.

      - The algorithm is quite versatile, since it does not rely on any restrictive properties of the model.


    Heuristic Algorithm

    Intro

    - Dynamic Programming

      - Loses its attraction when

        the database size is over 10^9

      - It takes too much time

    - Alternatives:

      - Hardware

        Very fast

        Expensive

      - Distributed computing

        Slower than the hardware version but still fast

        Expensive

      - Heuristics

        Much faster than dynamic programming

    Heuristic Method

    - Definition:

      - A heuristic method is an algorithm that gives only an approximate solution to a given problem.

    - Important points:

      - Sometimes we are not able to formally prove that this solution actually solves the problem,

      - heuristic methods are commonly used because they are much faster than exact algorithms.

      - In addition, this is a software-based strategy, which is therefore relatively cheap and available to any researcher.


    - Commonly used heuristics are based on the following observations:

      - Even linear time complexity will be problematic when the database size is huge (over 10^9).

      - Preprocessing of the database is desirable, since numerous queries are run on an infrequently updated database.

      - Substitutions are much more likely than indels.

      - We expect homologous sequences to contain a lot of segments with matches or substitutions, but without indels and gaps.

      - These segments can be used as starting points for further searching.

    - Heuristic methods:

      - FASTA

      - BLAST

      - BLAST2

    FASTA

    Intro

    The FASTA algorithm is a heuristic method for string comparison.

    It compares a query string against a single text string.

    When searching the whole database for matches to a given query, we compare the query using

    the FASTA algorithm to every string in the database.

    Good local alignment is likely to have exact matching subsequences.

    The algorithm uses this property and focuses on segments in which there will be an absolute

    identity between the two compared strings.

    We can use the alignment Dot-Plot matrix for finding these identical regions.


    Method (Steps):

    1. Finding Hot Spots (Regions with exact match (ktup))

    2. Finding 10 Best Diagonal Runs (No indels)

    3. Evaluate Diagonal Runs (using substitution matrices, find Init1)

    4. Combine Good Diagonal Runs (Init N (Max. Weight Path in the Graph))

    5. Finding alternative local alignment (Dynamic Programming in a Band (opt score))

    6. Ranking

    Small Definitions:

    - Hot spots:

      - The first step of the algorithm is to determine all exact matches of length k (word size) between the two sequences, called hot spots

    - Ktup:

      - (short for k respective tuples) an integer parameter that specifies the length of the matching substrings

    - Diagonal runs:

      - A diagonal run is a set of hot spots that lie in a consecutive sequence on the same diagonal (not necessarily adjacent along the diagonal, i.e., spaces between these hot spots are allowed).
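The first two steps of the method, finding hot spots and picking the best diagonals, can be sketched like this. It is a simplification: diagonals are ranked here just by the number of hot spots on them, whereas FASTA scores diagonal runs with bonuses for hot spots and penalties for the spaces between them. Function names and the toy sequences are made up.

```python
from collections import defaultdict

def hot_spots(query, text, ktup):
    """All (i, j) with query[i:i+ktup] == text[j:j+ktup], via a lookup table."""
    index = defaultdict(list)
    for j in range(len(text) - ktup + 1):
        index[text[j:j + ktup]].append(j)
    return [(i, j)
            for i in range(len(query) - ktup + 1)
            for j in index.get(query[i:i + ktup], [])]

def best_diagonals(spots, top=10):
    """Rank diagonals (identified by j - i) by the number of hot spots on them."""
    counts = defaultdict(int)
    for i, j in spots:
        counts[j - i] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

spots = hot_spots("GATTACA", "TTGATTACAGG", ktup=3)
print(spots)
print(best_diagonals(spots))
```

All hot spots on one diagonal share the same offset j - i, which is why a single dictionary keyed by that offset is enough to group them.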

    Comments

    A larger ktup increases speed, since fewer hits are found, but it also decreases sensitivity for

    finding similar but not identical sequences, since exact matches of this length are required


    BLAST

    intro

    BLAST (Basic Local Alignment Search Tool)

    The BLAST algorithm was developed in 1990 (FASTA in 1985)

    The motivation for the development of BLAST was the need to increase the speed of FASTA by

    finding fewer and better hot spots.

    The idea was to integrate the substitution matrix in the first stage of finding the hot spots.

    Method (Steps):

    1. k-letter word list of the query sequence

    2. List the possible matching words

    3. Scan Database (For Seeding)

    4. Extend exact matches to High Scoring Segment Pairs

    5. List all HSP above cutoff score

    6. Evaluate Statistical Significance of HSP

    7. Combine HSP

    8. Show gapped local alignment
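Steps 1 and 3 to 5 can be sketched in a heavily simplified form. The assumptions are stated plainly: seeds are exact k-letter words only (step 2, the list of similar words scored with a substitution matrix, is left out), scoring is a plain match/mismatch scheme, and the extension stops once the score falls more than `x_drop` below the best score seen. The statistical evaluation and the gapped alignment are omitted, and all names and parameter values are hypothetical.

```python
def word_index(query, k):
    """Map every k-letter word of the query to its start positions."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return index

def extend_one_way(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk from (q, s) in direction step; return (best score gain, its length)."""
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # the score dropped too far below the best seen so far
        q += step
        s += step
    return best, best_len

def find_hsps(query, subject, k=3, cutoff=5, match=1, mismatch=-1, x_drop=3):
    """Return sorted (score, query start, subject start, length) of ungapped HSPs."""
    index = word_index(query, k)
    hsps = set()
    for s in range(len(subject) - k + 1):
        for q in index.get(subject[s:s + k], []):
            right, r_len = extend_one_way(query, subject, q + k, s + k, 1,
                                          match, mismatch, x_drop)
            left, l_len = extend_one_way(query, subject, q - 1, s - 1, -1,
                                         match, mismatch, x_drop)
            score = k * match + left + right
            if score >= cutoff:
                hsps.add((score, q - l_len, s - l_len, l_len + k + r_len))
    return sorted(hsps, reverse=True)

print(find_hsps("GATTACA", "TTGATTACAGG"))
```

Several seeds usually extend to the same segment pair, so the results are collected in a set to report each one once.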

    Small Definitions:

    - Segment pair:

      - Given two strings S1 and S2, a segment pair is a pair of equal-length substrings of S1 and S2, aligned without gaps.

    - A locally maximal segment is a segment whose alignment score (without gaps) cannot be improved by extending it or shortening it.

    - A maximum segment pair (MSP) in S1 and S2 is a segment pair with the maximum score over all segment pairs in S1, S2.

    - When comparing all the sequences in the database against the query, BLAST attempts to find all the database sequences that, when paired with the query, contain an MSP above some cutoff score S. We call such pairs HSPs (high-scoring segment pairs).

    Types of BLAST

    - Two-hit

      - The extension step typically accounts for 90% of BLAST's execution time

      - Key idea: do the extension only when there are two hits on the same diagonal within distance A of each other

      - To maintain sensitivity, lower the T parameter

        More single hits are found

      - But only a small fraction have an associated second hit

    - Gapped

      - Trigger a gapped alignment if the two-hit extension has a sufficiently high score

      - Run the DP process both forward and backward from the seed

      - Prune cells when the local alignment score falls a certain distance below the best score yet

    - PSI-BLAST

      - Use the results from a BLAST query to construct a profile matrix

      - Search the database with the profile instead of the query sequence

      - Iterate

    - Profile creation

      - The program initially operates on a single query sequence by performing a gapped BLAST search

      - Then, the program takes the significant local alignments (hits) found, constructs a multiple alignment and abstracts a position-specific scoring matrix (PSSM) from this alignment.

      - Steps:

        Take significant BLAST hits

        Make an alignment

        Construct profile
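The last step, constructing a profile from aligned hits, can be sketched as a log-odds matrix. This is a simplified illustration under assumptions of its own: a DNA alphabet, a uniform background, a fixed pseudocount of 1 and no sequence weighting, none of which is meant to reproduce what PSI-BLAST actually does. The names `build_pssm` and `score_segment` are hypothetical.

```python
import math

def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Column-wise log2-odds scores from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    pssm = []
    for col in range(length):
        counts = {a: pseudocount for a in alphabet}
        for seq in aligned:
            if seq[col] in counts:  # gap characters such as '-' are skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({a: math.log2(counts[a] / total / background) for a in alphabet})
    return pssm

def score_segment(pssm, segment):
    """Sum of the position-specific scores of a segment as long as the profile."""
    return sum(column[ch] for column, ch in zip(pssm, segment))

hits = ["ACGT", "ACGA", "ACGT"]  # toy alignment of significant hits
pssm = build_pssm(hits)
print(score_segment(pssm, "ACGT"), score_segment(pssm, "TTTT"))
```

Searching with the profile then means sliding it along each database sequence and scoring every window with `score_segment` instead of comparing against the single query.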
