sequence alignment algorithms – application to bioinformatics tool development
Dr.S.Parthasarathy, Bharathidasan Univ., Trichy
16 FEB 2006
Sequence Alignment Algorithms – Application to Bioinformatics Tool Development
Dr. S. Parthasarathy
Reader and Head, Department of Bioinformatics
Bharathidasan University, Tiruchirappalli – 620 024
(E-mail: [email protected])
Plan
  Introduction to Bioinformatics
  Sequence alignment algorithms
    Global alignment: Needleman-Wunsch algorithm
    Local alignment: Smith-Waterman algorithm
  PredictFold: predicting a fold for a protein sequence
    Methodology
    Algorithm, Coding & Tool Development
    Benchmarking
  Conclusions
Introduction
Why do we need Bioinformatics?
What is Bioinformatics?
Where is Bioinformatics used?
Why?
  Biological data explosion: how did it happen?
  Sequence databases are much larger than structure databases. Why so?
Introduction
Biological Data: Genome Projects
  Latest revolution
    On 26 June 2000, completion of the draft of the 'Human Genome' was announced: 'Genetic Code of Human Life is Cracked by Scientists'.
    The human genome contains 3.2 × 10^9 bp.
  Units of (genome) sequence length
    bp (base pairs)
    Mbp (mega base pairs) = 10^6 bp
    Gbp (giga base pairs) = 10^9 bp
    huge (human genome equivalent) = 3.2 Gbp
  Unit of genetic distance
    centiMorgan (cM): an arbitrary unit named for Thomas Hunt Morgan (1 cM corresponds to a recombination frequency of 0.01).
Introduction
Biological Data: Genome Projects
[Figure: two journal covers, dated 16 February 2001 and 15 February 2001.]
Biological Data: Recombinant DNA Technology
Old revolution
  1940 – Role of DNA as the genetic material was confirmed
  1953 – Discovery of the DNA structure by James Watson & Francis Crick
  1966 – Establishment of the genetic code
  1967 – DNA ligase was isolated (it joins two strands of DNA together): 'molecular glue'
  1970 – Isolation of a restriction enzyme: 'molecular scissors'
  1972 – Recombinant DNA molecules were generated at Stanford University, USA
  1973 – DNA fragments were joined to the plasmid pSC101 isolated from E. coli; they could replicate when introduced into E. coli
The discoveries of 1972 & 1973 triggered the biggest scientific revolution: genetic engineering.
Biological Data Explosion
  GenBank (National Center for Biotechnology Information, NCBI, USA): 44 Gbp of DNA and 40 million sequences (up to 2004)
  Protein Data Bank, PDB (Research Collaboratory for Structural Bioinformatics, RCSB, USA): 29,000 structures (2004)
  QUALITY of data: HIGH
    Experimental error in modern genomic sequencing is extremely low
  QUANTITY of data: HUGE
    With recombinant DNA technology and genomic sequencing, the size of sequence databases is increasing very rapidly
  SEQUENCE versus STRUCTURE databases
    Sequence databases are much larger than structure databases; this gap leads to Bioinformatics
What?
What is Bioinformatics?
Define Bioinformatics
Bioinformatics - Definition
Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological data. We use computer programs to make inferences from the biological data, to make connections among them and to derive useful and interesting predictions.
The marriage of biology and computer science has created a new field called 'Bioinformatics'. - Arthur M. Lesk
[Slide graphics: the word 'Bioinformatics'; the alignment recurrence F(i,j) = max { F(i-1,j-1) + s(xi,yj), F(i-1,j) – d, F(i,j-1) – d }; the letter strings PEPTIDESE QSEDITPEP; the DNA string atcggcatgcatcagtcatgcaactg.]
Biology: Basic Definitions
  Cell
    The building block of living organisms.
    Eukaryotic cells (or organisms) have the nucleus separated from the cytoplasm by a nuclear membrane, and the genetic material is borne on a number of chromosomes consisting of DNA and protein.
  Chromosome
    The physical basis of heredity.
    Deeply staining rod-like structures present within the nuclei of eukaryotes.
    Contains DNA and protein arranged in a compact manner.
    Replicates identically during cell division.
    The same number of chromosomes is present in the cells of a particular species (e.g. human: 22, X and Y).
Genome: Basic Definitions
  Genome
    A complete set of chromosomes inherited from one parent.
  Gene
    One of the units of inherited material carried on chromosomes. Genes are arranged in a linear fashion on DNA. Each represents one character, which is recognized by its effect on the individual bearing the gene in its cells. There are many thousands of genes in each nucleus.
  DNA (deoxyribonucleic acid)
    DNA is made up of FOUR bases: a, t, g, c (adenine, thymine, guanine, cytosine).
  Protein
    Protein is made up of TWENTY different amino acids: A, T, G, C, ... (Alanine, Threonine, Glycine, Cysteine, ...).
Central Dogma
  DNA → (transcription) → mRNA → (translation) → Protein
  DNA:     CCTGAGCCAACTATTGATGAA
  mRNA:    CCUGAGCCAACUAUUGAUGAA
  Protein: PEPTIDE
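The Central Dogma example can be reproduced in a few lines. The sketch below is illustrative only: it assumes the DNA string is the coding strand read from its first base, uses the standard genetic code, and the function names `transcribe` and `translate` are made up for this example.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA: thymine (T) is replaced by uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon ends translation
            break
        protein.append(residue)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

The seven codons CCU GAG CCA ACU AUU GAU GAA spell out the one-letter amino acid codes P, E, P, T, I, D, E, which is why the slide's example protein reads PEPTIDE.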
Genome Data: Human & Model Organisms
Most mapping and sequencing technologies were developed from studies of simpler non-human organisms.
  Non-human / model organisms
    Bacterium          Escherichia coli            4.6 Mbp
    Yeast              Saccharomyces cerevisiae   12.1 Mbp
    Fruit fly          Drosophila melanogaster   180.0 Mbp
    Roundworm          C. elegans                 95.5 Mbp
    Laboratory mouse   Mus musculus                3.0 Gbp
  Human: a more complex genome
    Human              Homo sapiens                3.2 Gbp
Genome Data: Human (Homo sapiens)
  Genome         1
  Chromosomes    23
  Genes / DNAs   ~ 30,000
  Nucleotides    3.2 × 10^9 bp
Bioinformatics in Genome Research
Data Collection and Interpretation
  Collecting and storing data
    Sequence generated by genome research will be used as the primary information source for human biology and medicine.
    The vast amount of data produced will first need to be collected, stored and distributed.
  Interpretation of data
    Recognizing where genes begin and end.
    Searching a database for a particular DNA sequence may uncover homologous sequences in a known gene from a model organism, revealing insights into the function of the corresponding human gene.
Understanding Gene Function
Correct protein function depends on the 3-D or folded structure the protein assumes in biological environments
Understanding protein structure will be essential in determining gene function
[Diagram relating Gene, Protein, Structure and Function.]
Where?
Where is Bioinformatics used?
What are the uses of Bioinformatics?
Applications of Bioinformatics
Bioinformatics Tasks
  Sequence analysis (protein sequences)
    Similarity & homology
      Pairwise local/global alignment
        • GCG – Seqlab & Seqweb
        • Scoring matrices - PAM, BLOSUM
      Database search: BLAST, FASTA
      Multiple alignment: ClustalW, PRINTS, BLOCKS
    Secondary structure prediction (from sequence)
      Proteins – α-helix, β-sheet, turn or coil
    Protein folding
Bioinformatics Tasks
  Structure analysis – experimental determination
    X-ray crystallography – three-dimensional coordinates (structure)
    Nuclear Magnetic Resonance (NMR)
    PDB – Protein Data Bank
    RasMol – molecular viewing software
  High-throughput crystallographic structure determination
    High-flux synchrotron radiation sources (data collection)
    Multi-wavelength anomalous diffraction (MAD) method (data interpretation)
  Bioinformatics – structure prediction
    Homology modelling – InsightII, SwissPDBViewer, Biosuite
    'ab initio' method – Monte Carlo simulation
  Protein structure classification
    SCOP – Structural Classification Of Proteins
    CATH – Class, Architecture, Topology, Homologous superfamily
    FSSP – Fold classification based on Structure-Structure alignment of Proteins, obtained by DALI (Distance-matrix ALIgnment)
Bioinformatics Tasks
  Protein engineering
    Mutations: alter a particular amino acid/base for a desired effect
    Site-directed mutagenesis: identify the potential sites where alterations can be made
  Applications
    Agricultural – genetically modified plants, vegetables, GM food
    Pharmaceutical – molecular-modelling-based drug design
    Medical – gene therapy
  DNA bending
    Application to genomes
    (Ref: M.G.Munteanu, K.Vlahovicek, S.Parthasarathy, I.Simon and S.Pongor, Rod Models of DNA: Sequence-dependent anisotropic elastic modelling of local phenomena, Trends in Biochemical Sciences, 23 (1998) 341-347)
Bioinformatics Tasks: Genomics & Proteomics
Genomics is the study of the structure, content, evolution and functions of genes in genomes.
Aims of Genomics
  To establish an integrated web-based database and research interface
  To assemble physical, genetic and cytological maps of the genome
  To identify and annotate the complete set of genes encoded within a genome
  To provide the resources for comparison with other genomes
Proteomics – Proteome
Proteome is the complete collection of proteins in a cell/tissue/organism at a particular time. Unlike genomes, which are stable over the lifetime of the organism, proteomes change rapidly as each cell responds to its changing environment and produces new proteins in different amounts.
Genome is a more stable entity. An organism has only one genome but many proteomes.
For an organism, there may be one body-wide proteome, about 200 tissue proteomes and about a trillion (~10^12) individual cell proteomes.
Proteomics – Definition
The study of proteomes, which includes determining the 3D shapes of proteins, their roles inside cells and the molecules with which they interact, and defining which proteins are present and how much of each is present at a given time.
Proteomics – Applications
  To correlate proteins on the basis of their expression profiles.
  To observe patterns in protein synthesis; changes in the observed patterns can be used as an indicator of the state of the cell and its gene expression.
  To characterize bacterial pathogens and to develop novel antimicrobials.
  To identify regions of the bacterial genome that encode pathogenic determinants.
  To develop drugs, and in toxicology – Structural Proteomics.
  As a tool for plant genetics and breeding.
Systems Biology
  Systems Biology is a new perspective and an emerging field of research in the post-genomic era.
  It aims at a system-level understanding of biological systems.
  It studies whole cells/tissues/organisms not by a traditional reductionist approach but by holistic means, in a reiterative attempt to model the complete cell/tissue/organism.
  It treats the system as an integrated and interacting network of genes, proteins and biochemical reactions which give rise to life.
Systems Biology
Sequence Alignment Algorithms
Similarity and Homology
Sequence Comparison - Issues
Types of alignments
Algorithms Used
Sequence similarity and homology
Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo. A new sequence therefore usually shows significant similarity to already known sequences, which is fortunate for computational sequence analysis.
Similarity – Measurement of resemblance and differences, independent of the source of resemblance.
Homology – The sequences and the organisms in which they occur are descended from a common ancestor.
If two related sequences are homologous, then we can transfer information about structure and/or function, by homology.
3-D Structure and Homology
3-D structure patterns (motifs) of proteins are much more evolutionarily conserved than amino acid sequences - This type of Homology search could prove more fruitful
Particular motifs may serve similar functions in several different proteins, information that would be valuable in genome analysis
Only a few protein motifs can be recognised at the sequence level
Development of more analytic capabilities to facilitate grouping protein sequences into motif families will make homology searches more useful
Sequence Comparison: Issues
  Types of alignment
    Global – end-to-end matching (Needleman-Wunsch)
    Local – matching of portions or subsequences (Smith-Waterman)
  Scoring system used to rank alignments
    PAM & BLOSUM matrices
  Algorithms used to find optimal (or good) scoring alignments
    Heuristic
    Dynamic Programming
    Hidden Markov Model (HMM)
  Statistical methods used to evaluate the significance of an alignment score
    Z-score, P-value and E-value
Substitution Matrices
  PAM (Point Accepted Mutation)
  BLOSUM (BLOcks SUbstitution Matrix)

  Relatedness   BLOSUM   PAM
  Close         90       40
  Default       62       250
  Distant       30       500
Types of Algorithms
  Heuristic
    A heuristic is an algorithm that yields reasonable results even if it is not provably optimal or lacks a performance guarantee.
    In most cases heuristic methods are very fast, but they make additional assumptions and will miss the best match for some sequence pairs.
  Dynamic Programming
    Dynamic programming finds optimal alignments under an additive alignment score (discussed in the slides that follow).
    This type of algorithm is guaranteed to find the optimal scoring alignment or set of alignments.
  HMM
    Based on probability theory; very versatile.
Global Alignment: Needleman-Wunsch Algorithm
Formula
$$
F(i,j) = \max \begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\
F(i-1,\,j) - d & \text{(H)} \\
F(i,\,j-1) - d & \text{(V)}
\end{cases}
$$
D, H and V label the three neighbouring cells from which F(i,j) can be reached: F(i-1,j-1) (D), F(i-1,j) (H) and F(i,j-1) (V).
Global Alignment: Needleman-Wunsch Algorithm
  Gap penalties
    Linear score: f(g) = – gd
    Affine score: f(g) = – d – (g-1) e
    d = gap open penalty, e = gap extend penalty, g = gap length
  Trace back
    Start from the value in the bottom-right corner and trace back to the top-left corner (i.e. the alignment always runs end to end).
  Algorithm complexity
    It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
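The recurrence, linear gap penalty and end-to-end trace back can be sketched as follows. This is a minimal illustration, not the code behind the talk: the match/mismatch scores stand in for a real substitution matrix such as PAM or BLOSUM, and the function name and default values are assumptions made for the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with a linear gap penalty d.

    Returns (score, aligned_x, aligned_y). The match/mismatch values are a
    placeholder for a substitution matrix s(xi, yj).
    """
    def s(a, b):
        return match if a == b else mismatch

    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # D
                F[i - 1][j] - d,                          # H
                F[i][j - 1] - d,                          # V
            )

    # Trace back from the bottom-right corner to the top-left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("GATTACA", "GATACA")
print(score)  # 4: six matches minus one gap of penalty 2
print(ax)
print(ay)
```

The full (n+1) x (m+1) matrix is kept so that the trace back can be done by re-checking the recurrence, which is where the O(nm) memory figure comes from.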
Local Alignment: Smith-Waterman Algorithm
Same as the global alignment algorithm, with TWO differences:
  1. F(i,j) takes the value 0 (zero) if all other options have a value less than 0.
  2. The alignment can end anywhere in the matrix: take the highest value of F(i,j) over the whole matrix and start the trace back from there.
Local Alignment: Smith-Waterman Algorithm
Formula
$$
F(i,j) = \max \begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\
F(i-1,\,j) - d & \text{(H)} \\
F(i,\,j-1) - d & \text{(V)} \\
0 & \text{(if all other values are } < 0\text{)}
\end{cases}
$$
D, H and V label the three neighbouring cells from which F(i,j) can be reached: F(i-1,j-1) (D), F(i-1,j) (H) and F(i,j-1) (V).
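A minimal sketch of the local variant follows, showing the two differences from global alignment: the extra 0 in the maximum and a trace back that starts at the highest-scoring cell and stops at the first 0. The match/mismatch scores, the gap penalty and the function name are illustrative assumptions, not values from the talk.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment with a linear gap penalty d.

    Returns (best_score, aligned_x, aligned_y) for one optimal local
    alignment. The match/mismatch values stand in for s(xi, yj).
    """
    def s(a, b):
        return match if a == b else mismatch

    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                        # restart the alignment
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # D
                F[i - 1][j] - d,                          # H
                F[i][j - 1] - d,                          # V
            )
            if F[i][j] > best:
                best, best_i, best_j = F[i][j], i, j

    # Trace back from the highest cell until a cell with value 0 is reached.
    ax, ay = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
# (14, 'GATTACA', 'GATTACA')
```

Only the shared core GATTACA is reported; the unrelated flanking residues, which would drag a global alignment score down, are simply left out of the local alignment.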
Web based server development
  Design the web page to get the data
  Use a cgi-bin or Perl script to parse the submitted data
  Invoke the corresponding program to get the appropriate results
  Send the results either by e-mail or to the web page directly
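The "parse the submitted data" step might look like the sketch below. The talk mentions cgi-bin or Perl scripts; Python is used here only for consistency with the other examples. The form field name `sequence`, the function name and the validation rules are all hypothetical, since the actual server's form layout is not described.

```python
from urllib.parse import parse_qs

# The twenty standard amino acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_submission(form_body, field="sequence"):
    """Extract a protein sequence from a URL-encoded form body.

    Accepts either a bare sequence or FASTA text (header lines starting
    with '>' are dropped) and rejects anything that is not a plain
    protein sequence.
    """
    values = parse_qs(form_body).get(field)
    if not values:
        raise ValueError(f"form field {field!r} is missing or empty")
    lines = values[0].splitlines()
    sequence = "".join(
        line.strip() for line in lines if not line.startswith(">")
    ).upper()
    unknown = set(sequence) - AMINO_ACIDS
    if not sequence or unknown:
        raise ValueError(f"not a plain protein sequence: {sorted(unknown)}")
    return sequence


# '%3E' is '>' and '%0A' is a newline in URL encoding.
print(parse_submission("sequence=%3E256B%0AADLEDNMETL%0Andnlk"))
# ADLEDNMETLNDNLK
```

Validating the input before invoking the alignment program keeps malformed submissions from reaching the back end at all.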
Application to Bioinformatics Tool Development
To predict a fold for a protein sequence
To predict a fold for a protein sequence
  To predict possible folds for a given protein sequence whose structure is not known
  To develop a fold recognition technique / tool that is sensitive in detecting folds of given protein sequences in the twilight zone (sequences sharing less than 25% identity)
  To apply the fold recognition strategy to genomic annotation
‘Twilight Zone’ sequences: Example
Cytochrome Sequences
256b
>256B:A CYTOCHROME $B562 (OXIDIZED) - CHAIN A
ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLEDKSPDSPEMKD
FRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR
>256B:B CYTOCHROME $B562 (OXIDIZED) - CHAIN B
ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLEDKSPDSPEMKD
FRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR
2ccy
>2CCY:A CYTOCHROME $C(PRIME) - CHAIN A
QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMVAKLAPIGWAK
GTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAAAKAGPDALKAQAAATGKVCKA
CHEEFKQD
>2CCY:B CYTOCHROME $C(PRIME) - CHAIN B
QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMVAKLAPIGWAK
GTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAAAKAGPDALKAQAAATGKVCKA
CHEEFKQD
Example: Sequence similarity
lalign output for 256b & 2ccy follows …
Example: Cytochrome Structures
CYTOCHROME STRUCTURES (seq. similarity 24%)
[Figure: structures of 256b and 2ccy.]
Goals
Exploration of suitable fold recognition techniques that are sensitive in detecting similar folds despite low sequence similarity
Identification of functional motifs in proteins at sequence (1D) and structure (3D) level
Development of a protocol that aids in the rapid classification and annotation of genomic data based on functional motifs
Methodology
Reduction of 3D-structure to 1D-environment string. Environment at each residue position is a function of local secondary structure and extent of exposure to the solvent (based on 3D-1D profile method developed by Eisenberg et al., 1991).
Extract residue environment profiles of the available protein structures.
A scoring matrix is generated from a library of profiles. Each matrix element is the information value of a residue in the given environment.
A library of environment strings is created for the available protein fold structures.
The probe sequence is queried against this library to look for best matches.
Workflow
[Flowchart with boxes labelled: Fold library, 3D-Profiles, 1D-Environment Sequence, Scoring Table, New Sequence, PredictFold, FOLD PREDICTION, Annotate New sequence.]
Residue Environments
[Figure labels: Exposed, Partially buried, Buried; Helix, Strand, Coil.]
Residue Environments
The residue environments are described by
  1. the area (A) of the residue buried in the protein
  2. the fraction (f) of side-chain area that is covered by polar atoms (O and N)
  3. the local secondary structure
Residue Environments
  CLASS             Area (A), Å^2     FRACTION (f)
  BURIED 1 (B1)     A > 114           f < 0.45
  BURIED 2 (B2)     A > 114           0.45 < f < 0.58
  BURIED 3 (B3)     A > 114           f > 0.58
  PARTIAL 1 (P1)    40 < A < 114      f < 0.67
  PARTIAL 2 (P2)    40 < A < 114      f > 0.67
  EXPOSED (E0)      A < 40            f > 0.67
Residue Environment Classes
  We have 6 classes based on the extent of exposure to solvent.
  We have 3 classes based on secondary structure: Alpha Helix (A), Beta Sheet (B) & Coil (C).
  Total: 6 x 3 = 18 environments
    B1A, B1B, B1C, B2A, B2B, B2C, B3A, B3B, B3C,
    P1A, P1B, P1C, P2A, P2B, P2C, E0A, E0B, E0C.
  For example
    B1A - Buried 1, Alpha Helix
    P2B - Partially Buried 2, Beta Sheet
    E0C - Exposed 0, Coil
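Assigning one of the 18 environment labels to a residue can be sketched as below. The thresholds 40 and 114 Å^2 and 0.45, 0.58 and 0.67 are the ones listed in the talk; how values exactly on a boundary are treated, and the choice to assign E0 from the buried area alone, are assumptions made for this example. The function names are hypothetical.

```python
def exposure_class(area, fraction):
    """Map buried area A (in square angstroms) and polar fraction f to one
    of the six exposure classes.

    Assumptions: boundary values fall into the higher class, and the
    exposed class E0 is decided by area alone.
    """
    if area > 114:
        if fraction < 0.45:
            return "B1"
        if fraction < 0.58:
            return "B2"
        return "B3"
    if area > 40:
        return "P1" if fraction < 0.67 else "P2"
    return "E0"


def environment(area, fraction, secondary_structure):
    """Combine exposure class and secondary structure (A, B or C)."""
    if secondary_structure not in ("A", "B", "C"):
        raise ValueError("secondary structure must be 'A', 'B' or 'C'")
    return exposure_class(area, fraction) + secondary_structure


print(environment(120.0, 0.30, "A"))  # B1A
print(environment(80.0, 0.70, "B"))   # P2B
print(environment(20.0, 0.90, "C"))   # E0C
```

Applying `environment` to every residue of a known structure turns the 3D structure into the 1D environment string that is later aligned against the probe sequence.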
Scoring Table
The scoring table used in this case is a 20 x 18 matrix, constructed from a statistical analysis of the profile library (consisting of 1200 protein structures) provided by the PROFILES_3D module of Insight II (Accelrys Inc.).
The scores S_ij are calculated using the formula
$$
S_{ij} = 100 \times \ln\left[\frac{P(i:j)}{P_i}\right]
$$
where P(i : j) is the probability of finding residue i in the environment j and P_i is the overall probability of finding residue i in any environment.
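As a worked illustration of the score formula, the sketch below estimates both probabilities from counts and applies S_ij = 100 x ln[P(i:j)/P_i]. The counts are invented for the example, and estimating probabilities as simple frequencies and rounding to an integer are assumptions, not details taken from PROFILES_3D.

```python
import math


def information_score(count_ij, count_j, count_i, total):
    """S_ij = 100 * ln(P(i:j) / P_i), with probabilities taken as frequencies.

    count_ij: occurrences of residue i in environment j
    count_j:  all residues observed in environment j
    count_i:  occurrences of residue i in any environment
    total:    all residues observed
    """
    p_i_in_j = count_ij / count_j
    p_i = count_i / total
    return round(100 * math.log(p_i_in_j / p_i))


# Hypothetical counts: a residue that makes up 30% of one environment but
# only 10% of all residues is three times over-represented there.
print(information_score(30, 100, 1000, 10000))  # 110
# The same residue at 5% of another environment is under-represented.
print(information_score(5, 100, 1000, 10000))   # -69
```

Positive scores therefore mark residue types that favour an environment, negative scores mark residue types that avoid it, and a score of 0 means no preference.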
Scoring Table
The scoring table contains measure of the compatibility of the 20 amino acids with the 18 environmental classes.
The individual matrix elements are propensities (information values) for the amino acid residues.
Scoring Table

Class    A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
B1A    -77  -43 -248 -215  128 -222  -34  111 -137  130  126 -176 -156 -138 -180 -243 -172   74  111   27
B1B   -105  -45 -180 -159   96 -235 -226  150 -304  107   51 -218  -77 -203 -152 -256 -127  118   92   17
B1C    -54  -59 -263 -201  140 -278  -61   93 -278  106   91 -261   59  -84 -235 -299 -141  100   96   52
B2A    -65   15  -80  -58   87 -204   82   55  -94   71  102  -48  -97   16  -11 -133  -67   41  101   86
B2B   -152  -72 -208  -20  132 -222   -5  107  -83   36   49  -26  -86  -79  -41  -82 -114   71   83  130
B2C    -81  -62 -197 -113  104 -171   54   81 -212   77  100  -56   -7  -87  -44 -123 -103   66  162  114
B3A    -44 -138  -43    9  -22 -109   61   -2   56   16   87   -7 -111   16  110 -101  -69  -29   86   50
B3B   -108  -79  -76  -42   37 -229   80   26   35   14  -68  -33   -1   52   84  -71  -10   16    7  109
B3C    -87  -98  -83  -46   71  -61  104  -54    8   29   23    9  -11   10   71  -61  -48  -40  112  125
P1A     76   95  -28  -43  -85  -46  -91   -6  -50  -30  -42  -58  -41  -32  -51   47   39   30 -129  -88
P1B     59  128  -61  -68  -61  -22  -53    9 -201  -81  -40  -92  -65 -238  -89   49   95   44   34   -9
P1C     49  129   34  -59 -129  -39 -121  -28  -72  -33  -90  -26   64  -57  -88   59   55   -9 -125 -140
P2A     -2  -70   29   62 -135  -58   17  -59   66  -46  -27   -2  -25   62   56  -38  -13  -62 -109  -55
P2B    -52  -87   -3   41  -56  -87  -49  -35   55 -133  -76    0 -101   10   19   49   79    8  -71  -30
P2C    -25  -81   51   28  -84  -42   20  -94   47  -68  -83   51   44   25   24   17    8  -74  -42  -43
E0A     44  -17   44   60 -181   63   -6 -236    7 -137  -90   32    5   29  -20   16  -20 -125 -126 -170
E0B     14   -4  -30  -37  -83  175  -76 -139 -154 -160  -62    1  -88  -12 -112   65  -17 -166   81   -3
E0C     14  -35   23    4 -163  110  -41 -163  -10 -114 -130   41   25   -3  -41   34    8  -80 -206 -104
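To show how the table is read, the sketch below scores a short sequence placed residue by residue onto an environment string, without gaps. Only two rows of the table (B1A and E0A) are copied in; the three-residue sequences and the function name are made up for the illustration, and the real method aligns with gaps rather than threading position by position.

```python
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"  # column order of the scoring table

# Two rows copied from the scoring table; the other 16 are omitted here.
SCORES = {
    "B1A": [-77, -43, -248, -215, 128, -222, -34, 111, -137, 130,
            126, -176, -156, -138, -180, -243, -172, 74, 111, 27],
    "E0A": [44, -17, 44, 60, -181, 63, -6, -236, 7, -137,
            -90, 32, 5, 29, -20, 16, -20, -125, -126, -170],
}


def compatibility(sequence, environments):
    """Sum of table scores for each residue in its assigned environment."""
    if len(sequence) != len(environments):
        raise ValueError("sequence and environment string differ in length")
    return sum(
        SCORES[env][RESIDUES.index(res)]
        for res, env in zip(sequence, environments)
    )


helix_positions = ["B1A", "E0A", "E0A"]  # buried, exposed, exposed (all helix)
print(compatibility("LEK", helix_positions))  # 130 + 60 + 7 = 197
print(compatibility("KEL", helix_positions))  # -137 + 60 - 137 = -214
```

Leucine scores well in the buried helical position and poorly when exposed, while lysine shows the opposite preference, so swapping the two turns a favourable placement into an unfavourable one.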
Fold Library
  1565 functional forms
  Scan PDB to identify all the structures having these folds
  Identify a representative structure with resolution 2.5 Å or better
  Quality of the structure (occupancy, R-factor, stereochemistry)
  968 chains
DALI / FSSP Fold Library
  DALI: http://www.ebi.ac.uk/dali
  Touring protein fold space with DALI/FSSP. Lisa Holm and Chris Sander, Nucleic Acids Research, (1998), 26, 316-319
  Mapping the Protein Universe, Lisa Holm and Chris Sander, Science, (1996), 273, 595-602
Sequence Comparison: Details
  Type of alignment: Local (portions or subsequences matching), Smith-Waterman algorithm
  Scoring table: 3D-1D matrix
  Algorithm used: Dynamic Programming
  Alignment score: Z-score
Local Alignment: Smith-Waterman Algorithm
Formula
$$
F(i,j) = \max \begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\
F(i-1,\,j) - d & \text{(H)} \\
F(i,\,j-1) - d & \text{(V)} \\
0 & \text{(if all other values are } < 0\text{)}
\end{cases}
$$
D, H and V label the three neighbouring cells from which F(i,j) can be reached: F(i-1,j-1) (D), F(i-1,j) (H) and F(i,j-1) (V).
Gap Penalties
  Linear score: f(g) = – gd
  Affine score: f(g) = – d – (g-1) e
  d = gap open penalty, e = gap extend penalty, g = gap length
  Gap penalty values used: d = 500, e = 50
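The two gap-scoring schemes are easy to compare numerically. The sketch below plugs in the values quoted in the talk (d = 500, e = 50); the function names are chosen for this example only.

```python
def linear_gap(g, d=500):
    """Linear gap score f(g) = -g*d: every gap position costs d."""
    return -g * d


def affine_gap(g, d=500, e=50):
    """Affine gap score f(g) = -d - (g-1)*e: opening costs d, each
    further position costs the smaller extension penalty e."""
    return -d - (g - 1) * e


for g in (1, 3, 10):
    print(g, linear_gap(g), affine_gap(g))
# 1 -500 -500
# 3 -1500 -600
# 10 -5000 -950
```

With an affine score a single long gap is much cheaper than several short ones, which matches the way insertions and deletions tend to occur in loops between secondary structure elements.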
Local Alignment
  Trace back
    The alignment can end anywhere in the matrix.
    Take the highest value of F(i,j) over the whole matrix and start the trace back from there.
  Algorithm complexity
    It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
Significance of an Alignment Score
  Statistical methods used to evaluate the significance of an alignment score: Z-score, P-value and E-value
  Z-score = (score – mean) / std. dev.
    Measures how unusual our original match is. Values of Z ≥ 5 are significant.
  P-value measures the probability that the alignment is no better than random. (Z and P depend on the distribution of the scores.)
    P ≤ 10^-100: exact match.
  E-value is the expected number of sequences that give the same Z-score or better. (E = P x size of the database)
    E ≤ 0.02: sequences probably homologous.
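The Z-score and E-value definitions translate directly into code. In the sketch below the background scores and the P-value are invented for illustration; using the population standard deviation (`statistics.pstdev`) is an assumption, since the talk does not say which estimator was used, and 968 is simply the size of the fold library taken as the database size.

```python
from statistics import mean, pstdev


def z_score(score, background_scores):
    """Z = (score - mean) / std. dev. of a background score distribution."""
    return (score - mean(background_scores)) / pstdev(background_scores)


def e_value(p_value, database_size):
    """E = P x size of the database."""
    return p_value * database_size


# Hypothetical scores of a probe against unrelated library entries.
background = [10, 12, 8, 11, 9]
z = z_score(20, background)
print(round(z, 2), z >= 5)   # 7.07 True

# Hypothetical P-value against a library of 968 folds.
e = e_value(1e-5, 968)
print(e <= 0.02)             # True
```

Because E scales with the database size, the same P-value becomes less convincing as the library grows, which is why E rather than P is usually reported for database searches.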
Benchmarking
All 968 proteins in the fold library were profiled on each of the other members
A histogram indicating the rank and the number of sequences which got the self score as the highest, is shown in Figure.
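The self-score ranking behind this benchmark can be sketched as follows. The fold identifiers and scores are invented toy data, and the function names are hypothetical; the sketch only shows how a rank histogram of this kind could be tallied, not how the published numbers were produced.

```python
from collections import Counter


def self_rank(scores, self_id):
    """Rank (1 = best) of the self score among all library scores."""
    self_score = scores[self_id]
    return 1 + sum(1 for value in scores.values() if value > self_score)


def rank_histogram(all_scores):
    """all_scores maps each probe id to its {library id: score} dictionary."""
    return Counter(self_rank(scores, probe) for probe, scores in all_scores.items())


# Toy data: three probes scored against a three-member library.
toy = {
    "foldA": {"foldA": 900, "foldB": 300, "foldC": 120},
    "foldB": {"foldA": 450, "foldB": 400, "foldC": 90},
    "foldC": {"foldA": 50, "foldB": 75, "foldC": 610},
}
print(rank_histogram(toy))  # Counter({1: 2, 2: 1})
```

A well-behaved method should put nearly all probes at rank 1, since each sequence is expected to be most compatible with its own structure.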
Benchmarking
[Figure: histogram of the number of sequences (y-axis, 0 to 1000) against the rank of the self score (x-axis, categories 1 to 8); bar labels 797, 63, 3, 3, 2, 17, 29, 54.]
Benchmarking
  Report
    797 retain the self as the highest score
    63 report the self to have the second highest score
    About 100 proteins have ranks between 5 and 100
  Limitations
    Prediction is restricted to the 968 folds in the library
    The algorithm is insensitive to partially folded sequences
    Specific to globular proteins and not for membrane proteins
    Sequences that fold in the presence of cofactors and ligands are not accounted for
Web based server development
  Design the web page to get the data
  Use a cgi-bin or Perl script to parse the submitted data
  Invoke the corresponding program to get the appropriate results
  Send the results either by e-mail or to the web page directly
  Prepare a 'user manual' to describe the salient features of the server
Conclusions
PredictFold – A program to predict possible folds for a new protein sequence based on the 3D-1D profile method
Benchmarking results show the reliability of the method
There is considerable scope for further improvement
Future Directions
To update the fold library by including more known folds
To use the predicted secondary structure information of the given sequence also
To optimise the source code for efficient, automatic handling of genome sequences
To combine results from other algorithms (ORF, HMM, etc.) to detect remote homologs
To develop & maintain a web-based server for fold recognition
BT versus IT
Bioinformatics, including Biotechnology (BT), requires a lot of Information Technology (IT) skills for genomic annotation projects
Bioinformatics is one of the potential areas for IT professionals also
Genome Projects will be the next huge task for IT industries (like the Y2K problem in the past)
BT will take on IT soon … in the near future …
Conclusions
  Developing Web based Bioinformatics tools
    Develop/modify useful algorithms
    Generate computer source codes
    Create/maintain Web based server
  Using existing Web based tools efficiently
  Ethical issues
    Bioethics & Biosafety: always ensure that no bioinformatics tool harmful to the environment & society is developed or used by you
    Cloning of humans, Terminator technology, GM Food, etc.
References (latest)
  Arthur M. Lesk, Introduction to Bioinformatics, Oxford University Press, New Delhi (2003).
  D. Higgins and W. Taylor (Eds), Bioinformatics: Sequence, Structure and Databanks, Oxford University Press, New Delhi (2000).
  R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge Univ. Press, Cambridge, UK (1998).
  A. Baxevanis and B. F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, (Third Edition) Wiley-Interscience, Hoboken, NJ (2005).
  G. Gibson and S. V. Muse, A Primer of Genome Science, Sinauer Associates, USA (2002).
  N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, Ane Books, New Delhi (2005).
  Michael S. Waterman, Introduction to Computational Biology, Chapman & Hall (1995).
  J. A. Glasel and M. P. Deutscher (Eds), Introduction to Biophysical Methods for Protein and Nucleic Acid Research, Academic Press, New York (1995).
  D. S. T. Nicholl, An Introduction to Genetic Engineering, (Second Edition) Cambridge Univ. Press, UK (2002).
References
  3D-1D Profile method
    J. U. Bowie, R. Luthy & D. Eisenberg, Science, 253, 164-170 (1991).
  Ostensible Recognition of Folds (ORF) method
    Rajeev Aurora and George D. Rose, Proc. Natl. Acad. Sci. (USA), 95(6), 2818-2823 (1998).
  Superfamily Hidden Markov Model (SHMM) method
    A. Krogh, M. Brown, I. S. Mian, K. Sjolander and D. Haussler, J. Mol. Biol. 235(5), 1501-31 (1994).
Important Bioinformatics Resources
  Databases & Tools
    NCBI, NIH - www.ncbi.nlm.nih.gov
    EMBL, EBI - www.ebi.ac.uk
    ExPasy, Swiss - www.expasy.org
    DDBJ - www.ddbj.nig.ac.jp
    PDB - www.rcsb.org/pdb
  Software
    Accelrys - www.accelrys.com/products (GCG, Insight II, Cerius II, Discovery Studio)
    TCS - www.atc.tcs.co.in/biosuite/ (BIOSUITE)
    Jalaja Technologies - www.jalaja.com (GENOCLUSTER)
Thank You