Dr. S. Parthasarathy, NIT, Trichy
12/04/23
An introduction to BIOinformatics AlgoRITHMS
Dr. S. Parthasarathy
National Institute of Technology
Tiruchirappalli – 620 015
(E-mail: [email protected])
Plan
Introduction
Overview of Bioinformatics
Bioinformatics Algorithms
  Pairwise Sequence Alignment
  Database Search
PickFold – Sequence to Protein Fold
Future Perspective of Biological Crisis Management – Bioinformatics point of view
Conclusion
Introduction: Biological Data
On 26 June 2000 - announcement of the completion of the draft of the ‘Human Genome’
The human genome contains 3.2 x 10^9 bps
Units of (genome) sequence length
  bps (base pairs)
  Mbps (mega base pairs) = 10^6 bps
  Gbps (giga base pairs) = 10^9 bps
  huge (human genome equivalent) = 3.2 Gbps
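As a small illustration of these units, the following Python sketch converts a length in base pairs into human genome equivalents. The constant and function names are chosen here for illustration only; the 16 Gbps figure is the GenBank size quoted later in the slides.

```python
# Unit sizes in base pairs, as defined on the slide.
BPS_PER_MBPS = 10**6
BPS_PER_GBPS = 10**9
BPS_PER_HUGE = 3_200_000_000  # one "huge" = one human genome equivalent (3.2 Gbps)


def to_huge(bps):
    """Convert a sequence length in base pairs to human genome equivalents."""
    return bps / BPS_PER_HUGE


# GenBank at 16 Gbps, expressed in human genome equivalents.
genbank_in_huge = to_huge(16 * BPS_PER_GBPS)
```

Under these definitions, a 16 Gbps database corresponds to five human genome equivalents.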
Biological Data Explosion
[Figure: two charts, "Growth of GenBank" (cumulative size in Mbps by year, 1991-2001) and "Growth of Protein Data Bank" (cumulative size by year, 1991-2001).]
Biological Data Explosion
GenBank, NCBI, USA --- 16 Gbps
  GenBank, National Center for Biotechnology Information, USA
PDB, RCSB, USA --- 16,000 structures
  PDB, Research Collaboratory for Structural Bioinformatics, USA
QUALITY - HIGH
  Experimental error in modern genomic sequencing is extremely low
QUANTITY - HUGE
  With genomic sequencing and recombinant DNA technology, the size of sequence databases is increasing very rapidly.
Bioinformatics - Definition
F(i,j) = max {
  F(i-1,j-1) + s(x_i,y_j),
  F(i-1,j) - d,
  F(i,j-1) - d
}
Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological data. We use computer programs to make inferences from the biological data, to make connections among them and to derive useful and interesting predictions.
The marriage of biology and computer science has created a new field called ‘Bioinformatics’.
Bioinformatics Goals
To understand integrative aspects of the biology of organisms, viewed as coherent complex structures
To interrelate sequences, 3-D structures, interactions and functions of proteins, nucleic acids and their complexes
To study the evolution of biological systems
To support applications in agricultural, pharmaceutical and other scientific fields
Biological Systems: Overview
[Diagram of the levels of biological systems: BIOSPHERE, ECOSYSTEMS, SPECIES, ORGANISMS, CELLS]
Biology: Basic Definitions
Cell - the building block of living organisms
  Eukaryotic cells (or organisms) have the nucleus separated from the cytoplasm by a nuclear membrane, and the genetic material borne on a number of chromosomes consisting of DNA and protein
Chromosome - the physical basis of heredity
  Deeply staining rod-like structures present within the nuclei of eukaryotes
  Contain DNA and protein arranged in a compact manner
  Replicate identically during cell division
  The same number of chromosomes is present in the cells of a particular species (e.g. human: 22 pairs of autosomes plus the sex chromosomes X and Y)
Genome: Basic Definitions
Gene - one of the units of inherited material carried on chromosomes
  Genes are arranged in a linear fashion on DNA. Each represents one character, which is recognized by its effect on the individual bearing the gene in its cells. There are many thousands of genes in each nucleus.
Genome - a set of chromosomes inherited from one parent
DNA - deoxyribonucleic acid, made up of FOUR bases
  a t g c – adenine, thymine, guanine, cytosine
Proteins - made up of TWENTY different amino acids
  A T G C … – Alanine, Threonine, Glycine, Cysteine, …
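The two alphabets can be written down directly. The sketch below is illustrative only: the names `DNA_BASES`, `AMINO_ACIDS`, `is_dna` and `is_protein` are assumptions made here, and the twenty letters are the standard one-letter amino acid codes.

```python
# The four DNA bases and the twenty standard amino acid one-letter codes.
DNA_BASES = set("atgc")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_dna(sequence):
    """True if the non-empty sequence uses only the four DNA bases."""
    return len(sequence) > 0 and set(sequence.lower()) <= DNA_BASES


def is_protein(sequence):
    """True if the non-empty sequence uses only the twenty amino acid codes."""
    return len(sequence) > 0 and set(sequence.upper()) <= AMINO_ACIDS
```

Because A, T, G and C are also amino acid codes (Alanine, Threonine, Glycine, Cysteine), a string such as "ATGC" passes both checks, so the sequence type usually has to be known from context.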
Bioinformatics Tasks
Sequence Analysis
  Similarity & Homology – pairwise local/global alignment
    • Scoring matrices - PAM, BLOSUM
  Database search
    BLAST, FASTA, GCG
  Multiple alignment
    ClustalW, PRINTS, BLOCKS
Secondary Structure Prediction
  Proteins – α-helix, β-sheet, turn or coil
Protein Folding
Bioinformatics Tasks
Structure analysis
  X-ray crystallography – 3-dimensional coordinates – structure
  PDB – Protein Data Bank
  RasMol – molecular viewing software
Protein structure databases
  SCOP - Structural Classification Of Proteins
  CATH - Class, Architecture, Topology, Homologous superfamily
  FSSP - Fold classification based on Structure-Structure alignment of Proteins – obtained by DALI (Distance-matrix ALIgnment)
Bioinformatics Tasks
Protein Engineering
  Mutations
    Alter a particular amino acid/base for a desired effect
  Site-directed mutagenesis
    Identify the potential sites where alterations can be made
DNA Bending
  Application to genomes
Sequence similarity, homology and alignments
Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo. As a result, a new sequence often shows significant similarity to already known sequences.
Similarity – Measurement of resemblance and differences, independent of the source of resemblance.
Homology – The sequences and the organisms in which they occur are descended from a common ancestor.
If two related sequences are homologous, then we can transfer information about structure and/or function, by homology.
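One common way to put a number on similarity is percent identity over an existing alignment. The sketch below is one possible definition, not the only one in use: it assumes the two sequences are already aligned to equal length, uses "-" as the gap character, and ignores gap columns. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of non-gap alignment columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

Note that this measures similarity only. Whether two similar sequences are homologous is an inference about common ancestry that the number alone does not prove.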
Sequence Comparison Issues
Types of alignment
  Global – end-to-end matching (Needleman-Wunsch)
  Local – portions or subsequences matching (Smith-Waterman)
Scoring system used to rank alignments
  PAM & BLOSUM matrices
Algorithms used to find optimal (or good) scoring alignments
  Heuristic
  Dynamic Programming
  Hidden Markov Model (HMM)
Statistical methods used to evaluate the significance of an alignment score
  Z-score, E-value, etc.
Substitution Matrices
PAM (Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)

Sequence relationship   Close   Default   Distant
BLOSUM                  90      62        30
PAM                     40      250       500
Types of Algorithms
Heuristic
  A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee.
  In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs.
Dynamic Programming
  The algorithm for finding optimal alignments, given an additive alignment score, by building the solution up from smaller subproblems (discussed in the following slides).
  These types of algorithms are guaranteed to find the optimal scoring alignment or set of alignments.
HMM
  Based on probability theory – very versatile.
Global Alignment: Needleman-Wunsch Algorithm
Formula
  F(i,j) = max {
    F(i-1,j-1) + s(x_i,y_j),   [D, diagonal move]
    F(i-1,j) - d,              [H, horizontal move]
    F(i,j-1) - d               [V, vertical move]
  }
Cells that F(i,j) depends on:
  F(i-1,j-1) (D)   F(i,j-1) (V)
  F(i-1,j) (H)     F(i,j)
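The recurrence can be sketched in Python as a matrix fill. This is a minimal illustration under stated assumptions: s(x_i, y_j) is replaced by a simple match/mismatch score rather than a PAM or BLOSUM matrix, the gap penalty d is linear, and the function name and default values are chosen here for the example.

```python
def nw_score_matrix(x, y, match=1, mismatch=-1, d=2):
    """Fill the Needleman-Wunsch matrix F for sequences x and y.

    Assumed scoring: s(a, b) = match if a == b else mismatch; linear gap penalty d.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary cells: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # D: align x_i with y_j
                F[i - 1][j] - d,      # H: align x_i with a gap
                F[i][j - 1] - d,      # V: align y_j with a gap
            )
    return F
```

The optimal global alignment score is the value in the last cell, `F[len(x)][len(y)]`.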
Global Alignment: Needleman-Wunsch Algorithm
Gap penalties
  Linear score: f(g) = - g d
  Affine score: f(g) = - d - (g-1) e
  d = gap open penalty, e = gap extend penalty, g = gap length
Traceback
  Start from the value in the bottom-right corner and trace back to the top-left corner (i.e. always align end to end).
Algorithm complexity
  It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
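The gap penalty formulas and the traceback step can be sketched as follows. This is an illustrative sketch, not the lecture's own code: the scoring is an assumed match/mismatch scheme, the alignment itself uses only the linear penalty (an affine-gap alignment needs a more elaborate recurrence that is not shown here), and all names are hypothetical.

```python
def linear_gap(g, d):
    """Linear gap score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score f(g) = -d - (g-1)*e (d = gap open, e = gap extend)."""
    return -d - (g - 1) * e


def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with a linear gap penalty; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = linear_gap(i, d)
    for j in range(1, m + 1):
        F[0][j] = linear_gap(j, d)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback: start in the bottom-right corner and walk back to the top-left.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

Both the matrix fill and the stored matrix grow with the product of the two sequence lengths, which is where the O(nm) time and memory figures come from.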
Local Alignment: Smith-Waterman Algorithm
Same as the global alignment algorithm, with TWO differences.
1. F(i,j) takes the value 0 (zero) if all other options have a value less than 0.
2. An alignment can end anywhere in the matrix.
   Take the highest value of F(i,j) over the whole matrix and start the traceback from there.
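The two differences can be seen in a short Python sketch. As with any such example, the details are assumptions made for illustration: a match/mismatch score stands in for a substitution matrix, the gap penalty is linear, and the traceback stops when it reaches a cell with value 0, which is the usual convention for local alignment.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment; returns (best_score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Difference 1: the score is never allowed to drop below zero.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: traceback starts at the highest-scoring cell, not the corner,
    # and stops as soon as a cell with value 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

For two sequences that share only a short common region, the function returns just that region instead of forcing an end-to-end alignment.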
Sequence Database Search
Heuristic sequence database searching packages
  BLAST & FASTA
Significance of score
  Z-score = (score – mean)/std. dev.
    Measures how unusual our original match is. Z ≥ 5 is significant.
  P-value measures the probability that the alignment is no better than random. (Z and P depend on the distribution of the scores.)
    P ≤ 10^-100: exact match.
  E-value is the expected number of sequences that give the same Z-score or better. (E = P x size of the database)
    E ≤ 0.02: sequences probably homologous
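The Z-score and E-value formulas are simple enough to compute directly. The sketch below assumes that the mean and standard deviation come from a list of background scores (for example, scores obtained against unrelated or shuffled sequences); how that background is produced is an assumption here and is not specified on the slide. Function names are illustrative.

```python
from statistics import mean, pstdev


def z_score(score, background_scores):
    """Z-score = (score - mean) / std. dev. of the background score distribution."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return (score - mu) / sigma


def e_value(p_value, database_size):
    """E-value = P x size of the database."""
    return p_value * database_size
```

For example, a score of 20 against background scores of 8, 12, 8 and 12 (mean 10, standard deviation 2) gives a Z-score of 5, which sits right at the significance threshold quoted above.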
Web based server development
Design the web page to get the data
Use a cgi-bin or Perl script to parse the submitted data
Invoke the corresponding program to get the appropriate results
Send the results either by e-mail or to the web page directly
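The parse, invoke and respond steps can be sketched in a few lines. The slide names cgi-bin and Perl; the example below uses Python instead, purely for illustration. The form field name `sequence`, the placeholder analysis and all function names are assumptions, and a real server would sit behind a web framework or CGI gateway that is not shown.

```python
from html import escape
from urllib.parse import parse_qs


def parse_submission(query_string):
    """Parse the submitted form data and return the cleaned sequence."""
    fields = parse_qs(query_string)
    raw = fields.get("sequence", [""])[0]  # "sequence" is a hypothetical field name
    return "".join(raw.split()).upper()    # drop whitespace and line breaks


def run_analysis(sequence):
    """Placeholder for the real analysis program; here it only reports the length."""
    return {"length": len(sequence)}


def render_results(sequence, results):
    """Build the HTML page that is sent back to the browser."""
    return "<html><body><p>Sequence: {}</p><p>Length: {}</p></body></html>".format(
        escape(sequence), results["length"]
    )


def handle_request(query_string):
    """Chain the steps: parse the data, invoke the program, return the results."""
    sequence = parse_submission(query_string)
    return render_results(sequence, run_analysis(sequence))
```

Escaping the echoed sequence with `html.escape` keeps user input from being interpreted as markup in the results page.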
PickFold
Predict a fold for an amino acid sequence
To develop a fold recognition technique that is sensitive in detecting folds of sequences in the twilight zone (sequences sharing less than 25% identity).
Workflow
[Diagram of the PickFold workflow with the components: New Sequence, 1D-Environment Sequence, 3D-Profiles, Fold library, Scoring Table, PickFold, FOLD PREDICTION, Annotate New sequence]
PickFold
Sequence to Protein Fold
Follows …
Biological Crisis Management
Future Perspective of Biological Crisis Management
Follows …
Applications of Bioinformatics
Agricultural
  Genetically modified plants, vegetables
  GM food
Pharmaceutical
  Molecular modelling based drug discovery
Medical
  Gene therapy
Bioinformatics Skills
Algorithm development
  Coding – Testing – Documentation
Programming skills in C, C++, Java, …
  Data structures – sorting, searching, statistics & probability
Database management
  Creation, compilation, updating & web based search
  CGI-bin scripts, JavaScript, Perl, JDBC, ASP, ...
Graphics
  2D & 3D graphics - GUI
Web page design & automatic web servers
  Java applets, JavaScript, Java servlets, RMI, …
Commercial products - packages/tools – sales !!
Important Bioinformatics Resources
NCBI, NIH - www.ncbi.nlm.nih.gov
EMBL, EBI - www.ebi.ac.uk
ExPASy, Swiss - www.expasy.org
DDBJ - www.ddbj.nig.ac.jp
PDB - www.rcsb.org/pdb
GCG - www.gcg.com
BIOINFORMATICS JOBS
Bioinformatics Scientist / Analyst
Bio-programmer
Bioinformatics software engineer
Web Developer
Network Programmer
Database Programmer
System Engineer / Analyst
BT versus IT
Bioinformatics, including Biotechnology (BT), requires a lot of Information Technology (IT) skills for genomic annotation projects
Bioinformatics is also one of the potential areas for IT professionals
Genome projects will be the next huge task for the IT industry (like the Y2K problem in the past)
BT will take on IT soon … in the near future …
Conclusions
Developing web based bioinformatics tools
  Develop/modify useful algorithms
  Generate computer source code
  Create/maintain a web based server
Using existing web based tools efficiently
Bio-ethics & bio-safety
  Always ensure that you neither develop nor use any bioinformatics tool that is harmful to the environment and society
Lecture Notes
Available at ICGEBnet
Distant homology
  www.icgeb.org/~netsrv/courseware/Title.htm
Biorithms
  www.icgeb.org/~netsrv/courseware/biorithms/index.htm
Thank You