Dr. S. Parthasarathy, NIT, Trichy

12/04/23

An introduction to BIOinformatics AlgoRITHMS

Dr. S. Parthasarathy
National Institute of Technology
Tiruchirappalli – 620 015
(E-mail: [email protected])



Plan

Introduction
Overview of Bioinformatics
Bioinformatics Algorithms
    Pairwise Sequence Alignment
    Database Search
PickFold – Sequence to Protein Fold
Future Perspective of Biological Crisis Management – Bioinformatics point of view
Conclusion



Introduction: Biological Data

On 26 June 2000: announcement of the completion of the draft of the 'Human Genome'

The Human Genome contains 3.2 x 10^9 bps

Units of (genome) sequence length
    bps (base pairs)
    Mbps (Mega base pairs) = 10^6 bps
    Gbps (Giga base pairs) = 10^9 bps
    huge (human genome equivalent) = 3.2 Gbps
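
The unit definitions above amount to simple arithmetic. A minimal sketch of the conversion follows; the function name `convert_length` and the constant names are illustrative only and are not part of the talk.

```python
# Unit definitions as given on the slide (1 "huge" = one human genome = 3.2 Gbps).
BPS_PER_MBPS = 10**6
BPS_PER_GBPS = 10**9
BPS_PER_HUGE = 32 * 10**8  # 3.2 x 10^9 bps, kept as an exact integer


def convert_length(bps):
    """Express a sequence length given in base pairs in the larger units."""
    return {
        "Mbps": bps / BPS_PER_MBPS,
        "Gbps": bps / BPS_PER_GBPS,
        "huge": bps / BPS_PER_HUGE,
    }


if __name__ == "__main__":
    # 16 Gbps is the GenBank size quoted elsewhere in this talk.
    print(convert_length(16 * 10**9))
```

With these definitions, a 16 Gbps database corresponds to 5 human genome equivalents.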



Biological Data Explosion

[Figure: Growth of GenBank. Cumulative size (Mbps) by year, 1991 to 2001; y-axis from 0 to 20000.]

[Figure: Growth of Protein Data Bank. Cumulative size by year, 1991 to 2001; y-axis from 0 to 20000.]



Biological Data explosion

GenBank, NCBI, USA --- 16 Gbps
    GenBank, National Center for Biotechnology Information, USA

PDB, RCSB, USA --- 16,000 structures
    PDB, Research Collaboratory for Structural Bioinformatics, USA

QUALITY - HIGH
    Experimental error in modern genomic sequencing is extremely low

QUANTITY - HUGE
    With genomic sequencing & recombinant DNA technology, the size of sequence databases is increasing very rapidly.



Bioinformatics - Definition

F(i,j) = max { F(i-1, j-1) + s(x_i, y_j),  F(i-1, j) - d,  F(i, j-1) - d }

Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological data. We use computer programs to make inferences from biological data, to make connections among them and to derive useful and interesting predictions.

The marriage of biology and computer science has created a new field called ‘Bioinformatics’.



Bioinformatics Goals

To understand integrative aspects of the biology of organisms, viewed as coherent complex structures

To interrelate sequences, 3-D structures, interactions and functions of proteins, nucleic acids and their complexes

To study the evolution of biological systems

To support applications in agricultural, pharmaceutical and other scientific fields



Biological Systems: Overview

CELLS

ORGANISMS

SPECIES

ECOSYSTEMS

BIOSPHERE



Biology: Basic Definitions

Cell
    The building block of living organisms.
    Eukaryotic cells or organisms have the nucleus separated from the cytoplasm by a nuclear membrane, and the genetic material borne on a number of chromosomes consisting of DNA and protein.

Chromosome
    The physical basis of heredity.
    Deeply staining rod-like structures present within the nuclei of eukaryotes.
    Contains DNA and protein arranged in a compact manner.
    Replicates identically during cell division.
    The same number of chromosomes is present in the cells of a particular species (e.g. human: 22 autosomes, X and Y).



Genome: Basic Definitions

Gene
    One of the units of inherited material carried on chromosomes.
    Genes are arranged in a linear fashion on DNA. Each represents one character, which is recognized by its effect on the individual bearing the gene in its cells. There are many thousands of genes in each nucleus.

Genome
    A set of chromosomes inherited from one parent.

DNA
    Deoxyribonucleic acid, made up of FOUR bases: a, t, g, c (adenine, thymine, guanine, cytosine).

Proteins
    Made up of TWENTY different amino acids: A, T, G, C, … (alanine, threonine, glycine, cysteine, …).



Bioinformatics Tasks

Sequence Analysis
    Similarity & Homology – pairwise local/global alignment
        • Scoring Matrices - PAM, BLOSUM
    Database Search
        BLAST, FASTA, GCG
    Multiple alignment
        ClustalW, PRINTS, BLOCKS

Secondary Structure Prediction
    Proteins – α-Helix, β-Sheet, Turn or coil

Protein Folding




Structure Analysis
    X-ray crystallography – 3-dimensional coordinates – Structure
    PDB – Protein Data Bank
    RasMol – Molecular Viewing Software

Protein Structure Databases
    SCOP - Structural Classification Of Proteins
    CATH - Class, Architecture, Topology, Homologous superfamily
    FSSP - Fold classification based on Structure-Structure alignment of Proteins – obtained by DALI (Distance-matrix ALIgnment)




Protein Engineering
    Mutations
        Alter a particular amino acid/base for the desired effect
    Site-directed mutagenesis
        Identify the potential sites where alterations can be made

DNA Bending
    Application to Genomes



Sequence similarity, homology and alignments

Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo. As a result, there is often significant similarity between a new sequence and already known sequences.

Similarity – Measurement of resemblance and differences, independent of the source of resemblance.

Homology – The sequences and the organisms in which they occur are descended from a common ancestor.

If two related sequences are homologous, then we can transfer information about structure and/or function, by homology.



Sequence Comparison Issues

Types of alignment
    Global – end-to-end matching (Needleman-Wunsch)
    Local – portions or subsequences matching (Smith-Waterman)

Scoring system used to rank alignments
    PAM & BLOSUM matrices

Algorithms used to find optimal (or good) scoring alignments
    Heuristic
    Dynamic Programming
    Hidden Markov Model (HMM)

Statistical methods used to evaluate the significance of an alignment score
    Z-score, E-value, etc.



Substitution Matrices

PAM (Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)

Relationship    BLOSUM    PAM
Close           90        40
Default         62        250
Distant         30        500



Types of Algorithms

Heuristic
    A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee.
    In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs.

Dynamic Programming
    The algorithm for finding optimal alignments given an additive alignment score (discussed shortly).
    These types of algorithms are guaranteed to find the optimal scoring alignment or set of alignments.

HMM
    Based on probability theory; very versatile.



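Global Alignment: Needleman-Wunsch Algorithm

Formula

F(i, j) = max {
    F(i-1, j-1) + s(x_i, y_j)    [D]
    F(i-1, j) - d                [H]
    F(i, j-1) - d                [V]
}

Cells used to compute F(i,j):

    F(i-1,j-1)    F(i,j-1)
    F(i-1,j)      F(i,j)

Moves into F(i,j): D from F(i-1,j-1), V from F(i,j-1), H from F(i-1,j).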



Global Alignment: Needleman-Wunsch Algorithm

Gap penalties
    Linear score: f(g) = -gd
    Affine score: f(g) = -d - (g-1)e
    d = gap open penalty, e = gap extend penalty, g = gap length

Trace back
    Take the value in the bottom right corner and trace back to the top left corner (i.e. always align end to end).

Algorithm complexity
    It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
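
The recurrence, gap penalties and trace back described above can be put together in a short program. What follows is a minimal sketch, not the speaker's implementation: it assumes a linear gap penalty d, integer scores, and a hypothetical scoring function `simple_score` (+1 match, -1 mismatch) standing in for a PAM or BLOSUM matrix. The helper `affine_gap` only illustrates the affine formula; the alignment itself uses the linear penalty.

```python
def linear_gap(g, d):
    """Linear gap score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score f(g) = -d - (g-1)*e (d = open, e = extend)."""
    return -d - (g - 1) * e


def simple_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def needleman_wunsch(x, y, score=simple_score, d=2):
    """Global alignment with a linear gap penalty d (integer scores assumed)."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]; O(nm) memory.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = linear_gap(i, d)
    for j in range(1, m + 1):
        F[0][j] = linear_gap(j, d)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # Trace back from the bottom right corner to the top left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

When several alignments share the optimal score, this sketch reports only one of them; the order of the checks in the trace back decides which.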



Local Alignment: Smith-Waterman Algorithm

Same as the global alignment algorithm, with TWO differences.

1. F(i,j) takes the value 0 (zero) if all other options have a value less than 0.
2. The alignment can end anywhere in the matrix. Take the highest value of F(i,j) over the whole matrix and start the trace back from there.
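
The two differences can be seen in a small program. This is a sketch under stated assumptions rather than a reference implementation: it uses a linear gap penalty d, integer scores, and a hypothetical scoring function `local_score` (+2 match, -1 mismatch). Stopping the trace back at the first cell with value 0 is the usual convention and is assumed here; the slide does not state it.

```python
def local_score(a, b):
    """Hypothetical scoring function: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def smith_waterman(x, y, score=local_score, d=2):
    """Local alignment with a linear gap penalty d (integer scores assumed)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Difference 1: a cell never drops below 0.
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Difference 2: trace back from the highest cell, not the corner.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("CCCATGCCC", "TTTATGTTT"))
```

For two sequences with no residue in common, the best score is 0 and the reported alignment is empty.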



Sequence Database Search

Heuristic sequence database searching packages
    BLAST & FASTA

Significance of Score
    Z-score = (score - mean) / std. dev.
        Measures how unusual our original match is. Z ≥ 5 are significant.
    P-value measures the probability that the alignment is no better than random. (Z and P depend on the distribution of the scores.)
        P ≤ 10^-100: exact match.
    E-value is the expected number of sequences that give the same Z-score or better. (E = P x size of the database)
        E ≤ 0.02: sequences probably homologous.
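
The Z-score and E-value formulas above can be evaluated directly. The sketch below is illustrative only: the background scores are made-up numbers standing in for the scores of unrelated (for example, shuffled) sequences, the use of the population standard deviation is an assumption, and the function names are not taken from BLAST or FASTA. The cut-offs 5 and 0.02 are the values quoted on the slide, read as "at least 5" and "at most 0.02".

```python
from statistics import mean, pstdev


def z_score(score, background_scores):
    """Z-score = (score - mean) / std. dev. of the background scores."""
    return (score - mean(background_scores)) / pstdev(background_scores)


def e_value(p_value, database_size):
    """E = P x size of the database."""
    return p_value * database_size


def is_significant(z, threshold=5.0):
    """Threshold taken from the slide; direction of comparison assumed."""
    return z >= threshold


def probably_homologous(e, threshold=0.02):
    """Threshold taken from the slide; direction of comparison assumed."""
    return e <= threshold


if __name__ == "__main__":
    background = [10, 12, 14, 16, 18]  # hypothetical scores of random matches
    z = z_score(30, background)
    e = e_value(1e-7, 100_000)  # hypothetical P-value and database size
    print(z, is_significant(z))
    print(e, probably_homologous(e))
```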



Web based server development

Design the web page to get the data
Use cgi-bin or Perl script to parse the submitted data
Invoke the corresponding program to get the appropriate results
Send the results either by e-mail or to the web page directly
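
The parse, invoke and respond steps can be sketched in a few lines. This example uses Python instead of the Perl or cgi-bin scripts named on the slide, purely for illustration. The form field name `sequence`, the function names and the placeholder analysis `run_analysis` are all hypothetical; a real server would call the actual bioinformatics program at that point. Designing the web form and sending e-mail are not shown.

```python
from html import escape
from urllib.parse import parse_qs


def parse_submission(query_string):
    """Parse the submitted form data and return a cleaned sequence."""
    fields = parse_qs(query_string)
    raw = fields.get("sequence", [""])[0]  # "sequence" is an assumed field name
    sequence = "".join(raw.split()).upper()  # drop whitespace and line breaks
    if not sequence.isalpha():
        raise ValueError("sequence must contain letters only")
    return sequence


def run_analysis(sequence):
    """Placeholder for the program that the server would invoke."""
    return {"length": len(sequence)}


def render_results(sequence, results):
    """Build the HTML page that is sent back to the user."""
    items = "".join(
        f"<li>{escape(str(key))}: {escape(str(value))}</li>"
        for key, value in results.items()
    )
    return f"<html><body><p>Sequence: {escape(sequence)}</p><ul>{items}</ul></body></html>"


def handle_submission(query_string):
    """Parse the data, invoke the program, and return the results page."""
    sequence = parse_submission(query_string)
    return render_results(sequence, run_analysis(sequence))


if __name__ == "__main__":
    print(handle_submission("sequence=acg+t%0Agca"))
```

Validating the input before invoking any program matters on a public server, since the submitted text comes from an untrusted source.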



PickFold

Predict a fold for an amino acid sequence

To develop a fold recognition technique that is sensitive in detecting folds of sequences in the twilight zone (sequences sharing less than 25% identity).
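
The twilight zone criterion refers to the percentage of identical residues between two aligned sequences. As a minimal sketch, the function below computes it from two already aligned sequences. Definitions of percent identity differ in the choice of denominator; counting only the columns without a gap is an assumption made here, and the function names are illustrative.

```python
def percent_identity(aligned_x, aligned_y, gap="-"):
    """Percent of identical residues over the aligned columns without gaps."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b) for a, b in zip(aligned_x, aligned_y) if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def in_twilight_zone(aligned_x, aligned_y, threshold=25.0):
    """True when the pair shares less than 25% identity (value from the slide)."""
    return percent_identity(aligned_x, aligned_y) < threshold


if __name__ == "__main__":
    print(percent_identity("ACDE-F", "ACDQGF"))
    print(in_twilight_zone("ACDEFGHI", "AQQQQQQQ"))
```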



Workflow

[Figure: PickFold workflow. Elements shown: New Sequence, Fold library, 3D-Profiles, 1D-Environment Sequence, Scoring Table, PickFold, FOLD PREDICTION, Annotate New sequence.]



PickFold – Sequence to Protein Fold

Follows …



Biological Crisis Management

Future Perspective of Biological Crisis Management

Follows …



Applications of Bioinformatics

Agricultural
    Genetically Modified Plants, Vegetables
    GM Food

Pharmaceutical
    Molecular Modelling based Drug Discovery

Medical
    Gene Therapy



Bioinformatics Skills

Algorithm development
    Coding – Testing – Documentation

Programming Skills in C, C++, Java, …
    Data Structures – Sorting, Searching, Statistics & Probability

Database Management
    Creation, Compilation, Updating & Web-based search
    CGI bin scripts, JavaScript, Perl, JDBC, ASP, ...

Graphics
    2D & 3D graphics - GUI

Web page design & Automatic Web servers
    Java Applets, JavaScript, Java Servlets, RMI, …

Commercial Products - Package/Tools – Sales !!



Important Bioinformatics Resources

NCBI, NIH - www.ncbi.nlm.nih.gov
EMBL, EBI - www.ebi.ac.uk
ExPasy, Swiss - www.expasy.org
DDBJ - www.ddbj.nig.ac.jp
PDB - www.rcsb.org/pdb
GCG - www.gcg.com



BIOINFORMATICS JOBS

Bioinformatics Scientist / Analyst
Bio-programmer
Bioinformatics software engineer
Web Developer
Network Programmer
Database Programmer
System Engineer / Analyst



BT versus IT

Bioinformatics, including Biotechnology (BT), requires a lot of Information Technology (IT) skills for genomic annotation projects

Bioinformatics is one of the potential areas for IT professionals also

Genome Projects will be the next huge task for IT industries (like the Y2K problem in the past)

BT will take on IT soon … in the near future …



Conclusions

Developing Web based Bioinformatics tools
    Develop/modify useful algorithms
    Generate computer source codes
    Create/Maintain Web based server

Using existing Web based tools efficiently

Bio-ethics & Bio-safety
    Ensure always that any bioinformatics tool harmful to environment & society has neither been developed nor used by you






Lecture Notes

Available at ICGEBnet

Distant homology
    www.icgeb.org/~netsrv/courseware/Title.htm

Biorithms
    www.icgeb.org/~netsrv/courseware/biorithms/index.htm



Thank You
