An Introduction to BIOinformatics AlgoRITHMS

Dr. S. Parthasarathy, NIT, Trichy (20/06/22)
National Institute of Technology, Tiruchirappalli – 620 015 (E-mail: [email protected])

Upload: guest3bd2a12

Post on 11-May-2015

Page 1: An Introduction To

An Introduction to BIOinformatics AlgoRITHMS

Dr. S. Parthasarathy
National Institute of Technology, Tiruchirappalli – 620 015
(E-mail: [email protected])

Page 2: An Introduction To

Plan

Introduction
Overview of Bioinformatics
Bioinformatics Algorithms
  Pairwise Sequence Alignment
  Database Search
PickFold – Sequence to Protein Fold
Future Perspective of Biological Crisis Management – Bioinformatics point of view
Conclusion

Page 3: An Introduction To

Introduction: Biological Data

On 26 June 2000: announcement of the completion of the draft of the ‘Human Genome’
The human genome contains 3.2 × 10^9 bps
Units of (genome) sequence length
  bps (base pairs)
  Mbps (mega base pairs) = 10^6 bps
  Gbps (giga base pairs) = 10^9 bps
  huge (human genome equivalent) = 3.2 Gbps

Page 4: An Introduction To

Biological Data Explosion

[Figure: Growth of GenBank. Cumulative size (Mbps) versus year, 1991–2001; vertical axis 0–20000.]
[Figure: Growth of Protein Data Bank. Cumulative size versus year, 1991–2001; vertical axis 0–20000.]

Page 5: An Introduction To

Biological Data Explosion

GenBank, NCBI (National Center for Biotechnology Information), USA: 16 Gbps
PDB, RCSB (Research Collaboratory for Structural Bioinformatics), USA: 16,000 structures

QUALITY - HIGH
  Experimental error in modern genomic sequencing is extremely low.
QUANTITY - HUGE
  With genomic sequencing and recombinant DNA technology, the size of sequence databases is increasing very rapidly.

Page 6: An Introduction To

Bioinformatics - Definition

\[
F(i,j) = \max \{\, F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\}
\]

Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological data. We use computer programs to make inferences from biological data, to make connections among them and to derive useful and interesting predictions.

The marriage of biology and computer science has created a new field called ‘Bioinformatics’.

Page 7: An Introduction To

Bioinformatic Goals

To understand integrative aspects of the biology of organisms, viewed as coherent complex structures

To interrelate sequences, 3-D structures, interactions and functions of proteins, nucleic acids and their complexes

To study the evolution of biological systems

To support applications in agricultural, pharmaceutical and other scientific fields

Page 8: An Introduction To

Biological Systems: Overview

SPECIES

ECO SYSTEMS

ORGANISMS

CELLS

BIOSPHERE

Page 9: An Introduction To

Biology: Basic Definitions

Cell: the building block of living organisms
  Eukaryotic cells or organisms have the nucleus separated from the cytoplasm by a nuclear membrane, and the genetic material borne on a number of chromosomes consisting of DNA and protein
Chromosome: the physical basis of heredity
  Deeply staining rod-like structures present within the nuclei of eukaryotes
  Contain DNA and protein arranged in a compact manner
  Replicate identically during cell division
  The same number of chromosomes is present in the cells of a particular species (e.g. Human: 22, X and Y)

Page 10: An Introduction To

Genome: Basic Definitions

Gene: one of the units of inherited material carried on chromosomes
  Genes are arranged in a linear fashion on DNA. Each represents one character, which is recognized by its effect on the individual bearing the gene in its cells. There are many thousands of genes in each nucleus.
Genome: a set of chromosomes inherited from one parent
DNA: deoxyribonucleic acid, made up of FOUR bases
  a t g c – adenine, thymine, guanine, cytosine
Proteins: made up of TWENTY different amino acids
  A T G C … – Alanine, Threonine, Glycine, Cysteine, …
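
The two alphabets on this slide can be written down as lookup tables. The sketch below is only an illustration: the names `DNA_BASES`, `AMINO_ACIDS` and `spell_out` are hypothetical, and it adopts the slide's convention of lowercase letters for bases and uppercase one-letter codes for amino acids.

```python
# One-letter codes, following the slide's convention:
# lowercase for the four DNA bases, uppercase for the twenty amino acids.
DNA_BASES = {"a": "adenine", "t": "thymine", "g": "guanine", "c": "cytosine"}

AMINO_ACIDS = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "Q": "Glutamine", "E": "Glutamic acid", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
}


def spell_out(seq, alphabet):
    """Return the full name of every symbol in seq.

    Raises ValueError if seq contains a symbol that is not in the alphabet.
    """
    try:
        return [alphabet[symbol] for symbol in seq]
    except KeyError as err:
        raise ValueError(f"unknown symbol {err.args[0]!r}") from None
```

The same four letters read differently in each alphabet: `spell_out("atgc", DNA_BASES)` names the four bases, while `spell_out("ATGC", AMINO_ACIDS)` names four amino acids.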

Page 11: An Introduction To

Bioinformatics Tasks

Sequence Analysis
  Similarity & Homology
    Pairwise local/global alignment
    Scoring matrices - PAM, BLOSUM
  Database Search
    BLAST, FASTA, GCG
  Multiple alignment
    ClustalW, PRINTS, BLOCKS
Secondary Structure Prediction
  Proteins – α-Helix, β-Sheet, Turn or coil
Protein Folding

Page 12: An Introduction To

Bioinformatics Tasks

Structure analysis
  X-ray crystallography – 3-dimensional coordinates – Structure
  PDB – Protein Data Bank
  RasMol – Molecular Viewing Software
Protein Structure Databases
  SCOP - Structural Classification Of Proteins
  CATH - Class, Architecture, Topology, Homologous superfamily
  FSSP - Fold classification based on Structure-Structure alignment of Proteins – obtained by DALI (Distance-matrix ALIgnment)

Page 13: An Introduction To

Bioinformatics Tasks

Protein Engineering
  Mutations
    Alter a particular amino acid/base for a desired effect
  Site-directed mutagenesis
    Identify the potential sites where alterations can be made
DNA Bending
  Application to Genomes

Page 14: An Introduction To

Sequence similarity, homology and alignments

Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo. There exists significant similarity between a new sequence and already known sequences.

Similarity – Measurement of resemblance and differences, independent of the source of resemblance.

Homology – The sequences and the organisms in which they occur are descended from a common ancestor.

If two related sequences are homologous, then we can transfer information about structure and/or function, by homology.
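
Similarity, unlike homology, is something we can measure directly. One common measure is percent identity over an existing alignment. The sketch below is a minimal illustration with a hypothetical function name; it assumes the two sequences are already aligned to equal length with `-` as the gap symbol, and it divides by the number of alignment columns (other conventions for the denominator exist).

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns in which both sequences carry the same residue.

    Columns where both rows are gaps are ignored; a residue facing a gap
    counts as a non-identical column.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)
```

A high value measures resemblance only; whether the two sequences are homologous is a separate inference about common ancestry.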

Page 15: An Introduction To

Sequence Comparison Issues

Types of alignment
  Global – end-to-end matching (Needleman-Wunsch)
  Local – matching of portions or subsequences (Smith-Waterman)
Scoring system used to rank alignments
  PAM & BLOSUM matrices
Algorithms used to find optimal (or good) scoring alignments
  Heuristic
  Dynamic Programming
  Hidden Markov Model (HMM)
Statistical methods used to evaluate the significance of an alignment score
  Z-score, E-value, etc.

Page 16: An Introduction To

Substitution Matrices

PAM (Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)

Relationship   BLOSUM   PAM
Close          90       40
Default        62       250
Distant        30       500
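
A substitution matrix is used as a lookup table s(a, b) that scores each aligned pair of residues. The sketch below shows only the mechanics. The scores in `TOY_SCORES` are made-up illustrative numbers, not PAM or BLOSUM values, and the function names are hypothetical.

```python
def make_symmetric(scores):
    """Return a copy of scores in which (b, a) has the same value as (a, b)."""
    full = dict(scores)
    for (a, b), value in scores.items():
        full[(b, a)] = value
    return full


# Made-up scores for three residues, for illustration only.
# These are NOT taken from any PAM or BLOSUM matrix.
TOY_SCORES = make_symmetric({
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("L", "I"): 2, ("L", "D"): -4, ("I", "D"): -3,
})


def ungapped_score(seq_a, seq_b, matrix):
    """Sum the substitution scores of two equal-length, gap-free sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))
```

In practice one would load a published matrix appropriate to the expected divergence (for example a BLOSUM matrix with a higher number for close sequences) instead of the toy table.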

Page 17: An Introduction To

Types of Algorithms

Heuristic
  A heuristic is an algorithm that yields reasonable results, even if it is not provably optimal or lacks even a performance guarantee.
  In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs.
Dynamic Programming
  The algorithm for finding optimal alignments, given an additive alignment score (discussed on the following slides).
  These types of algorithms are guaranteed to find the optimal scoring alignment or set of alignments.
HMM
  Based on probability theory – very versatile.

Page 18: An Introduction To

Global Alignment: Needleman-Wunsch Algorithm

Formula

\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) & \text{(D)} \\
F(i-1,j) - d & \text{(H)} \\
F(i,j-1) - d & \text{(V)}
\end{cases}
\]

[Diagram: cell F(i,j) and its three predecessor cells F(i-1,j-1) (D), F(i-1,j) (H) and F(i,j-1) (V)]
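
The recurrence can be turned into a table-filling loop. The sketch below is a minimal illustration, not the author's implementation. The function names and the match/mismatch scores of +1/-1 are assumptions, and so are the boundary values F(i,0) = -i·d and F(0,j) = -j·d, which the slide does not state but which are the usual choice for a linear gap penalty d.

```python
def simple_score(a, b):
    """Assumed toy scoring function s(a, b): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def nw_fill(x, y, score, d):
    """Fill the (len(x)+1) x (len(y)+1) matrix F of the global alignment recurrence."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # D: align x_i with y_j
                F[i - 1][j] - d,                              # H: x_i against a gap
                F[i][j - 1] - d,                              # V: y_j against a gap
            )
    return F
```

The optimal global alignment score is the bottom-right entry, `nw_fill(x, y, simple_score, 1)[-1][-1]`.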

Page 19: An Introduction To

Global Alignment: Needleman-Wunsch Algorithm

Gap penalties
  Linear score: f(g) = -gd
  Affine score: f(g) = -d - (g-1)e
  d = gap open penalty, e = gap extend penalty, g = gap length
Trace back
  Take the value in the bottom right corner and trace back to the top left corner (i.e. always align end to end).
Algorithm complexity
  It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
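
The two gap-penalty formulas and the trace back can be sketched together. This is a minimal illustration under stated assumptions: all function names are hypothetical, scores are integers so that equality tests in the trace back are exact, and the alignment itself uses only the linear penalty (an affine-gap alignment needs additional state per cell and is not shown).

```python
def linear_gap(g, d):
    """Linear gap score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score f(g) = -d - (g-1)*e: open once, then extend."""
    return -d - (g - 1) * e


def simple_score(a, b):
    """Assumed toy scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def global_align(x, y, score, d):
    """Global alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = linear_gap(i, d)
    for j in range(1, m + 1):
        F[0][j] = linear_gap(j, d)
    for i in range(1, n + 1):          # O(nm) time, O(nm) memory
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Trace back from the bottom-right corner to the top-left corner.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))
```

When several predecessor cells tie, this sketch prefers the diagonal move, so it reports one optimal alignment out of possibly many.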

Page 20: An Introduction To

Local Alignment: Smith-Waterman Algorithm

Same as the global alignment algorithm, with TWO differences:
1. F(i,j) takes the value 0 (zero) if all other options have a value less than 0.
2. An alignment can end anywhere in the matrix. Take the highest value of F(i,j) over the whole matrix and start the trace back from there.
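
The two differences translate into two small changes to the global alignment code. The sketch below is a minimal illustration with hypothetical names and an assumed scoring of +2 for a match and -1 for a mismatch. It also assumes the usual stopping rule, not stated on the slide, that the trace back ends when it reaches a cell whose value is 0.

```python
def match_score(a, b):
    """Assumed toy scoring function: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def local_align(x, y, score, d):
    """Local alignment with linear gap penalty d; returns (score, part_x, part_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                                   # difference 1
                          F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:                                 # difference 2
                best, best_pos = F[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    part_x, part_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            part_x.append(x[i - 1]); part_y.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            part_x.append(x[i - 1]); part_y.append("-"); i -= 1
        else:
            part_x.append("-"); part_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(part_x)), "".join(reversed(part_y))
```

For example, `local_align("TTACGTT", "GGACGAA", match_score, 2)` picks out the shared segment `ACG` and ignores the unrelated flanks.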

Page 21: An Introduction To

Sequence Database Search

Heuristic sequence database searching packages
  BLAST & FASTA
Significance of Score
  Z-score = (score – mean) / std. dev.
    Measures how unusual our original match is. Z ≥ 5 are significant.
  P-value measures the probability that the alignment is no better than random. (Z and P depend on the distribution of the scores.)
    P ≤ 10^-100: exact match.
  E-value is the expected number of sequences that give the same Z-score or better. (E = P × size of the database)
    E ≤ 0.02: sequences probably homologous
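
The two formulas on this slide are simple enough to compute directly. The sketch below is only an illustration: the function names are hypothetical, the background scores are assumed to come from some set of unrelated or randomized comparisons (the slide does not say how the mean and standard deviation are obtained), and the population standard deviation is used as one possible choice.

```python
from statistics import mean, pstdev


def z_score(score, background_scores):
    """Z = (score - mean) / std. dev., computed against a list of background scores."""
    sigma = pstdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return (score - mean(background_scores)) / sigma


def e_value(p_value, database_size):
    """E = P x size of the database: expected number of equally good chance hits."""
    return p_value * database_size
```

Taking the slide's thresholds as Z ≥ 5 and E ≤ 0.02 (the direction of the comparisons is an assumption here), a match with score 20 against background scores `[8, 12, 8, 12]` has Z = 5, and a P-value of 1e-9 in a database of 1,000,000 sequences gives E = 0.001.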

Page 22: An Introduction To

Web based server development

Design the web page to get the data
Use a cgi-bin or Perl script to parse the submitted data
Invoke the corresponding program to get the appropriate results
Send the results either by e-mail or to the web page directly
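
The parse, invoke and respond steps can be sketched as one request-handling function. The slide names cgi-bin and Perl; the sketch below uses Python instead, purely for illustration. The form field name `sequence`, the function names and the stand-in analysis are all hypothetical, and the web-server wiring and e-mail delivery are left out.

```python
from html import escape
from urllib.parse import parse_qs


def analyse(sequence):
    """Hypothetical stand-in for 'the corresponding program': reports the length."""
    return {"length": len(sequence)}


def handle_submission(query_string):
    """Parse submitted form data, run the analysis and return an HTML fragment."""
    # Step 1: parse the submitted data (URL-encoded form fields).
    fields = parse_qs(query_string)
    raw = fields.get("sequence", [""])[0]
    sequence = "".join(raw.split()).upper()   # drop whitespace, normalise case
    if not sequence.isalpha():
        return "<p>Error: please submit a sequence of letters.</p>"
    # Step 2: invoke the program that produces the result.
    result = analyse(sequence)
    # Step 3: send the result back to the web page (escaped for safety).
    return f"<p>Sequence {escape(sequence)} has length {result['length']}.</p>"
```

Validating and escaping user input before echoing it back matters for any public server, whichever language the script is written in.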

Page 23: An Introduction To

PickFold

Predict a fold for an amino acid sequence

To develop a fold recognition technique that is sensitive in detecting folds of sequences in the twilight zone (sequences sharing less than 25% identity).

Page 24: An Introduction To

Workflow

Scoring Table

PickFold

Fold library

3D-Profiles

New Sequence

FOLD PREDICTION

1D-Environment Sequence

Annotate New sequence

Page 25: An Introduction To

PickFold

Sequence to Protein Fold
Follows …

Page 26: An Introduction To

Biological Crisis Management

Future Perspective of Biological Crisis Management

Follows …

Page 27: An Introduction To

Applications of Bioinformatics

Agricultural
  Genetically Modified Plants, Vegetables
  GM Food
Pharmaceutical
  Molecular Modelling based Drug Discovery
Medical
  Gene Therapy

Page 28: An Introduction To

Bioinformatics Skills

Algorithm development
  Coding – Testing – Documentation
Programming Skills in C, C++, Java, …
  Data Structures – Sorting, Searching, Statistics & Probability
Database Management
  Creation, Compilation, Updating & Web based search
  CGI bin scripts, JavaScript, Perl, JDBC, ASP, ...
Graphics
  2D & 3D graphics - GUI
Web page design & Automatic Web servers
  Java Applets, JavaScript, Java Servlets, RMI, …
Commercial Products - Package/Tools – Sales !!

Page 29: An Introduction To

Important Bioinformatics Resources

NCBI, NIH - www.ncbi.nlm.nih.gov
EMBL, EBI - www.ebi.ac.uk
ExPasy, Swiss - www.expasy.org
DDBJ - www.ddbj.nig.ac.jp
PDB - www.rcsb.org/pdb
GCG - www.gcg.com

Page 30: An Introduction To

BIOINFORMATICS JOBS

Bioinformatics Scientist / Analyst
Bio-programmer
Bioinformatics software engineer
Web Developer
Network Programmer
Database Programmer
System Engineer / Analyst

Page 31: An Introduction To

BT versus IT

Bioinformatics, including Biotechnology (BT), requires a lot of Information Technology (IT) skills for genomic annotation projects

Bioinformatics is one of the potential areas for IT professionals also

Genome Projects will be the next huge task for IT industries (like the Y2K problem in the past)

BT will take on IT soon … in the near future …

Page 32: An Introduction To

Conclusions

Developing Web based Bioinformatics tools
  Develop/modify useful algorithms
  Generate computer source code
  Create/maintain Web based servers
Using existing Web based tools efficiently
Bio-ethics & Bio-safety
  Always ensure that no bioinformatics tool harmful to the environment and society is developed or used by you

Page 33: An Introduction To

References (latest)

N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, Ane Books, New Delhi (2005).
Arthur M. Lesk, Introduction to Bioinformatics, Oxford University Press, New Delhi (2003).
D. Higgins and W. Taylor (Eds), Bioinformatics: Sequence, Structure and Databanks, Oxford University Press, New Delhi (2000).
R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, Cambridge, UK (1998).
A. Baxevanis and B. F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Wiley-Interscience, Hoboken, NJ (1998).
Michael S. Waterman, Introduction to Computational Biology, Chapman & Hall (1995).
J. A. Glasel and M. P. Deutscher (Eds), Introduction to Biophysical Methods for Protein and Nucleic Acid Research, Academic Press, New York (1995).

Page 34: An Introduction To

Lecture Notes

Available at ICGEBnet
  Distant homology
    www.icgeb.org/~netsrv/courseware/Title.htm
  Biorithms
    www.icgeb.org/~netsrv/courseware/biorithms/index.htm

Page 35: An Introduction To

Thank You