Sequence Alignment Algorithms – Application to Bioinformatics Tool Development


Page 1: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Dr. S. Parthasarathy, Bharathidasan Univ., Trichy

16 FEB 2006

Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Dr. S. Parthasarathy
Reader and Head
Department of Bioinformatics
Bharathidasan University
Tiruchirappalli – 620 024

(E-mail: [email protected])

Page 2: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Plan

• Introduction to Bioinformatics
• Sequence alignment algorithms
  – Global alignment: Needleman-Wunsch algorithm
  – Local alignment: Smith-Waterman algorithm
• PredictFold: predict a fold for a protein sequence
  – Methodology
  – Algorithm, coding & tool development
  – Benchmarking
• Conclusions

Page 3: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Introduction

Why do we need Bioinformatics?

What is Bioinformatics?

Where is Bioinformatics used?

Page 4: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Why?

• Biological data explosion
  – How did the biological data explosion happen?
• Sequence databases are far larger than structure databases
  – Why so?

Page 5: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Introduction
Biological Data: Genome Projects

Latest revolution
• 26 June 2000: announcement of the completion of the draft of the ‘Human Genome’
  – ‘Genetic Code of Human Life is Cracked by Scientists’
• The human genome contains 3.2 x 10^9 bp

Unit of (genome) sequence length
• bp (base pairs)
• Mbp (mega base pairs) = 10^6 bp
• Gbp (giga base pairs) = 10^9 bp
• huge (human genome equivalent) = 3.2 Gbp

Unit of genetic distance
• centiMorgan (cM): an arbitrary unit named for Thomas Hunt Morgan (1 cM corresponds to a recombination frequency of 0.01)

Page 6: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Introduction
Biological Data: Genome Projects

[Images dated 16 February 2001 and 15 February 2001]

Page 7: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Biological Data: Recombinant DNA Technology

Old revolution
• 1940 – Role of DNA as the genetic material was confirmed
• 1953 – Discovery of the DNA structure by James Watson & Francis Crick
• 1966 – Establishment of the Genetic Code
• 1967 – DNA ligase was isolated (joins two strands of DNA together): Molecular Glue
• 1970 – Isolation of a restriction enzyme: Molecular Scissors
• 1972 – Recombinant DNA molecules were generated at Stanford University, USA
• 1973 – DNA fragments were joined to the plasmid pSC101 isolated from E. coli; the resulting molecules could replicate when introduced into E. coli

The discoveries of 1972 & 1973 triggered the biggest scientific revolution: Genetic Engineering

Page 8: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Biological Data Explosion

• GenBank, National Center for Biotechnology Information (NCBI), USA
  – 44 Gbp of DNA & 40 million sequences (up to 2004)
• Protein Data Bank (PDB), Research Collaboratory for Structural Bioinformatics (RCSB), USA
  – 29,000 structures (2004)
• QUALITY of data: HIGH
  – Experimental error in modern genomic sequencing is extremely low
• QUANTITY of data: HUGE
  – With recombinant DNA technology & genomic sequencing, the size of sequence databases is increasing very rapidly
• SEQUENCE versus STRUCTURE databases
  – Sequence databases are far larger than structure databases
  – This leads to Bioinformatics

Page 9: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

What?

What is Bioinformatics?

Define Bioinformatics

Page 10: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Bioinformatics - Definition

Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological data. We use computer programs to make inferences from biological data, to make connections among them and to derive useful and interesting predictions.

The marriage of biology and computer science has created a new field called ‘Bioinformatics’. - Arthur M. Lesk

[Figure: decorative graphic containing the word ‘Bioinformatics’, the recurrence F(i,j) = max { F(i-1, j-1) + s(xi,yj), F(i-1, j) – d, F(i, j-1) – d }, the protein string PEPTIDESE QSEDITPEP and the DNA string atcggcatgcatcagtcatgcaactg]

Page 11: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Biology: Basic Definitions

• Cell
  – The building block of living organisms
  – Eukaryotic cells or organisms have the nucleus separated from the cytoplasm by a nuclear membrane, and the genetic material is borne on a number of chromosomes consisting of DNA and protein
• Chromosome
  – The physical basis of heredity
  – Deeply staining rod-like structures present within the nuclei of eukaryotes
  – Contains DNA and protein arranged in a compact manner
  – Replicates identically during cell division
  – The same number of chromosomes is present in the cells of a particular species (e.g. Human: 22, X and Y)

Page 12: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Genome: Basic Definitions

• Genome
  – A complete set of chromosomes inherited from one parent
• Gene
  – One of the units of inherited material carried by chromosomes. Genes are arranged in a linear fashion on DNA. Each represents one character, which is recognized by its effect on the individual bearing the gene in its cells. There are many thousands of genes in each nucleus.
• DNA (deoxyribonucleic acid)
  – DNA is made up of FOUR bases
  – a t g c: adenine, thymine, guanine, cytosine
• Protein
  – Protein is made up of TWENTY different amino acids
  – A T G C ...: Alanine, Threonine, Glycine, Cysteine, …

Page 13: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Central Dogma

DNA       CCTGAGCCAACTATTGATGAA
   ↓ transcription
mRNA      CCUGAGCCAACUAUUGAUGAA
   ↓ translation
Protein   PEPTIDE
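
The two steps on this slide can be traced in a few lines of Python. This is only a minimal sketch: the function names `transcribe` and `translate` are hypothetical, and the codon table is deliberately partial, holding just the codons of the standard genetic code that occur in the slide's example.

```python
# Partial codon table (standard genetic code): only the codons that occur
# in the slide's example. A complete table has 64 entries; this subset is
# an illustrative assumption, not part of the original slides.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}


def transcribe(dna: str) -> str:
    """Return the mRNA that matches the DNA coding strand (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon into one-letter amino acid codes."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete final codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

A codon outside the partial table raises `KeyError`, which is acceptable for a sketch but would need a full table in real use.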

Page 14: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Genome Data: Human & Model Organisms

Most mapping and sequencing technologies were developed from studies of simpler non-human organisms.

Non-human / model organisms
• Bacterium, Escherichia coli: 4.6 Mbp
• Yeast, Saccharomyces cerevisiae: 12.1 Mbp
• Fruit fly, Drosophila melanogaster: 180.0 Mbp
• Roundworm, C. elegans: 95.5 Mbp
• Laboratory mouse, Mus musculus: 3.0 Gbp

H: more complex genome
• Human, Homo sapiens: 3.2 Gbp

Page 15: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Genome Data: Human (Homo sapiens)

Genome         1
Chromosomes    23
Genes / DNAs   ~ 30,000
Nucleotides    3.2 x 10^9 bp

Page 16: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Bioinformatics in Genome Research

Data Collection and Interpretation
• Collecting and storing data
  – Sequence generated by genome research will be used as the primary information source for human biology and medicine
  – The vast amount of data produced will first need to be collected, stored and distributed
• Interpretation of data
  – Recognizing where genes begin and end
  – Searching a database for a particular DNA sequence may uncover homologous sequences in a known gene from a model organism, revealing insights into the function of the corresponding human gene

Page 17: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Understanding Gene Function

Correct protein function depends on the 3-D or folded structure the protein assumes in biological environments

Understanding protein structure will be essential in determining gene function

[Diagram linking Gene, Protein, Structure and Function]

Page 18: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Where?

Where is Bioinformatics used?

What are the uses of Bioinformatics?

Applications of Bioinformatics

Page 19: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Bioinformatics Tasks

• Sequence analysis (protein sequences)
  – Similarity & homology: pairwise local/global alignment
    (GCG: Seqlab & Seqweb; scoring matrices: PAM, BLOSUM)
  – Database search: BLAST, FASTA
  – Multiple alignment: ClustalW, PRINTS, BLOCKS
• Secondary structure prediction (from sequence)
  – Proteins: α-helix, β-sheet, turn or coil
• Protein folding

Page 20: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Bioinformatics Tasks

• Structure analysis: experimental determination
  – X-ray crystallography: 3-dimensional coordinates (structure)
  – Nuclear Magnetic Resonance (NMR)
  – PDB: Protein Data Bank
  – RasMol: molecular viewing software
• High-throughput crystallographic structure determination
  – High-flux synchrotron radiation sources (data collection)
  – Multi-wavelength anomalous diffraction (MAD) method (data interpretation)
• Bioinformatics: structure prediction
  – Homology modelling: InsightII, SwissPDBViewer, Biosuite
  – ‘ab initio’ method: Monte Carlo simulation
• Protein structure classification
  – SCOP: Structural Classification Of Proteins
  – CATH: Class, Architecture, Topology, Homologous superfamily
  – FSSP: Fold classification based on Structure-Structure alignment of Proteins, obtained by DALI (Distance-matrix ALIgnment)

Page 21: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Bioinformatics Tasks

• Protein engineering
  – Mutations: alter a particular amino acid/base for a desired effect
  – Site-directed mutagenesis: identify the potential sites where alterations can be made
• Applications
  – Agricultural: genetically modified plants, vegetables, GM food
  – Pharmaceutical: molecular-modelling-based drug design
  – Medical: gene therapy
• DNA bending
  – Application to genomes

(Ref: M.G. Munteanu, K. Vlahovicek, S. Parthasarathy, I. Simon and S. Pongor, Rod Models of DNA: Sequence-dependent anisotropic elastic modelling of local phenomena, Trends in Biochemical Sciences, 23 (1998) 341-347)

Page 22: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Bioinformatics Tasks: Genomics & Proteomics

Genomics is the study of the structure, content, evolution and functions of genes in genomes.

Aims of genomics
• To establish an integrated web-based database and research interface
• To assemble physical, genetic and cytological maps of the genome
• To identify and annotate the complete set of genes encoded within a genome
• To provide the resources for comparison with other genomes

Page 23: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Proteomics – Proteome

The proteome is the complete collection of proteins in a cell/tissue/organism at a particular time. Unlike genomes, which are stable over the lifetime of the organism, proteomes change rapidly as each cell responds to its changing environment and produces new proteins in different amounts.

The genome is a more stable entity. An organism has only one genome but many proteomes.

For an organism, there may be one body-wide proteome, about 200 tissue proteomes and about a trillion (~10^12) individual cell proteomes.

Page 24: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Proteomics – Definition

Proteomics is the study of proteomes. It includes determining the 3D shapes of proteins, their roles inside cells and the molecules with which they interact, and defining which proteins are present and how much of each is present at a given time.

Page 25: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Proteomics – Applications

• To correlate proteins on the basis of their expression profiles
• To observe patterns in protein synthesis; changes in these patterns can be used as an indicator of the state of the cell and its gene expression
• To characterize bacterial pathogens and to develop novel antimicrobials
• To identify regions of the bacterial genome that encode pathogenic determinants
• To develop drugs and to support toxicology (Structural Proteomics)
• To serve as a tool for plant genetics and breeding

Page 26: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Systems Biology

• Systems Biology is a new perspective and an emerging field of research in the post-genomic era.
• It aims at a system-level understanding of biological systems.
• It studies whole cells/tissues/organisms not by a traditional reductionist approach but by holistic means, in an iterative attempt to model the complete cell/tissue/organism.
• It treats a living system as an integrated and interacting network of genes, proteins and biochemical reactions which give rise to life.

Page 27: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Systems Biology

Page 28: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Sequence Alignment Algorithms

Similarity and Homology

Sequence Comparison - Issues

Types of alignments

Algorithms Used

Page 29: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Sequence similarity and homology

Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo. As a result, there is often significant similarity between a new sequence and already known sequences, which is fortunate for computational sequence analysis.

Similarity – Measurement of resemblance and differences, independent of the source of resemblance.

Homology – The sequences and the organisms in which they occur are descended from a common ancestor.

If two related sequences are homologous, then we can transfer information about structure and/or function, by homology.

Page 30: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

3-D Structure and Homology

3-D structure patterns (motifs) of proteins are much more evolutionarily conserved than amino acid sequences - This type of Homology search could prove more fruitful

Particular motifs may serve similar functions in several different proteins, information that would be valuable in genome analysis

Only a few protein motifs can be recognised at the sequence level

Development of more analytic capabilities to facilitate grouping protein sequences into motif families will make homology searches more useful

Page 31: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Sequence Comparison: Issues

• Types of alignment
  – Global: end-to-end matching (Needleman-Wunsch)
  – Local: portions or subsequences matching (Smith-Waterman)
• Scoring system used to rank alignments
  – PAM & BLOSUM matrices
• Algorithms used to find optimal (or good) scoring alignments
  – Heuristic
  – Dynamic programming
  – Hidden Markov Model (HMM)
• Statistical methods used to evaluate the significance of an alignment score
  – Z-score, P-value and E-value

Page 32: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Substitution Matrices

PAM (Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)

           Close   Default   Distant
BLOSUM     90      62        30
PAM        40      250       500

Page 33: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Types of Algorithms

• Heuristic
  – A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee.
  – In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs.
• Dynamic programming
  – An algorithm for finding optimal alignments given an additive alignment score (discussed on the following slides).
  – This type of algorithm is guaranteed to find the optimal scoring alignment or set of alignments.
• HMM
  – Based on probability theory; very versatile.

Page 34: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Global Alignment: Needleman-Wunsch Algorithm

Formula

$$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\ F(i-1,\,j) - d & \text{(H)} \\ F(i,\,j-1) - d & \text{(V)} \end{cases}$$

The labels D, H and V identify the neighbouring cell from which F(i,j) is reached: F(i-1,j-1) (D), F(i-1,j) (H) or F(i,j-1) (V).

Page 35: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Global Alignment: Needleman-Wunsch Algorithm

• Gap penalties
  – Linear score: f(g) = -gd
  – Affine score: f(g) = -d - (g-1)e
  – d = gap open penalty, e = gap extend penalty, g = gap length
• Trace back
  – Start from the value in the bottom-right corner and trace back to the top-left corner (i.e. always align end to end).
• Algorithm complexity
  – It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
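
The recurrence, the linear gap penalty and the trace back described on these two slides can be put together as follows. This is a minimal sketch rather than the code behind the talk: the function name `needleman_wunsch` and the match/mismatch scores are illustrative assumptions (a real tool would take s(xi,yj) from a substitution matrix such as PAM or BLOSUM).

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with a linear gap penalty d.

    The match/mismatch scores are illustrative assumptions standing in
    for a substitution matrix s(xi, yj).
    """
    n, m = len(x), len(y)
    # F[i][j] holds the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # D: align xi with yj
                          F[i - 1][j] - d,      # H: xi against a gap
                          F[i][j - 1] - d)      # V: yj against a gap
    # Trace back from the bottom-right corner to the top-left corner.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)   # 1
print(top)     # ACGT
print(bottom)  # A-GT
```

The full matrix F is kept because the trace back needs it, which is where the O(nm) memory cost comes from.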

Page 36: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Local Alignment: Smith-Waterman Algorithm

Same as the global alignment algorithm, with TWO differences:
1. F(i,j) takes the value 0 (zero) if all other options have a value less than 0.
2. An alignment can end anywhere in the matrix. Take the highest value of F(i,j) over the whole matrix and start the trace back from there.

Page 37: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Local Alignment: Smith-Waterman Algorithm

Formula

$$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\ F(i-1,\,j) - d & \text{(H)} \\ F(i,\,j-1) - d & \text{(V)} \\ 0 & \text{(if all other values are} < 0\text{)} \end{cases}$$

The labels D, H and V identify the neighbouring cell from which F(i,j) is reached: F(i-1,j-1) (D), F(i-1,j) (H) or F(i,j-1) (V).
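
The local variant differs from the global one only in the zero floor and in where the trace back starts and stops. A minimal sketch, assuming a linear gap penalty and illustrative match/mismatch scores; the function name `smith_waterman` is hypothetical and this is not the PredictFold source:

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment with a linear gap penalty d (illustrative scores)."""
    n, m = len(x), len(y)
    # First row and column stay 0: a local alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                    # restart the alignment
                          F[i - 1][j - 1] + s,  # D
                          F[i - 1][j] - d,      # H
                          F[i][j - 1] - d)      # V
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Trace back from the highest cell until a cell with value 0 is reached.
    ax, ay, i, j = [], [], bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("XXGATTACAYY", "ZZGATTACAWW"))
# (14, 'GATTACA', 'GATTACA')
```

The flanking letters in the example do not match anything, so only the shared core is reported, which is the behaviour that makes local alignment suitable for finding conserved subsequences.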

Page 38: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Web based server development

• Design the web page to get the data
• Use a cgi-bin or Perl script to parse the submitted data
• Invoke the corresponding program to get the appropriate results
• Send the results either by e-mail or to the web page directly

Page 39: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Application to Bioinformatics Tool Development

To predict a fold for a protein sequence

Page 40: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

To predict a fold for a protein sequence

To predict possible folds for a given protein sequence, whose structure is not known

To develop a fold recognition technique / tool that is sensitive in detecting folds of given protein sequences in the twilight zone (sequences sharing less than 25% identity)

Application of the fold recognition strategy to genomic annotation
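
The twilight-zone criterion can be checked once two sequences have been aligned. The sketch below assumes one common convention, identical positions divided by the number of alignment columns; other conventions exist, and the function names `percent_identity` and `in_twilight_zone` are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of alignment columns holding the same residue.

    Assumes a and b are two rows of one alignment (equal length, '-' for
    gaps) and divides by the number of columns; this is one of several
    conventions in use.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for p, q in zip(a, b) if p == q and p != "-")
    return 100.0 * matches / len(a)


def in_twilight_zone(a: str, b: str, threshold: float = 25.0) -> bool:
    """True if the aligned pair shares less than `threshold` % identity."""
    return percent_identity(a, b) < threshold


print(percent_identity("ACGT", "A-GT"))    # 75.0
print(in_twilight_zone("AAAAA", "ABCDE"))  # True
```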

Page 41: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

‘Twilight Zone’ sequences: Example

Cytochrome Sequences

256b
>256B:A CYTOCHROME $B562 (OXIDIZED) - CHAIN A
ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLEDKSPDSPEMKD
FRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR
>256B:B CYTOCHROME $B562 (OXIDIZED) - CHAIN B
ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLEDKSPDSPEMKD
FRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR

2ccy
>2CCY:A CYTOCHROME $C(PRIME) - CHAIN A
QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMVAKLAPIGWAK
GTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAAAKAGPDALKAQAAATGKVCKA
CHEEFKQD
>2CCY:B CYTOCHROME $C(PRIME) - CHAIN B
QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMVAKLAPIGWAK
GTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAAAKAGPDALKAQAAATGKVCKA
CHEEFKQD

Page 42: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Example: Sequence similarity

lalign output for 256b & 2ccy follows …

Page 43: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Example: Cytochrome Structures

[Figure: cytochrome structures 256b and 2ccy (seq. similarity 24%)]

Page 44: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Goals

Exploration of suitable fold recognition techniques that are sensitive in detecting similar folds despite low sequence similarity

Identification of functional motifs in proteins at sequence (1D) and structure (3D) level

Development of a protocol that aids in the rapid classification and annotation of genomic data based on functional motifs

Page 45: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Methodology

Reduction of 3D-structure to 1D-environment string. Environment at each residue position is a function of local secondary structure and extent of exposure to the solvent (based on 3D-1D profile method developed by Eisenberg et al., 1991).

Extract residue environment profiles of the available protein structures.

A scoring matrix is generated from a library of profiles. Each matrix element is the information value of a residue in the given environment.

A library of environment strings is created for the available protein fold structures.

The probe sequence is queried against this library to look for best matches.

Page 46: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Workflow

[Workflow diagram with the boxes: 3D-Profiles, Scoring Table, Fold library, New Sequence, 1D-Environment Sequence, PredictFold, FOLD PREDICTION, Annotate New sequence]

Page 47: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Residue Environments

[Figure: residue environments, labelled Exposed, Partially buried and Buried, and Helix, Coil and Strand]

Page 48: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Residue Environments

The residue environments are described by
1. the area (A) of the residue buried in the protein
2. the fraction (f) of side-chain area that is covered by polar atoms (O and N)
3. the local secondary structure

Page 49: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Residue Environments

CLASS             Area (A), Å²     Fraction (f)
BURIED 1 (B1)     A > 114          f < 0.45
BURIED 2 (B2)     A > 114          0.45 < f < 0.58
BURIED 3 (B3)     A > 114          f > 0.58
PARTIAL 1 (P1)    40 < A < 114     f < 0.67
PARTIAL 2 (P2)    40 < A < 114     f > 0.67
EXPOSED (E0)      A < 40           f > 0.67

Page 50: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Residue Environment classes

• 6 classes based on the extent of exposure to solvent
• 3 classes based on secondary structure: Alpha Helix (A), Beta Sheet (B) & Coil (C)
• Total: 6 x 3 = 18 environments
  – B1A, B1B, B1C, B2A, B2B, B2C, B3A, B3B, B3C
  – P1A, P1B, P1C, P2A, P2B, P2C, E0A, E0B, E0C
• For example
  – B1A: Buried 1, Alpha Helix
  – P2B: Partially Buried 2, Beta Sheet
  – E0C: Exposed 0, Coil
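
Assigning one of the 18 labels to a residue follows directly from the area and fraction thresholds of the Residue Environments table together with the secondary structure letter. The sketch below is an illustration only: the function names are hypothetical, the treatment of values that fall exactly on a threshold is an assumption (the table uses strict inequalities), and a residue with A < 40 is classed as exposed regardless of f, which is also an assumption.

```python
BURIAL_CLASSES = ("B1", "B2", "B3", "P1", "P2", "E0")
SECONDARY = ("A", "B", "C")  # Alpha Helix, Beta Sheet, Coil

# All 18 environment labels: B1A, B1B, ..., E0C.
ALL_ENVIRONMENTS = [b + s for b in BURIAL_CLASSES for s in SECONDARY]


def burial_class(area: float, fraction: float) -> str:
    """Map buried area A (in Å²) and polar fraction f to a burial class.

    Boundary handling (e.g. A exactly 114) is an assumption; the slide's
    table only gives strict inequalities.
    """
    if area < 40:
        return "E0"  # assumption: exposure is decided by the area alone
    if area <= 114:
        return "P1" if fraction < 0.67 else "P2"
    if fraction < 0.45:
        return "B1"
    if fraction < 0.58:
        return "B2"
    return "B3"


def environment(area: float, fraction: float, secondary: str) -> str:
    """Return the environment label, e.g. 'B1A' or 'P2B'."""
    if secondary not in SECONDARY:
        raise ValueError("secondary structure must be 'A', 'B' or 'C'")
    return burial_class(area, fraction) + secondary


print(environment(150.0, 0.30, "A"))  # B1A
print(environment(80.0, 0.70, "B"))   # P2B
print(environment(20.0, 0.90, "C"))   # E0C
print(len(ALL_ENVIRONMENTS))          # 18
```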

Page 51: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Scoring Table

The scoring table used in this case is a 20 x 18 matrix, constructed from a statistical analysis of the profile library (consisting of 1200 protein structures) provided by the PROFILES_3D module of Insight II (Accelrys Inc.).

The scores S_ij are calculated using the formula

$$S_{ij} = 100 \times \ln\left[\frac{P(i:j)}{P_i}\right]$$

where P(i : j) is the probability of finding residue i in the environment j and P_i is the overall probability of finding residue i in any environment.
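
As a worked illustration of the formula, the score can be estimated from residue counts in a profile library. The sketch assumes that P(i : j) is estimated as the share of residue i among all residues observed in environment j; the function name, argument names and counts are hypothetical and not taken from the library described above.

```python
import math


def information_score(n_ij: int, n_j: int, n_i: int, n_total: int) -> float:
    """Return S_ij = 100 * ln(P(i:j) / P_i) estimated from counts.

    n_ij    : occurrences of residue i in environment j
    n_j     : all residues observed in environment j
    n_i     : occurrences of residue i in any environment
    n_total : all residues in the library
    Estimating P(i:j) as n_ij / n_j is an assumption of this sketch.
    """
    p_i_in_j = n_ij / n_j
    p_i = n_i / n_total
    return 100.0 * math.log(p_i_in_j / p_i)


# Hypothetical counts: the residue is three times as frequent in this
# environment as it is overall, so the score is 100 * ln(3).
print(round(information_score(30, 100, 1000, 10000)))  # 110
```

A residue that is exactly as frequent in an environment as it is overall scores 0; under-represented residues score negative.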

Page 52: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Scoring Table

The scoring table contains a measure of the compatibility of the 20 amino acids with the 18 environmental classes.

The individual matrix elements are propensities (information values) for the amino acid residues.

Page 53: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Scoring Table

Class    A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
B1A    -77  -43 -248 -215  128 -222  -34  111 -137  130  126 -176 -156 -138 -180 -243 -172   74  111   27
B1B   -105  -45 -180 -159   96 -235 -226  150 -304  107   51 -218  -77 -203 -152 -256 -127  118   92   17
B1C    -54  -59 -263 -201  140 -278  -61   93 -278  106   91 -261   59  -84 -235 -299 -141  100   96   52
B2A    -65   15  -80  -58   87 -204   82   55  -94   71  102  -48  -97   16  -11 -133  -67   41  101   86
B2B   -152  -72 -208  -20  132 -222   -5  107  -83   36   49  -26  -86  -79  -41  -82 -114   71   83  130
B2C    -81  -62 -197 -113  104 -171   54   81 -212   77  100  -56   -7  -87  -44 -123 -103   66  162  114
B3A    -44 -138  -43    9  -22 -109   61   -2   56   16   87   -7 -111   16  110 -101  -69  -29   86   50
B3B   -108  -79  -76  -42   37 -229   80   26   35   14  -68  -33   -1   52   84  -71  -10   16    7  109
B3C    -87  -98  -83  -46   71  -61  104  -54    8   29   23    9  -11   10   71  -61  -48  -40  112  125
P1A     76   95  -28  -43  -85  -46  -91   -6  -50  -30  -42  -58  -41  -32  -51   47   39   30 -129  -88
P1B     59  128  -61  -68  -61  -22  -53    9 -201  -81  -40  -92  -65 -238  -89   49   95   44   34   -9
P1C     49  129   34  -59 -129  -39 -121  -28  -72  -33  -90  -26   64  -57  -88   59   55   -9 -125 -140
P2A     -2  -70   29   62 -135  -58   17  -59   66  -46  -27   -2  -25   62   56  -38  -13  -62 -109  -55
P2B    -52  -87   -3   41  -56  -87  -49  -35   55 -133  -76    0 -101   10   19   49   79    8  -71  -30
P2C    -25  -81   51   28  -84  -42   20  -94   47  -68  -83   51   44   25   24   17    8  -74  -42  -43
E0A     44  -17   44   60 -181   63   -6 -236    7 -137  -90   32    5   29  -20   16  -20 -125 -126 -170
E0B     14   -4  -30  -37  -83  175  -76 -139 -154 -160  -62    1  -88  -12 -112   65  -17 -166   81   -3
E0C     14  -35   23    4 -163  110  -41 -163  -10 -114 -130   41   25   -3  -41   34    8  -80 -206 -104
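
To show how the table is used, the sketch below scores a short residue string against an environment string position by position, without gaps. Only a handful of entries are copied from the table above; the dictionary layout and the function name `ungapped_score` are hypothetical, and the full method aligns with gaps using the Smith-Waterman recurrence rather than this simple sum.

```python
# A few entries copied from the scoring table above, stored as
# SCORES[environment][residue]; the layout is an illustrative assumption.
SCORES = {
    "B1A": {"L": 130, "F": 128, "D": -248},
    "E0A": {"E": 60, "D": 44, "I": -236},
    "E0B": {"G": 175},
}


def ungapped_score(residues: str, environments: list) -> int:
    """Sum the 3D-1D scores of residues placed in the given environments."""
    if len(residues) != len(environments):
        raise ValueError("need one environment per residue")
    return sum(SCORES[env][res] for res, env in zip(residues, environments))


envs = ["B1A", "E0A", "E0B"]
print(ungapped_score("LEG", envs))  # 365
print(ungapped_score("DIG", envs))  # -309
```

Leucine in a buried helix and glycine in an exposed strand are rewarded, while aspartate in a buried helix and isoleucine in an exposed helix are heavily penalised, which is the compatibility signal the method relies on.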

Page 54: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Fold Library

1565 functional forms
→ Scan PDB to identify all the structures having these folds
→ Identify a representative structure with resolution 2.5 Å or better
→ Quality of the structure (occupancy, R-factor, stereochemistry)
→ 968 chains

Page 55: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

DALI / FSSP Fold Library

DALI: http://www.ebi.ac.uk/dali

• Touring protein fold space with Dali/FSSP. Lisa Holm and Chris Sander, Nucleic Acids Research, (1998), 26, 316-319
• Mapping the Protein Universe. Lisa Holm and Chris Sander, Science, (1996), 273, 595-602

Page 56: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Sequence Comparison: Details

• Type of alignment: local (portions or subsequences matching), Smith-Waterman algorithm
• Scoring table: 3D-1D matrix
• Algorithm used: dynamic programming
• Alignment score: Z-score

Page 57: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Local Alignment: Smith-Waterman Algorithm

Formula

$$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\ F(i-1,\,j) - d & \text{(H)} \\ F(i,\,j-1) - d & \text{(V)} \\ 0 & \text{(if all other values are} < 0\text{)} \end{cases}$$

The labels D, H and V identify the neighbouring cell from which F(i,j) is reached: F(i-1,j-1) (D), F(i-1,j) (H) or F(i,j-1) (V).

Page 58: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Gap Penalties

• Linear score: f(g) = -gd
• Affine score: f(g) = -d - (g-1)e
  – d = gap open penalty
  – e = gap extend penalty
  – g = gap length
• Gap penalty values used: d = 500, e = 50
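
With the values quoted on this slide, the two penalty schemes can be compared numerically. A minimal sketch; the function names are hypothetical, and the defaults are the d = 500 and e = 50 given above.

```python
def linear_gap(g: int, d: int = 500) -> int:
    """Linear score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g: int, d: int = 500, e: int = 50) -> int:
    """Affine score f(g) = -d - (g-1)*e: d opens the gap, e extends it."""
    return -d - (g - 1) * e


for g in (1, 3, 10):
    print(g, linear_gap(g), affine_gap(g))
# 1 -500 -500
# 3 -1500 -600
# 10 -5000 -950
```

The affine scheme makes one long gap much cheaper than several short ones, which keeps insertions and deletions concentrated rather than scattered through the alignment.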

Page 59: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Local Alignment

• Trace back
  – An alignment can end anywhere in the matrix
  – Take the highest value of F(i,j) over the whole matrix and start the trace back from there
• Algorithm complexity
  – It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences

Page 60: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Significance of an Alignment Score

Statistical methods used to evaluate the significance of an alignment score: Z-score, P-value and E-value

• Z-score = (score – mean) / std. dev.
  – Measures how unusual the original match is
  – Z ≥ 5 is significant
• P-value measures the probability that the alignment is no better than random (Z and P depend on the distribution of the scores)
  – P ≤ 10^-100: exact match
• E-value is the expected number of sequences that give the same Z-score or better (E = P x size of the database)
  – E ≤ 0.02: sequences probably homologous
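
The Z-score and E-value definitions translate directly into code. In the sketch below the background scores are hypothetical (in practice they might come from aligning the probe against the rest of the library or against shuffled sequences, which is an assumption here), and using the population standard deviation is likewise an assumption.

```python
import statistics


def z_score(score: float, background: list) -> float:
    """Z = (score - mean) / std. dev. of the background scores.

    The population standard deviation is used; that choice is an
    assumption of this sketch.
    """
    mean = statistics.mean(background)
    std_dev = statistics.pstdev(background)
    return (score - mean) / std_dev


def e_value(p_value: float, database_size: int) -> float:
    """E = P x size of the database."""
    return p_value * database_size


background = [10, 12, 8, 10]              # hypothetical background scores
print(round(z_score(30, background), 3))  # 14.142
print(e_value(1e-5, 968) <= 0.02)         # True
```

A constant background gives a zero standard deviation and a division by zero, so a real tool needs a guard for that case.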

Page 61: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Benchmarking

All 968 proteins in the fold library were profiled against each of the other members

A histogram of the rank obtained by the self score, against the number of sequences obtaining that rank, is shown in the figure.

Page 62: Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Benchmarking

[Figure: histogram of No. of Sequences (y-axis, 0 to 1000) against Rank (x-axis, bins 1 to 8); bar values 797, 63, 3, 3, 2, 17, 29, 54]


Benchmarking

Report
  797 retain the self as the highest score
  63 report the self to have the second highest score
  There were about 100 proteins that have ranks between 5 and 100.

Limitations
  Prediction is restricted to the 968 folds in the library
  The algorithm is insensitive to partially folded sequences
  Specific to globular proteins and not for membrane proteins
  Sequences that fold in the presence of cofactors and ligands are not accounted for
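The benchmarking count can be illustrated with a short sketch. It assumes an all-against-all score table in which scores[i][j] is the score of sequence i against the profile of library member j; that layout, the function names and the toy numbers are all assumptions for illustration, not the original benchmarking code.

```python
from collections import Counter


def self_rank(scores, i):
    """Rank of the self score scores[i][i] within row i (1 = highest score)."""
    own = scores[i][i]
    return 1 + sum(1 for s in scores[i] if s > own)


def rank_histogram(scores):
    """Number of sequences whose self score obtained each rank."""
    return Counter(self_rank(scores, i) for i in range(len(scores)))


if __name__ == "__main__":
    # Toy 3 x 3 table: only the first sequence scores best on its own profile.
    scores = [
        [90, 40, 10],
        [55, 50, 20],
        [30, 70, 60],
    ]
    print(rank_histogram(scores))
```

Applied to the full 968 x 968 table, the count for rank 1 would correspond to the 797 reported above.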


Web based server development

Design the web page to get the data
Use cgi-bin or Perl script to parse the submitted data
Invoke the corresponding program to get the appropriate results
Send the results either by e-mail or to the web page directly
Prepare a ‘user manual’ to describe the salient features of the server
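The parse, invoke and report steps can be sketched in a few lines. The slide mentions cgi-bin or Perl; Python is used here purely for illustration. The form field names (sequence, email), the function names and the predictor callable standing in for the actual prediction program are all hypothetical.

```python
from urllib.parse import parse_qs

# The 20 standard amino acid one-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_submission(body):
    """Parse a URL-encoded form body and return (sequence, email)."""
    fields = parse_qs(body)
    raw = fields.get("sequence", [""])[0]
    sequence = "".join(raw.split()).upper()   # drop whitespace and line breaks
    if not sequence or not set(sequence) <= VALID_RESIDUES:
        raise ValueError("sequence is empty or contains invalid residues")
    email = fields.get("email", [None])[0]
    return sequence, email


def handle_request(body, predictor):
    """Parse the submitted data, invoke the program, and build the report."""
    sequence, email = parse_submission(body)
    result = predictor(sequence)              # stand-in for the real program
    report = f"Sequence length: {len(sequence)}\nResult: {result}\n"
    return report, email                      # caller sends by e-mail or to the page


if __name__ == "__main__":
    body = "sequence=acde%0Afgh&email=a%40b.org"
    print(handle_request(body, lambda seq: "example-fold"))
```

Validating the sequence before invoking the program keeps malformed input away from the prediction step.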


Conclusions

PredictFold – A program to predict possible folds for a new protein sequence based on the 3D-1D profile method

Benchmarking results show the reliability of the method

There is considerable scope for further improvement


Future Directions

To update the fold library by including more known folds

To also use the predicted secondary structure information of the given sequence

To optimise the source code for efficient, automatic handling of genome sequences

To combine results from other algorithms (ORF, HMM, etc.) to detect remote homologs

To develop & maintain a web-based server for fold recognition


BT versus IT

Bioinformatics, including Biotechnology (BT), requires a lot of Information Technology (IT) skills for genomic annotation projects

Bioinformatics is also one of the potential areas for IT professionals

Genome Projects will be the next huge task for IT industries (like the Y2K problem in the past)

BT will take on IT soon … in the near future …


Conclusions

Developing Web based Bioinformatics tools
  Develop/modify useful algorithms
  Generate computer source code
  Create/Maintain Web based server

Using existing Web based tools efficiently

Ethical issues
  Bioethics & Biosafety: Always ensure that you have neither developed nor used any bioinformatics tool harmful to environment & society
  Cloning of human, Terminator technology, GM Food, etc.




References

3D-1D Profile method
  J.U. Bowie, R. Lüthy & D. Eisenberg, Science, 253, 164-170 (1991).

Ostensible Recognition of Folds (ORF) method
  Rajeev Aurora and George D. Rose, Proc. Natl. Acad. Sci. (USA), 95(6), 2818-2823 (1998).

Superfamily Hidden Markov Model (SHMM) method
  A. Krogh, M. Brown, I.S. Mian, K. Sjolander and D. Haussler, J. Mol. Biol. 235(5), 1501-31 (1994).


Important Bioinformatics Resources

Databases & Tools
  NCBI, NIH - www.ncbi.nlm.nih.gov
  EMBL, EBI - www.ebi.ac.uk
  ExPasy, Swiss - www.expasy.org
  DDBJ - www.ddbj.nig.ac.jp
  PDB - www.rcsb.org/pdb

Software
  Accelrys - www.accelrys.com/products (GCG, Insight II, Cerius II, Discovery Studio)
  TCS - www.atc.tcs.co.in/biosuite/ (BIOSUITE)
  Jalaja Technologies - www.jalaja.com (GENOCLUSTER)


Thank You