
Sequence Analysis

Hemant Kelkar

Center for Bioinformatics

University of North Carolina

Chapel Hill, NC 27599

Scope of Series

Talk I

• Overview and BLAST

Talk II

• Protein analysis/Sequence Alignment

Talk III

• Evolution

• Genomics and challenges

Bioinformatics

• Mathematical, statistical, and computational methods used to solve biological problems

• Glue that holds the “omics” data together

Help …

• Is “my sequence” in the databases?

• Is it similar to any sequence in the DB?

• Does it have any known motifs/domains that can help in identification?

• Is there a structural homolog?

• Are there any polymorphisms?

• Genetic map location?

Bioinformatics TOOLS!

Bioinformatics Tools

• Genetic Code

• Protein Structure

• Protein Evolution

Similarity search e.g. BLAST, FASTA

http://restools.sdsc.edu/biotools/biotools9.html

e.g. CLUSTALW, T-COFFEE, Phylip

Primary Sequence Databases

• GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html)

• PIR (http://pir.georgetown.edu/)

• Swiss-Prot (http://us.expasy.org/sprot/)

Sequence information as is generated in the laboratory


Derived Sequence Databases

• Pfam (http://www.sanger.ac.uk/Software/Pfam/) : Protein families based on hidden Markov models (HMMs)

• InterPro (http://www.ebi.ac.uk/interpro/) : Protein families and domains based on functional sites

• TRANSFAC (http://www.gene-regulation.com/) : Transcription factor database

• Cytochrome P450 database (http://drnelson.utmem.edu/CytochromeP450.html)

Databases based on functional or phylogenetic analysis

Derived Sequence Databases

• Flybase (http://www.flybase.org/) : Fly Genome

• Wormbase (http://www.wormbase.org/) : C. elegans

• Genome Browser (http://genome.ucsc.edu/) : Human and Mouse

• MGI (http://www.informatics.jax.org/) : Mouse

• Microbial Genome Resource (http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl)

Databases based on taxonomy

Sequence Alignments

• Provide a measure of relatedness between nucleotide or protein sequences

• This allows us to decipher:

Structural relationships

Functional relationships

Evolutionary relationships

Sequence Similarity Searches

• Information conserved evolutionarily

• DNA sequences NOT coding for proteins/rRNAs diverge rapidly

• When possible use protein sequences for similarity searches

• Non-homologous protein identification is much less reliable

• What is measured and what is inferred?

Similarity

• Is always based on an observable

• Usually expressed as % identity

• Quantifies the divergence of two sequences

• substitutions/insertions/deletions

• Residues crucial for structure and/or function
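As a hedged illustration of percent identity as an observable, the sketch below counts identical columns in a pair of sequences that are already aligned. The function name `percent_identity` and the choice of denominator (columns where neither sequence has a gap) are assumptions made here; other denominators, such as the full alignment length, are also in common use. The example pair is the Query/Sbjct alignment that appears later in these slides.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns reflect insertions/deletions, not substitutions
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


query = "SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE"
sbjct = "TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS"
print(f"{percent_identity(query, sbjct):.1f}% identity")
```

The number is a measurement on one particular alignment; whether the two sequences are homologous is an inference drawn from it.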

Homology

• Homology always implies that the molecules share a common ancestor

• Absolute answer

• Molecules ARE or ARE NOT homologous

• No degrees

How to Find Similar Sequences

• Global Sequence Alignments

• Sequence comparison along entire length

• Homolog of similar length

• Local Sequence Alignments

• Similar regions in two sequences

• Regions outside the local alignment excluded

• Sequences of different length/similarity
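The difference between global and local alignment can be sketched with one dynamic-programming routine. This is a minimal illustration, not the implementation used by any particular tool: the function name `align_score`, the linear gap cost of -6, and the +5/-4 match/mismatch scores are assumptions chosen for brevity. With `local=False` it follows the Needleman-Wunsch scheme (whole sequences, end to end); with `local=True` it follows the Smith-Waterman scheme (best-scoring sub-regions only).

```python
def align_score(a: str, b: str, local: bool,
                match: int = 5, mismatch: int = -4, gap: int = -6) -> int:
    """Best alignment score by dynamic programming with a linear gap cost."""
    cols = len(b) + 1
    # Global alignments pay for leading gaps; local alignments start fresh at 0.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)    # a local alignment may restart anywhere
                best = max(best, score)  # and may end anywhere
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


a, b = "TTTACGTTTT", "GGACGTGG"
print("global:", align_score(a, b, local=False))
print("local: ", align_score(a, b, local=True))
```

The two sequences share only a short core, so the local score reflects that core alone while the global score is pulled down by the unrelated flanks.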

Dotplot

Scoring Matrices

• Empirical weighting schemes

• Considers important biology

• Side chain chemistry/structure/function

• Functional/Structural Conservation

• Ile/Val – small and hydrophobic

• Ser/Thr – both polar

• Size/Charge/Hydrophobicity

Nucleotide Matrix

     A    C    G    T
A    5   -4   -4   -4
C   -4    5   -4   -4
G   -4   -4    5   -4
T   -4   -4   -4    5
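To show how such a matrix is used, here is a small sketch that scores an ungapped alignment by summing the matrix entry for each aligned pair. The names `NUC_MATRIX` and `ungapped_score` are illustrative, not taken from any library.

```python
BASES = "ACGT"
# Nucleotide matrix from the slide: +5 for a match, -4 for any mismatch.
NUC_MATRIX = {(x, y): (5 if x == y else -4) for x in BASES for y in BASES}


def ungapped_score(a: str, b: str) -> int:
    """Sum the matrix scores over each aligned pair of bases."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length for an ungapped score")
    return sum(NUC_MATRIX[x, y] for x, y in zip(a.upper(), b.upper()))


# Five matches and one mismatch: 5 * 5 - 4.
print(ungapped_score("ACGTAC", "ACGTTC"))
```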

PAM Scoring Matrices

• Margaret Dayhoff (1978)

• Point accepted mutations (PAM)

• Patterns of substitutions in highly related proteins (>85% identical), based on multiple sequence alignments

• New side chains must function similarly

• 1 PAM = 1 amino acid change per 100 amino acids

• 1 PAM ~ 1 % Divergence

BLOSUM Matrices

• Henikoff and Henikoff (1992)

• Blocks Substitution Matrices

• Differences in conserved ungapped regions

• Directly calculated; no extrapolations

• Sensitive to structural/functional substitutions

• Generally perform better for local similarity searches

Scoring Matrix – BLOSUM62

BLOSUM n

• Calculated from sequences sharing no more than n% identity

• Sequences with more than n% identity are clustered, and each cluster is weighted as a single sequence

• Reducing the value of “n” yields matrices suited to more divergent, distantly related sequences

• BLOSUM62 is used as the default by many of the online search sites

Matrices and more

PAM Matrices (Altschul, 1991)

Matrix      Use                               Similarity
PAM 40      Short alignments                  >70%
PAM 120                                       >50%
PAM 250     Longer, weaker local areas        >30%

BLOSUM Matrices (Henikoff, 1993)

Matrix      Use                               Similarity
BLOSUM 90   Short alignments                  >60%
BLOSUM 80                                     >50%
BLOSUM 62   Commonly used                     >35%
BLOSUM 30   Longer, weaker local alignments

Gaps

• Compensate for insertions and deletions

• Improve alignments

• Must be kept to a reasonably small number

• 1 gap per 20 residues is a reasonable limit

• Need a different scoring scheme

Gap Penalties

• Penalty for gap introduction

• Penalty for Gap extension

Deduction for a gap = G + Ln

where                          Nuc   Prot
G = gap-opening penalty         5     11
L = gap-extension penalty       2      1
n = length of gap
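A short worked sketch of the gap deduction G + Ln follows. It reads the slide's values as G = 5, L = 2 for nucleotides and G = 11, L = 1 for proteins; the names `GAP_PARAMS` and `gap_deduction` are hypothetical.

```python
# (G, L) pairs as read from the slide: gap-opening and gap-extension penalties.
GAP_PARAMS = {"nuc": (5, 2), "prot": (11, 1)}


def gap_deduction(n: int, kind: str = "prot") -> int:
    """Deduction for a single gap of length n: G + L * n."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    g, l = GAP_PARAMS[kind]
    return g + l * n


for n in (1, 3, 10):
    print(n, gap_deduction(n, "nuc"), gap_deduction(n, "prot"))
```

Because the opening penalty is paid once per gap, one gap of length 3 costs less than three separate gaps of length 1, which favours a few longer gaps over many short ones.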

BLAST

• Basic Local Alignment Search Tool

• Seeks high-scoring segment pair (HSP)

• Sequences that can be aligned w/o gaps

• Have a maximal aggregate score

• Score must be above score threshold S

• Many HSPs reported for ungapped BLAST

BLAST Algorithms

Program    Query                   Target
BLASTN     Nucleotide              Nucleotide
BLASTP     Protein                 Protein
BLASTX     Nucleotide (6-frame)    Protein
TBLASTN    Protein                 Nucleotide (6-frame)
TBLASTX    Nucleotide (6-frame)    Nucleotide (6-frame)
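The "6-frame" entries refer to translating a nucleotide sequence in three reading frames on each strand. The sketch below shows one way to do that with the standard genetic code; the function names are illustrative, and the handling of ambiguous bases (reported as 'X') and trailing partial codons (dropped) are simplifying assumptions.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons only; unrecognised codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[int, str]:
    """Frames +1..+3 on the given strand, -1..-3 on the reverse complement."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frames("ATGGCCATTGTAATGGGCCGCTGA").items()):
    print(f"{frame:+d}: {protein}")
```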

Neighborhood Words

Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE

Query word (W = 3): STL

Neighborhood words and scores:

STL 13 (= 4 + 5 + 4)
SAL 8
SNL 8
SVL 8
SBL 7
SCL 7
SDL 7
etc.

Neighborhood score threshold (T = 8)
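The neighborhood of the query word STL can be reproduced with a brute-force sketch: score every three-letter word against STL and keep those scoring at least T. Only the three BLOSUM62 rows needed for S, T, and L are included here to keep the example short; the function name `neighborhood` is hypothetical, and real BLAST implementations build this list far more efficiently.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# BLOSUM62 rows for S, T, and L only (columns in AMINO_ACIDS order).
BLOSUM62_ROWS = {
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "L": [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1],
}


def neighborhood(word: str, threshold: int) -> dict[str, int]:
    """All words of the same length scoring >= threshold against `word`."""
    rows = [dict(zip(AMINO_ACIDS, BLOSUM62_ROWS[ch])) for ch in word]
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(row[c] for row, c in zip(rows, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


words = neighborhood("STL", threshold=8)
for w in ("STL", "SAL", "SNL", "SVL"):
    print(w, words[w])
```

Words such as SCL and SDL score 7 and therefore fall outside the neighborhood at T = 8. The ambiguity code B in SBL is not part of the 20-letter alphabet used in this sketch.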

High-Scoring Segment Pairs

Neighborhood words: STL 13, SAL 8, SNL 8, SVL 8, SBL 7, SCL 7, SDL 7, etc.

Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS

Extension

Significance Decay

• Mismatches

• Gap penalties

[Figure: cumulative score plotted against extension length, with the levels T, S, and X marked]

Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS
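The extension step can be sketched as an ungapped walk outward from a word hit that stops once the cumulative score has fallen more than X below the best score seen so far. This is a simplified illustration under stated assumptions: it uses the +5/-4 nucleotide scores from the nucleotide matrix rather than BLOSUM62, the drop-off value X = 10 is arbitrary, and the function names are hypothetical.

```python
MATCH, MISMATCH = 5, -4  # nucleotide scores from the matrix slide


def extend_one_way(query, subject, q, s, step, x_drop):
    """Extend in one direction; return (best score gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += MATCH if query[q] == subject[s] else MISMATCH
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has decayed more than X below the best: stop
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, q_pos, s_pos, word_len, x_drop=10):
    """Grow an ungapped segment pair from a word hit in both directions."""
    seed = sum(MATCH if query[q_pos + i] == subject[s_pos + i] else MISMATCH
               for i in range(word_len))
    right, r_len = extend_one_way(query, subject,
                                  q_pos + word_len, s_pos + word_len, 1, x_drop)
    left, l_len = extend_one_way(query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    q_lo, s_lo = q_pos - l_len, s_pos - l_len
    span = l_len + word_len + r_len
    return seed + left + right, query[q_lo:q_lo + span], subject[s_lo:s_lo + span]


print(extend_seed("CCCCACGTACGTGGGG", "TTTTACGTACGTAAAA", 4, 4, 4))
```

A larger X lets the extension ride through a mismatch to reach further matches; a smaller X stops sooner and returns a shorter segment.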

Karlin Altschul Equation

E = kmN e^(-λs)

m Number of letters in query

N Number of letters in db

mN Size of search space

λs Normalized score

k minor constant
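A small numerical sketch of the equation follows. The values of k and λ below are illustrative assumptions only; the actual values depend on the scoring matrix and gap penalties and are reported by the search program. The bit-score conversion S' = (λs - ln k) / ln 2 is the usual normalization, under which E = mN · 2^(-S').

```python
import math


def expect_value(score, m, n, k, lam):
    """Karlin-Altschul expectation: E = k * m * N * exp(-lambda * s)."""
    return k * m * n * math.exp(-lam * score)


def bit_score(score, k, lam):
    """Normalized (bit) score: S' = (lambda * s - ln k) / ln 2."""
    return (lam * score - math.log(k)) / math.log(2)


# Illustrative parameters (assumptions, not values from the slides).
K, LAMBDA = 0.041, 0.267
m, N = 250, 50_000_000  # letters in the query and in the database

for s in (40, 80, 120):
    bits = bit_score(s, K, LAMBDA)
    e = expect_value(s, m, N, K, LAMBDA)
    print(f"s = {s:3d}   {bits:6.1f} bits   E = {e:.2e}")
```

E falls exponentially as the score rises and grows in direct proportion to the search space mN, so the same raw score is less significant against a larger database.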

http://www.ncbi.nlm.nih.gov
