effect of gap penalty on local alignment score:score: 161 at (seq1)[2..36] : (seq2)[53..90] 2...

35
Effect of gap penalty on Local Alignment Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | ||||||||||||||||||||||||||| 53 ASSVSVGATEA-EPTEVFMDLWPEDHSNWQELSPLEPSD Score: 156 at (seq1)[10..36] : (seq2)[64..90] 10 EPTEVFMDLWPEDHSNWQELSPLEPSD ||||||||||||||||||||||||||| 64 EPTEVFMDLWPEDHSNWQELSPLEPSD Blosum 62 Gap: - 15 Ex: -3 Gap: -5 Ex: - 1

Upload: eleanore-bishop

Post on 11-Jan-2016

227 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Effect of gap penalty on Local Alignment

Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | ||||||||||||||||||||||||||| 53 ASSVSVGATEA-EPTEVFMDLWPEDHSNWQELSPLEPSD

Score: 156 at (seq1)[10..36] : (seq2)[64..90] 10 EPTEVFMDLWPEDHSNWQELSPLEPSD ||||||||||||||||||||||||||| 64 EPTEVFMDLWPEDHSNWQELSPLEPSD

Blosum 62

Gap: -15

Ex: -3

Gap: -5 Ex: - 1

Page 2: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Finding the most “profitable” route from a complete map

M A T W L I . .G1

M S11

A S22 s32 S42

W s33 S43

T S44 S54

VA

Bioinformatics (Mount) p65-74

Needleman –Wunsch Algorithm

Smith-Waterman Algorithm

Page 3: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Smith-Waterman vs. Heuristic approach

M A T W L I

G1 .. .. .. .. ..

M .. S11 .. .. .. .. ..

A .. s12 S22 s32 S42 .. ..

W .. .. .. s33 S34 .. ..

T .. .. .. .. .. .. ..

V .. .. .. .. .. .. ..

A .. .. .. .. .. .. ..

Smith-Waterman Algorithm

Finding the best alignment based on complete calculation of route map

BLAST, FASTA

Try to find the best alignment based on sounding assumptions.

Rule of thumb:

• A high similarity core

• Often without gap

Page 4: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

BLAST – Basic Local Alignment Search Tool

It is based on local alignment, -- highest score is the only priority in terms of finding alignment match.

-- determined by scoring matrix, gap penalty

It is optimized for searching large data set instead of finding the best alignment for two sequences

Page 5: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

BLAST – Basic Local Alignment Search Tool

1. A high similarity core (2-4aa)

2. Often without gap

Query: M A T W L I . Word : M A T A T W T W L

W L I

1. For each word, find matches with Score > T.

2. Extend the match as long as profitable.

- High Scoring segment Pair (best local alignment)

3. Find the P and E value for HSP(s) with Score > cut off*.

* Cut off value can be automatically calculated based on E

Page 6: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

BLAST – Basic Local Alignment Search Tool

The P and E value for HSP(s) : based on the total score (S) of the identified “best” local alignment.

P (S) : the probability that two random sequence, one the length of the query and the other the entire length of the database, could achieve the score S.

E (S) : The expectation of observing a score >= S in the target database.

For a given database, there is a one to one correspondence between S and E(s) -- choosing E determines cut off score

Page 7: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

BLAST – Basic Local Alignment Search Tool

BLASTNBLASTPTBLASTN

compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.BLASTX  compares a nucleotide query sequence translated in all reading frames against a protein sequence database TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx program cannot be used with the nr database on the BLAST Web page.

Page 8: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

BLAST – Advanced options : all adjustable in stand alone BLAST

-F Filter query sequence [String] default = T

-M Matrix [String] default = BLOSUM62

-G  Cost to open gap [Integer] default = 5  for nucleotides 11

proteins

-E  Cost to extend gap [Integer] default = 2 nucleotides 1 proteins

-q Penalty for nucleotide mismatch [Integer] default = -3

-r reward for nucleotide match [Integer] default = 1

-e expect value [Real] default = 10

-W wordsize [Integer] default = 11 nucleotides 3 proteins

-T Produce HTML output [T/F] default = F

…..

Page 9: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Smith-Waterman vs. Heuristic approach

Smith-Waterman Algorithm

Finding the best alignment based on complete route map

BLAST, FASTA

Try to find the best alignment based on experience/knowledge

Search result Search result

Difference ?

- For real good matches, almost no difference

- For marginal similarity and exceptional cases, the difference may matter.

Page 10: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Overview of homology search strategy

1.) Where should I search?

• NCBI

Has pretty much every thing that has been available for some time

• Genome projects

Has the updated information (DNA sequence as well as analysis result)

Page 11: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Overview of homology search strategy

2.) Which sequence should I use as the query?

• Protein• cDNA•

Genomic

Test

Practice: identify potential orthologs using either cDNA or protein sequence.

Page 12: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Overview of homology search strategy 2.) Which sequence should I use as the query?

cDNA (BlastN)

Protein (TblastN)

Page 13: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Overview of homology search strategy

2.) Which sequence should I use as the query?

Protein v.s cDNA

Searching at the protein level is much more sensitive

query: S A Lquery: TCT GCA TTG

target: S A Ltarget: AGC GCT CTA

Protein: ~ 5%

Nucleotide: ~ 25%

Base level identityProtein: 100%

Nucleotide: 33%

Page 14: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Overview of homology search strategy

2.) Which sequence should I use as the query?

If you want to identify similar feature at the DNA level. Be Cautious with genomic sequence initiated search

• Low complexity region

• repeats

Page 15: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Overview of homology search strategy

3.) Which data set should I search?

• Protein sequence (known and predicted) blastP, Smith_Waterman

• Genomic sequence

TblastN• EST

TblastN• Predicted genes

TblastN

Page 17: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Overview of homology search strategy

5.) How do I judge the significance of the match ?

• P-value, E -value• Alignment • Structural / Function information

Page 18: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Overview of homology search strategy

6.) How do I retrieve related information about the hit(s) ?

• NCBI is relatively easy

The scope of information collection can be enlarged by searching (linking) multiple databases (links). example• genome projects often have their own

interface and logistics (ie. Ensemble, wormbase, MGI, etc. )

Page 19: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Overview of homology search strategy

7.) How to align (compare) my query and the hits ?

• Global alignment• Local alignment

ClustalW/ClustalX

Page 20: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Situations where generic scoring matrix is not suitable

Short exact match

Specific patterns

Page 21: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

After Blast – which one is “real”?

Page 22: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Position –specific information about conserved domains is

IGNORED in single sequence –initiated search

BID_MOUSE SESQEEIIHN IARHLAQIGDEM DHNIQPTLVRBAD_MOUSE APPNLWAAQR YGRELRRMSDEF EGSFKGLPRPBAK_MOUSE PLEPNSILGQ VGRQLALIGDDI NRRYDTEFQNBAXB_HUMAN PVPQDASTKK LSECLKRIGDEL DSNMELQRMIBimS EPEDLRPEIR IAQELRRIGDEF NETYTRRVFAHRK_HUMAN LGLRSSAAQL TAARLKALGDEL HQRTMWRRRAEgl-1 DSEISSIGYE IGSKLAAMCDDF DAQMMSYSAH

BID_MOUSE SESQEEIIHN IARHLAQIGDEM DHNIQPTLVR

sequence X SESSSELLHN SAGHAAQLFDSM RLDIGSTAHRsequence Y PGLKSSAANI LSQQLKGIGDDL HQRMMSYSAH

Why a BLAST match is refused by the family ?

Page 23: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Specific patterns

1. DNA pattern – Transcription factor binding site.

2. Short protein pattern – enzyme recognition sites.

3. Protein motif/signature.

Page 24: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Binary patterns for protein and DNA

• Caspase recognition site:

[EDQN] X [^RKH] D [ASP]

Examples:

Observe: Search for potential caspase recognition sites with

BaGua

Page 25: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Searching for binary (string) patterns

Seq: A G G G C T C A T G A C A

G R C WG A C A G T

G R C WG A C A G T

G R C WG A C A G T

Positive match

Page 26: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Does binary pattern conveys all the information ?

• Weighted matrix / profile

• HMM model

For searching protein domains

Page 27: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

What determine a protein family?

• Structural similarity

• Functional conservation

Page 28: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Practice: motif analysis of protein sequence using ScanProsite and

Pfam

1. Open two taps for Pfam, input one of the Blast hits and one candidate TNF to each data window.

2. Compare the results

Page 29: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Scan protein for identified motifs

• A service provided by major motif databases such as Prosite,, Pfam, Block, etc.

• Protein family signature motif often indicates structural and function property.

• High frequency motifs may only have suggestive value.

Page 30: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

What is the possible function of my protein?Which family my protein belongs to?

-- Profile databases

• Pfam (http://pfam.wustl.edu/ )• Prosite (http://us.expasy.org/prosite/ )• IntePro (http://www.ebi.ac.uk/interpro/index.html )• Prints (

http://bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html )

Page 31: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Protein motif /domain

• Structural unit• Functional unit• Signature of protein family

How are they defined?

Page 32: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

How do we represent the position specific preference ?

BID_MOUSE I A R H L A Q I G D E MBAD_MOUSE Y G R E L R R M S D E FBAK_MOUSE V G R Q L A L I G D D IBAXB_HUMAN L S E C L K R I G D E L BimS I A Q E L R R I G D E FHRK_HUMAN T A A R L K A L G D E LEgl-1 I G S K L A A M C D D F

Binary pattern: L [GSC][HEQCRK]

X

[^ILMFV]

Basic concept of motif identification 2.

Page 33: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

How do we represent the position specific preference ?

BID_MOUSE I A R H L A Q I G D E MBAD_MOUSE Y G R E L R R M S D E FBAK_MOUSE V G R Q L A L I G D D IBAXB_HUMAN L S E C L K R I G D E L BimS I A Q E L R R I G D E FHRK_HUMAN T A A R L K A L G D E LEgl-1 I G S K L A A M C D D F

Statistical representation

G: 5 -> 71%

S: 1 -> 14 %

C: 1 -> 14 %

Basic concept of motif identification 2.

Page 34: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Scoring sequence based on Model

Seq: A S L D E L G D E

1 2 3 4A 1 0-4 .C 2 5-1 .D 1-3 9 .… . . . .

score @ position_1 = s(A/1) + s(S/2) + s(L/3) + s(D / 4)

An example of position specific matrix

Page 35: Effect of gap penalty on Local Alignment Score:Score: 161 at (seq1)[2..36] : (seq2)[53..90] 2 ASTV----TSCLEPTEVFMDLWPEDHSNWQELSPLEPSD || | | |||||||||||||||||||||||||||

Representation of positional information in specific motif

M-C-N-S-S-C-[MV]-G-G-M-N-R-R.

Binary patterns:

Positional matrix:

-2.499 -2.269 -5.001 -4.568 -2.418 -4.589 -3.879 1.971 -4.330 1.477 -1.241 -4.221 -4.590 -4.097 -4.293 -3.808 0.495 2.545 -3.648 -3.265

1.627 -2.453 -1.804 -1.746 -3.528 2.539 1.544 -3.362 -1.440 -3.391 -2.490 -1.435 -3.076 -1.571 0.501 0.201 -1.930 -2.707 -3.473 -3.024

-1.346 -2.872 -1.367 0.699 -2.938 -2.427 -0.936 -2.632 -0.095 1.147 -1.684 -1.111 -2.531 1.174 2.105 1.057 -1.400 -2.255 -2.899 -2.260

1.045 1.754 -1.169 -0.576 -2.756 -2.212 1.686 -2.576 0.951 -2.438 -1.544 0.857 -2.301 1.891 1.556 -1.097 -1.180 -2.155 -2.751 -2.060

-3.385 -2.965 -5.039 -4.313 -1.529 -5.006 -3.577 0.429 -4.094 3.154 -0.121 -4.440 -4.199 -3.292 -3.662 -4.198 -3.281 -1.505 -3.043 -3.120

2.368 -3.197 -2.285 -1.533 -3.721 -2.945 1.815 -3.235 0.067 -3.061 -2.259 -1.680 -3.231 1.195 2.287 -2.009 -2.044 -2.825 -3.324 -2.844

1.046 1.742 0.576 -0.734 -2.072 -2.234 -0.851 0.436 -0.548 -0.129 -0.974 -1.039 -2.318 2.368 0.667 -1.135 -1.076 -1.304 -2.398 -1.821

0.715 -1.778 -3.820 -3.359 -1.535 -3.463 -2.571 3.060 -3.008 0.262 1.566 -2.996 -3.575 0.450 -3.001 -2.571 -1.765 0.193 -2.389 -2.068

-2.053 0.965 -2.767 -3.509 -4.520 3.654 -3.242 -4.548 -3.338 -4.968 -3.789 -2.462 -4.048 -3.738 -3.266 -2.596 -3.573 -3.990