Multiple Sequence Alignment

Based on slides by Irit Gat-Viks

Post on 15-Jan-2016


Example

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Why multiple sequence alignment

• Structural similarity: amino acids that play the same role in each structure are in the same column

• Evolutionary similarity: amino acids related to the same ancestor are in the same column

• Functional similarity: amino acids with the same function are in the same column

• When the sequences are closely related, structural, evolutionary and functional similarity are equivalent

Multiple Alignment Definition

CG © Ron Shamir 06

Input: Sequences S1, S2, ..., Sk over the same alphabet
Output: Gapped sequences S'1, S'2, ..., S'k of equal length, such that

1. |S'1| = |S'2| = ... = |S'k|

2. Removal of spaces from S'i gives Si, for all i
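
The two conditions of the definition can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_valid_alignment` and the use of `-` as the space character are assumptions made for illustration.

```python
def is_valid_alignment(originals, gapped, gap="-"):
    """Check the two conditions of the multiple alignment definition."""
    if len(originals) != len(gapped):
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(row) for row in gapped}) > 1:
        return False
    # Condition 2: removing the spaces from S'_i gives back S_i.
    return all(row.replace(gap, "") == seq for seq, row in zip(originals, gapped))


# Illustrative input: three short DNA sequences and one gapped version of them.
seqs = ["AGGTC", "GTTCG", "TGAAC"]
rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
print(is_valid_alignment(seqs, rows))
```

A gapped row that adds or drops a letter fails condition 2, and rows of unequal length fail condition 1.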

Example

S1 = AGGTC
S2 = GTTCG
S3 = TGAAC

Possible alignment:

A G G T - C -
- G - T T C G
T G - A A C -

Possible alignment:

A G G T - C -
G T T - - C G
- T G A A A C

Example

[Figure: multiple sequence alignment of 7 neuroglobins using ClustalX]

Human-centric beta globin Multiple Alignment

[Figure: beta globin alignment, http://globin.cse.psu.edu]

MSA applications

• Generate protein families
• Extrapolation: decide whether an uncharacterized sequence belongs to a protein family
• Understand evolution: a preliminary step in molecular evolution analysis for constructing phylogenetic trees (e.g., is the duck evolutionarily closer to a lion or to a fruit fly?)
• Pattern identification: find the important (conserved) regions in the protein; conserved positions may characterize a function
• Domain identification: build a consensus, profile or motif that describes the protein family and helps to describe new members of the family
• DNA regulatory elements
• Structure prediction (secondary structure and 3D model)
• Alignment of multiple sequences may reveal weak signals

Protein Phylogenies - Example

[Figure: protein phylogeny built from the kinase domain]

Scoring alignments

• Given input sequences S1, S2, ..., Sk, find a multiple alignment of optimal score

• Scores preview:
  - Sum of pairs
  - Consensus
  - Tree

• Varying methods (and controversy)

Sum of Pairs score

Def: Induced pairwise alignment: a pairwise alignment induced by the multiple alignment, obtained by keeping two rows and dropping the columns in which both rows have a gap.

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

$S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$
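
The projection onto two rows can be sketched as follows. This is an illustrative helper, not code from the slides; the name `induced_pairwise` and the `-` gap symbol are assumptions.

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project a multiple alignment onto two of its rows.

    Columns in which both rows hold a gap are dropped; all other
    columns are kept in their original order.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
for first, second in ((x, y), (x, z), (y, z)):
    print(induced_pairwise(first, second))
```

The sum-of-pairs score S(M) is then the sum of the pairwise scores of all such projections.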

SOP Score Example

Consider the following alignment:

AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match = 0, mismatch/indel = -1

SP score: -3 - 5 - 4 = -12

Multiple alignment with SOP scores is NP-hard
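
A short sketch of the sum-of-pairs computation under the scoring scheme of this example (match 0, mismatch or indel -1). The function names are illustrative assumptions; columns where both rows have a gap are skipped because they do not appear in the induced pairwise alignment.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=0, mismatch=-1, gap="-"):
    """Score of the pairwise alignment induced by two rows."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # column absent from the induced pairwise alignment
        score += match if a == b else mismatch  # an indel is scored like a mismatch
    return score


def sum_of_pairs(rows, **scoring):
    """Sum the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(rows, 2))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print([pair_score(a, b) for a, b in combinations(rows, 2)])
print(sum_of_pairs(rows))
```

Scoring an alignment is cheap; it is the search for the best-scoring alignment that is NP-hard.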

Can we simply generalize the dynamic programming method? In theory yes; in practice no.

The running time and the memory size grow as N^K, where N is the sequence length and K is the number of sequences.

In practice this is infeasible for K > 3.

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains an optimal value.

Aligned:

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Unaligned:

SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA
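
To get a feel for the N^K growth, the sketch below counts the cells of a K-dimensional dynamic programming table. It rests on assumptions not stated in the slides: the table has N+1 entries per dimension, and each cell looks back at 2^K - 1 neighbours (one per non-empty subset of sequences that advance), as in the standard generalization of pairwise DP.

```python
def dp_table_cells(n, k):
    """Number of cells in a k-dimensional DP table for k sequences of length n."""
    return (n + 1) ** k


def predecessors_per_cell(k):
    """Neighbouring cells consulted per cell: every non-empty subset of rows may advance."""
    return 2 ** k - 1


for k in (2, 3, 4, 5):
    print(k, dp_table_cells(100, k), predecessors_per_cell(k))
```

For sequences of length 100, moving from two to four sequences multiplies the table size by roughly ten thousand.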

Consensus MSA

• Score: the sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

• Harder to find and to define, because the consensus sequence itself is difficult to define

• Used mainly for computational proofs

Scoring metrics - examples

Sum of pairs

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

Pair                       Matching positions
-SCGPFIRV / MSCGPGLRA      5
-SCGPFIRV / -SCTPHL-A      3
MSCGPGLRA / -SCTPHL-A      5
Total                      13

Distance from consensus

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Column         1  2  3  4  5  6  7  8  9   Total
Homogeneity    4  6  4  2  6  1  4  5  3   35
Distance       2  0  2  4  0  5  2  1  3   19
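
The per-column numbers of the consensus example can be reproduced with a few lines. This sketch assumes that homogeneity is the count of the most common character in a column (with a gap treated as a character) and that distance is the number of rows that differ from it; these readings are inferred from the totals, not stated on the slide.

```python
from collections import Counter


def column_homogeneity(rows):
    """Count of the most common character in each column (gaps count as characters)."""
    return [Counter(col).most_common(1)[0][1] for col in zip(*rows)]


def distance_from_consensus(rows):
    """Number of rows that disagree with the most common character, per column."""
    return [len(rows) - h for h in column_homogeneity(rows)]


rows = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A",
        "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]
print(column_homogeneity(rows), sum(column_homogeneity(rows)))
print(distance_from_consensus(rows), sum(distance_from_consensus(rows)))
```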

Tree MSA

• Input: a tree T with a string at each leaf
• Phylogenetic alignment for T: an assignment of a string to each internal node
• Score: the (weighted) sum of scores along the edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star

[Figure: example tree whose nodes are labeled CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]

Profile Representation of MA

-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G

A  0  1  0  0  0  0  1  0  0  .8 0  0  0  0
C  .6 0  0  0  1  0  0  .4 1  0  .6 .2 0  0
G  0  0  1  .2 0  0  0  0  0  .2 0  0  .4 1
T  .2 0  0  0  0  1  0  .6 0  0  0  0  .2 0
-  .2 0  0  .8 0  0  0  0  0  0  .4 .8 .4 0

• Alternatively, use log odds:
  - p_i(a) = fraction of a's in column i
  - p(a) = fraction of a's overall
  - log( p_i(a) / p(a) )
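
A profile and its log-odds variant can be computed directly from the rows of an alignment. The sketch below is illustrative only: the function names are assumptions, the natural logarithm is used, and a character that never occurs in a column is given a score of minus infinity (real tools typically add pseudocounts instead).

```python
import math
from collections import Counter


def profile(rows, alphabet="ACGT-"):
    """Fraction of each character in each column of the alignment."""
    n = len(rows)
    return [{a: Counter(col)[a] / n for a in alphabet} for col in zip(*rows)]


def log_odds(rows, alphabet="ACGT-"):
    """log(p_i(a) / p(a)), with p(a) the overall fraction of a in the alignment."""
    counts = Counter("".join(rows))
    total = sum(counts.values())
    scores = []
    for column in profile(rows, alphabet):
        scores.append({
            a: math.log(p / (counts[a] / total)) if p > 0 else float("-inf")
            for a, p in column.items()
        })
    return scores


rows = ["-AGGCTATCACCTG", "TAG-CTACCA---G", "CAG-CTACCA---G",
        "CAG-CTATCAC-GG", "CAG-CTATCGC-GG"]
print(profile(rows)[0])
print(log_odds(rows)[1]["A"])
```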

Aligning a sequence to a profile

• The key in pairwise alignment is scoring two positions x, y: σ(x, y)

• For a letter x and a column y of a profile, σ(x, y) = value of x in column y

• Invent a score for σ(x, -)
• Run the DP algorithm for pairwise alignment
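
The steps above can be sketched as a global dynamic program over a sequence and the columns of a profile. Everything specific here is an assumption for illustration: the name `align_to_profile`, the invented score `gap_score` for a letter aligned against no column, and the choice to score a skipped column by its gap fraction.

```python
def align_to_profile(seq, prof, gap_score=-1.0):
    """Best global score of aligning `seq` to the profile columns `prof`.

    prof is a list of dicts mapping a character (including '-') to its
    fraction in that column, so sigma(x, column) is the fraction of x.
    """
    n, m = len(seq), len(prof)
    # score[i][j]: best score for seq[:i] against the first j columns
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap_score  # letter against no column
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + prof[j - 1].get("-", 0.0)  # column skipped
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            column = prof[j - 1]
            score[i][j] = max(
                score[i - 1][j - 1] + column.get(seq[i - 1], 0.0),  # letter in column
                score[i - 1][j] + gap_score,                        # letter unaligned
                score[i][j - 1] + column.get("-", 0.0),             # gap in the sequence
            )
    return score[n][m]


toy_profile = [{"A": 1.0}, {"C": 0.5, "-": 0.5}, {"G": 1.0}]
print(align_to_profile("AG", toy_profile))
print(align_to_profile("ACG", toy_profile))
```

Only the score is returned; a traceback over the same table would recover the alignment itself.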

Aligning alignments

• Given two alignments, how can we align them?

• Hint: use DP on the corresponding profiles

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2:
w GGACGTACC--
v GGACCT-----

Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----

Multiple Alignment: Greedy Heuristic

• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.

k sequences:
u1 = ACGTACGTACGT...
u2 = TTAATTAATTAA...
u3 = ACTACTACTACT...
...
uk = CCGGCCGGCCGG

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT...
u2 = TTAATTAATTAA...
...
uk = CCGGCCGGCCGG...

ClustalW (Thompson, Higgins, Gibson '94)

• A popular multiple alignment tool
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build a guide tree
  3) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

• Align each sequence against every other sequence, giving a similarity matrix

• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)
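
Percent identity is easy to compute once two sequences are aligned. The sketch below is a simplification and not ClustalW's code: it assumes the rows are already aligned to equal length, and it divides by the alignment length.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Exact matches divided by the length of the pairwise alignment."""
    matches = sum(a == b and a != gap for a, b in zip(row_a, row_b))
    return matches / len(row_a)


def similarity_matrix(rows):
    """Lower-triangular matrix of pairwise identities, keyed by (i, j) with i > j."""
    return {(i, j): percent_identity(rows[i], rows[j])
            for i in range(len(rows)) for j in range(i)}


rows = ["GGGCACTGCAT", "GGTTACGTC--", "GGGAACTGCAG"]
for pair, value in sorted(similarity_matrix(rows).items()):
    print(pair, round(value, 2))
```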

Step 2: Guide Tree

• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  - Selects the closest pair of sequences/subtrees
  - Combines them into a single subtree
  - Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont'd)

[Figure: guide tree that joins v1 and v3 first, then v4, then v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
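
The merge order above can be reproduced with a small greedy clustering of the similarity values .17, .87, .28, .59, .33, .62. Note the swap: this sketch uses average linkage between clusters, a simpler stand-in for neighbor joining, and the names `guide_order` and `sim` are illustrative assumptions.

```python
def guide_order(names, sim):
    """Repeatedly merge the two most similar clusters and record each merge.

    sim maps frozenset({a, b}) to the similarity of sequences a and b.
    Cluster similarity is the average over member pairs (average linkage,
    a simplification, not neighbor joining).
    """
    def linkage(c1, c2):
        return sum(sim[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        pairs = [(linkage(c1, c2), c1, c2)
                 for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]]
        _, best1, best2 = max(pairs, key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (best1, best2)] + [best1 + best2]
        merges.append((best1, best2))
    return merges


sim = {frozenset(p): s for p, s in [
    (("v1", "v2"), 0.17), (("v1", "v3"), 0.87), (("v2", "v3"), 0.28),
    (("v1", "v4"), 0.59), (("v2", "v4"), 0.33), (("v3", "v4"), 0.62)]}
for left, right in guide_order(["v1", "v2", "v3", "v4"], sim):
    print(left, "+", right)
```

On this matrix the merges are v1 with v3, then v4, then v2, matching the order of the three alignments listed above.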


Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad hoc rules: weighting, different matrices, special gap scores...

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well conserved a column is

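CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences

2. Build a tree by joining the similar sequences in order of their similarity

   [Figure: tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]

3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment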

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one decreases
• A special method sets the cost of inserting gaps

Disadvantages
• Because the construction is irreversible, the result depends heavily on the quality of the alignment of the first pairs
• Gaps never disappear: "once a gap, always a gap"
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable results
• Many parameters can be set
• Everyone uses it

CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)

• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences

• CLUSTALW is not an optimal algorithm: better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.

[Figure: best (optimal) pairwise alignment compared with the projected pairwise alignment]

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output: Aln format

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast

MSA Editing: Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - FASTA

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making an MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

[Figure: IPNS converts ACV to isopenicillin N (Fe+2, ascorbate, O2, 2 H2O); active-site scheme showing Fe bound to His268, His212, Asp214, the ACV sulfur, O2 (NO) and H2O]

Research: IPNS

• Goal: identify the Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation:
1. Obtain sequences (e.g., from NCBI): IPNS AND Bacteria[Organism]
2. MSA (ClustalW) and search for conserved residues in the MSA

MSA - bacteria only

Not enough variation

MSA - bacteria & fungi

Not enough variation

[Figure: alignments labeled "bacteria & fungi" and "bacteria"]

Step 2

Goal: add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query along its entire length
• Export in FASTA format
• Run CLUSTALW


• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to identify the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (ClustalW)
2. Construct a consensus sequence and perform a new search, OR construct a profile and perform a new search
3. MSA (ClustalW) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of each alignment column.

• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Consensus:
A A C T T G T
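
Taking the most frequent character per column is a one-liner. In this sketch the function name `consensus` is an assumption, and ties are broken in favour of the character seen first in the column, which is a choice the slides do not specify.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character of each alignment column (first seen wins ties)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(rows))
```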

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in each alignment column.

• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6     7
A    1     0.67  0     0     0     0     0
T    0     0.33  0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0

Profile vs Consensus

• The following multiple alignments have the same consensus:

Alignment 1        Alignment 2
A A C T T G C      A A C T T G T
A A G T C G T      A A C T T G T
C A C T T C T      A A C T T C T

Consensus: A A C T T G T

• But they have different profiles:

Alignment 1
     1     2     3     4     5     6     7
A    0.67  1     0     0     0     0     0
T    0     0     0     1     0.67  0     0.67
C    0.33  0     0.67  0     0.33  0.33  0.33
G    0     0     0.33  0     0     0.67  0

Alignment 2
     1     2     3     4     5     6     7
A    1     1     0     0     0     0     0
T    0     0     0     1     1     0     1
C    0     0     1     0     0     0.33  0
G    0     0     0     0     0     0.67  0
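
The difference between the two summaries is easy to demonstrate: two alignments can agree on every consensus character and still differ in their column frequencies. The sketch below uses two illustrative three-row alignments; the function names and the rounding to two decimals are assumptions.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character of each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def profile(rows, alphabet="ATCG"):
    """Frequency of each character in each column, rounded to two decimals."""
    return [{a: round(Counter(col)[a] / len(rows), 2) for a in alphabet}
            for col in zip(*rows)]


first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]
print(consensus(first), consensus(second))
print(profile(first)[0], profile(second)[0])
```

A search with the consensus treats both families identically, while a profile search can still tell them apart.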

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu

PSI-BLAST

• Position-Specific Iterated BLAST: an automatic profile-like search

1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST profile search
4. Final results

• Alignment with distantly related proteins

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme       Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type    100                 0.4       388            969
His48Ala     16                  0.56      75             134
His63Ala     31                  1.0       142            142
His114Ala    28                  0.85      125            147
His124Ala    48                  0.84      321            381
His135Ala    22                  0.59      117            198
His212Ala    <0.007              nd        nd
His268Ala    <0.003              nd        nd
Asp14Ala     5                   0.86      0.56           0.7
Asp113Ala    63                  0.45      238            528
Asp131Ala    68                  0.48      363            755
Asp203Ala    32                  0.91      123            135
Asp214Ala    <0.004              nd        nd

Isopenicillin N Synthase - IPNS

Page 2: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Example

2

VTISCTGSSSNIGAG-NHVKWYQQLPGVTISCTGTSSNIGS--ITVNWYQQLPGLRLSCSSSGFIFSS--YAMYWVRQAPGLSLTCTVSGTSFDD--YYSTWVRQPPGPEVTCVVVDVSHEDPQVKFNWYVDG--ATLVCLISDFYPGA--VTVAWKADS--AALGCLVKDYFPEP--VTVSWNSG---VSLTCLVKGFYPSD--IAVEWWSNG--

Why multiple sequence alignment

bull Structure similarity ndash aa that play the same role in each structure are in the same column

bull Evolutionary similarity ndash aa related to the same ancestor are in the same column

bull Functional similarity - aa with the same function are in the same column

bull When seqs are closely related structure-evolution-functional similarity equivalent

3

Multiple Alignment Definition

CG copy Ron Shamir 065

Input Sequences S1 S2 hellip Sk over the same alphabetOutput Gapped sequences Srsquo1 Srsquo2 hellip Srsquok of equal length

1 |Srsquo1|= |Srsquo2|=hellip= |Srsquok|

2 Removal of spaces from Srsquoi gives Si for all i

Example

CG copy Ron Shamir 066

S1=AGGTC

S2=GTTCG

S3=TGAACPossible alignment

A-T

GGG

G--

TTA

-TA

CCC

-G-

Possible alignment

AG-

GTT

GTG

T-A

--A

CCA

-GC

CG copy Ron Shamir 068

Example

CG copy Ron Shamir 069

Multiple sequence alignment of 7 neuroglobins using clustalx

Human-centric beta globin Multiple Alignment

CG copy Ron Shamir 0610 httpglobincsepsuedu

MSA applicationsbull Generate protein familiesbull Extrapolation ndash membership of uncharacterized sequence to a

protein familybull Understand evolution - preliminary step in molecular evolution

analysis for constructing phylogenetic trees eg is the duck evolutionary closer to a lion or to a fruit fly

bull Pattern identification ndash find the important (conserved) region in the protein - conserved positions may characterize a function

bull Domain identification ndash Build a consensusprofilemotif that describe the protein family help to describe new members of the family

bull DNA regulatory elementsbull Structure prediction (secondary and 3D model)bull Alignment of multiple sequences may reveal weak signals

11

Protein Phylogenies ndash Example

CG copy Ron Shamir 0612

Kinase domain

Scoring alignments

bull Given input seqs S1 S2 hellip Sk find a multiple alignment of optimal score

bull Scores previewndash Sum of pairsndash Consensusndash Tree

bull Varying methods (and controversy)

CG copy Ron Shamir 0615

Sum of Pairs scoreDef Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example

x AC-GCGG-C y AC-GC-GAG z GCCGC-GAG

Induces

x ACGCGG-C x AC-GCGG-C y AC-GCGAGy ACGC-GAC z GCCGC-GAG z GCCGCGAG

CG copy Ron Shamir 0616

S(M) = kltl (Srsquok Srsquol)

SOP Score Example

CG copy Ron Shamir 0617

Consider the following alignment

AC-CDB--C-ADBDA-BCDAD

Scoring scheme match - 0mismatchindel - -1

SP score -3 -5 -4 =-12

Multiple Alignment with SOP scores is NP-hard

18

בתיאוריה כן באופן האם אפשר פשוט להכליל את שיטת התיכנות הדינמי פרקטי לא

מדובר בזמן ריצהובגודל זיכרון הגדלים

כאשר NKכ Nהוא אורך הרצף Kמספר הרצפים

למעשה בלתי אפשריKgt3 עבור

הבעיה החישובית בהינתן מספר רצפים מצא את ההתאמה שלהם למסגרת תקבל ערך אופטימלי שפונקצית מרחקמשותפת )עי הוספת רווחים( כך

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

SCGPFIRVMSCGPGLRASCTPHLAMSCPKIRGMSLPLLRNMSHKPALRA

Consensus MSA

bull Score ndashsum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

bull More difficult to finddefine as the consensus sequence itself is difficult to define

bull Used mainly for computational proofs

CG copy Ron Shamir 0619

20

-SCGPFIRVMSCGPGLRA-SCTPHL-A

-SCGPFIRVMSCGPGLRA

-SCGPFIRV-SCTPHL-A

MSCGPGLRA-SCTPHL-A

5 3 5 13

Scoring metrics -examplesSum of pairs

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

3 5 4 1 6 2 4 6 354סהכ הומוגניות

3 1 2 5 0 4 2 0 42 19סהכ מרחק

Distance from concensus

Tree MSA

bull Input Tree T a string for each leafbull Phylogenetic alignment for T Assignment of a string

to each internal node

bull Score ndash (weighted) sum of scores along edgesbull Goal find phyl alignment of optimal scorebull Consensus = phyl Alignment where T is a star

CG copy Ron Shamir 0621

CTGG

CCGG

GTTC

CTTG

GTTG

GTTG

CTGG

Profile Representation of MA

bull Alternatively use log oddsbull pi(a) = fraction of arsquos in col i bull p(a) = fraction of arsquos overallbull log pi(a)p(a)

CG copy Ron Shamir 0623

- A G G C T A T C A C C T G T A G ndash C T A C C A - - - G C A G ndash C T A C C A - - - G C A G ndash C T A T C A C ndash G G C A G ndash C T A T C G C ndash G G

A 1 1 8 C 6 1 4 1 6 2G 1 2 2 4 1T 2 1 6 2- 2 8 4 8 4

Aligning a sequence to a profile

bull Key in pairwise alignment is scoring two positions xy (xy)

bull For a letter x and a column y in a profile (xy)=value of x in col Y

bull Invent a score for (x-)bull Run the DP alg for pairwise alignment

CG copy Ron Shamir 0625

Aligning alignments

bull Given two alignments how can we align them

bull Hint use DP on the corresponding profiles

CG copy Ron Shamir 0626

x GGGCACTGCATy GGTTACGTC-- Alignment 1 z GGGAACTGCAG

w GGACGTACC-- Alignment 2v GGACCT-----

x GGGCACTGCATy GGTTACGTC--z GGGAACTGCAG

w GGACGTACC-- v GGACCT-----

Multiple Alignment Greedy Heuristic

bull Choose most similar pair of sequences and combine into a profile thereby reducing alignment of k sequences to an alignment of of k-1 sequencesprofiles Repeat

CG copy Ron Shamir 0627

u1= ACGTACGTACGThellip

u2 = TTAATTAATTAAhellip

u3 = ACTACTACTACThellip

hellip

uk = CCGGCCGGCCGG

u1= ACgtTACgtTACgcThellip

u2 = TTAATTAATTAAhellip

hellip

uk = CCGGCCGGCCGGhellip

k

k-1

ClustalW Thompson Higgins Gibson 94

bull Popular multiple alignment tool todaybull lsquoWrsquo = lsquoweightedrsquo (different parts of alignment

are weighted differently)bull Three-step process

1) Construct pairwise alignments2) Build Guide Tree3) Progressive alignment guided by the tree

CG copy Ron Shamir 0628

Step 1 Pairwise Alignment

bull Aligns each sequence against each other giving a similarity matrix

bull Similarity = exact matches sequence length (percent identity)

CG copy Ron Shamir 0629

v1 v2 v3 v4

v1 -v2 17 -v3 87 28 -v4 59 33 62 -

)17 means 17 identical(

Step 2 Guide Tree

bull Use the similarity method to create a Guide Tree by applying some clustering methodbull Guide tree roughly reflects evolutionary relationsbull ClustalW uses the neighbor-joining method which

iterativelybull Selects the closest pair of sequencessubtreesbull Combines them into a single subtreebull Re-computes the distances from the new subtree to all the other

sequencessubtrees

CG copy Ron Shamir 0630

Step 2 Guide Tree (contrsquod)

CG copy Ron Shamir 0631

v1

v3

v4 v2

Calculatev13 = alignment (v1 v3)v134 = alignment((v13)v4)v1234 = alignment((v134)v2)

v1 v2 v3 v4

v1 -v2 17 -v3 87 28 -v4 59 33 62 -

Step 3 Progressive Alignmentbull Start by aligning the two most similar sequencesbull Using the guide tree add in the most similar pair

(seq-seq seq-prof or prof-prof)bull Insert gaps as necessarybull Many ad-hoc rules weighting different matrices

special gap scoreshellip

CG copy Ron Shamir 0632

FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDFOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFDFOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFDFOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQFOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

33

בונים עץ עי חיבור הרצפים הדומים לפי סדר התאמתם2

Hbb Hbb Hba Hba Myg Human Horse Human Horse Whale

בונים את ההתאמה לפי הסדר המוכתב3 עי העץ

ישנם שלושה מצביםהתאמת זוג רצפים - התאמה בין התאמות-הוספת רצף בודד להתאמה קיימת-

בונים טבלה של מרחק עריכה בין כל שני רצפים1CLUSTALW algorithm

34

נקודות עדינותנותנים משקלות שונים לרצפים כך שאם יש מספר רצפים מאד דומים bull

המשקל היחסי של כל אחד יקטןשיטה מיוחדת לקביעת מחיר להכנסת רווחיםbull

חסרונותמאחר שהבניה היא בלתי הפיכה יש תלות גדולה מאד באיכות ההתאמה bull

של הזוגות הראשוניםOnce a Gap Always a gapרווחים אינם נעלמים bullרגישות רבה לקנס הרווחיםbull

יתרונותמהירbullסבירbullאפשר לקבוע הרבה פרמטריםbullכולם משתמשיםbull

CLUSTALW algorithm

CLUSTALW algorithmbull We can deduce a pairwise alignment for each two

sequences in the multiple alignment (projected pairwise alignment)

bull The projected pairwise alignment is NOT the best pairwise alignment for the two sequences

bull CLUTALW is not an optimal algorithm Better alignments might exist The algorithm yields a possible alignment but not necessarily the best one35

Best Pairwise alignment (optimal)

Projected Pairwise alignment

ClustalW at EMBL

36

httpwwwebiacukclustalw Clustalw at the SRS site at EBI

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output Aln format

37

MSA algorithms

bull Progressive methods (CLUSTALWT-Coffee) bull Iterative methods (Dialign)bull Direct optimization (Monte Carlo genetic

algorithms)bull Local methods eMotifs Blocks Psi-blast

38

MSA Editing Jalview

39

Conservation

wwwesembnetorgServicesMolBiojalviewindexhtml

MSA formats - fasta

40

MSA formats - Aln

41

MSA formats - MSF

42

Example 1a a good MSA

43

44

Example 1b making MSA of distantly related proteins

45

Example 1c including more distant relatives in the MSA

Example 2 Isopenicillin N Synthasebull Mononuclear iron proteins ndash electron carrier proteins

Iron atoms are bound to amino acid side chainsbull In IPNS the metal ion is coordinated by three protein

residues bull IPNS is involved in biosynthesis of penicillin

46

N

SN

OCOOHO

H

H

Me

Me

COOH

NH2

N

SN

OCOOH

NH2 Me

Me

COOHO

ACV

Isopenicillin N

Fe+2

Ascorbate

O2

2H2O Fe

N

NHis268

H

N

NHis212

H

SACV

O2 (NO)

H2O

OAsp214

O

Research IPNS

bull Goal Identify Fe+2 binding residuesbull Possible solutions

1 In the lab2 Bioinformatic approach (comparing different

IPNS sequences)

47

Step 1

Multiple alignment of known IPNS

Implementation1 Obtain sequence (eg for NCBI) IPNS AND Bacteria[Organism] 2 MSA (clustalw) and search for conserved residues in the MSA

48

MSA ndash bacteria only

49

Not enough variation

50

MSA ndash bacteria amp fungi

Not enough variation

bacteria amp fungi

bacteria

Step 2Goal Add more enzymes similar to IPNS

Implementationbull Search in httpwwwexpasyorgtoolsblastbull Blast IPNS_CEPAC as querybull Select sequences similar to the query in the entire lengthbull Export in FASTA formatbull Run CLUSTALW

51

52

Step 2Goal Add more enzymes similar to IPNS

ImplementationbullSearch in httpwwwexpasyorgtoolsblast

bullBlast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire length

bullExport in FASTA formatbullRun CLUSTALW

53

bull New multiple alignment narrowing down the possibilities

54

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite similarbull Not enough variability to categorize the active sitesbull We need to obtain even more distant sequences

(distant homologs)

55

Step 3Using the results of the MSA for further searches

Implementation1 Obtain an MSA (clustalw)2 - Construct a consensus sequence and perform a new search

OR - Construct a profile and perform a new search

3 MSA (clustalw) and search for conserved residues in the MSA

56

Consensus Sequence

bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column

bull Consensus each position reflects the most common character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

57

Profilebull We can deduce a statistical model describing the

multiple sequence alignment A Profile holds statistical information about characters in alignment at each column

bull Profile each position reflects the frequency of the character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 1 067 0 0

T 0 033 1 1

C 0 0 0 0

G 0 0 0 0

58

Profile vs ConsensusbullThe following multiple alignments will have

the same consensusA A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

59

Profile vs ConsensusbullBut have a different profile

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 066 1 0 0

T 0 0 0 1

C 033 0 066

0

G 0 0 033

0

1 2 3 4 5 6

A 1 1 0 0

T 0 0 0 1

C 0 0 1 0

G 0 0 0 0

60

Sequence LOGO A A C C G C T C T T

A G C C G C G C - T

A - C A G A G C C T

A A G C A C G C - T

A C G G G T G C T T

A T G C ndash C G C - T

A gc c g G C T

httpweblogoberkeleyedu

61

Psi BlastbullPosition Specific Iterated - automatic profile-

like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

62

bull Alignment with distantly related proteins

63

bull Experimental evidence supports the finding that His212 Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme Relative Km kcat kcatKmActivity (mM) (min-1) (mM-1

min-1)

Wild type 100 04 388 969His48Ala 16 056 75 134His63Ala 31 10 142 142His114Ala 28 085 125 147His124Ala 48 084 321 381His135Ala 22 059 117 198His212Ala lt0007 nd ndHis268Ala lt0003 nd nd Asp14Ala 5 086 056 07Asp113Ala 63 045 238 528Asp131Ala 68 048 363 755Asp203Ala 32 091 123 135Asp214Ala lt0004 nd nd

Isopenicillin N Synthase

64

ndash IPNS

65

66

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies ndash Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (contrsquod)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA ndash bacteria only
  • MSA ndash bacteria amp fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 3: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Why multiple sequence alignment

bull Structure similarity ndash aa that play the same role in each structure are in the same column

bull Evolutionary similarity ndash aa related to the same ancestor are in the same column

bull Functional similarity - aa with the same function are in the same column

bull When seqs are closely related structure-evolution-functional similarity equivalent

3

Multiple Alignment Definition

CG copy Ron Shamir 065

Input Sequences S1 S2 hellip Sk over the same alphabetOutput Gapped sequences Srsquo1 Srsquo2 hellip Srsquok of equal length

1 |Srsquo1|= |Srsquo2|=hellip= |Srsquok|

2 Removal of spaces from Srsquoi gives Si for all i

Example

CG copy Ron Shamir 066

S1=AGGTC

S2=GTTCG

S3=TGAACPossible alignment

A-T

GGG

G--

TTA

-TA

CCC

-G-

Possible alignment

AG-

GTT

GTG

T-A

--A

CCA

-GC

CG copy Ron Shamir 068

Example

CG copy Ron Shamir 069

Multiple sequence alignment of 7 neuroglobins using clustalx

Human-centric beta globin Multiple Alignment

CG copy Ron Shamir 0610 httpglobincsepsuedu

MSA applicationsbull Generate protein familiesbull Extrapolation ndash membership of uncharacterized sequence to a

protein familybull Understand evolution - preliminary step in molecular evolution

analysis for constructing phylogenetic trees eg is the duck evolutionary closer to a lion or to a fruit fly

bull Pattern identification ndash find the important (conserved) region in the protein - conserved positions may characterize a function

bull Domain identification ndash Build a consensusprofilemotif that describe the protein family help to describe new members of the family

bull DNA regulatory elementsbull Structure prediction (secondary and 3D model)bull Alignment of multiple sequences may reveal weak signals

11

Protein Phylogenies ndash Example

CG copy Ron Shamir 0612

Kinase domain

Scoring alignments

• Given input sequences S1, S2, ..., Sk, find a multiple alignment of optimal score

• Scores preview:
  - Sum of pairs
  - Consensus
  - Tree

• Varying methods (and controversy)

Sum of Pairs score

Def: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

The sum-of-pairs score of a multiple alignment M is the sum of the scores of all its induced pairwise alignments:

$S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$
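As a rough illustration of the definition, an induced pairwise alignment can be obtained by taking two rows of the multiple alignment and dropping the columns in which both rows have a space. The function name `induced_pairwise` is an assumption made for this sketch.

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment, dropping gap-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))
print(induced_pairwise(x, z))
print(induced_pairwise(y, z))
```

Only the pairs (x, y) and (y, z) share an all-gap column, so only those two projections are shorter than the multiple alignment.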

SOP Score Example

Consider the following alignment:

AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match = 0, mismatch/indel = -1

SP score: -3 - 5 - 4 = -12
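The arithmetic of this example can be reproduced with a few lines of code. This is a minimal sketch under two assumptions that the slide does not state explicitly: a column where both rows have a gap contributes 0, and the names `pair_score` and `sp_score` are made up for the illustration.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=0, mismatch=-1, indel=-1, gap="-"):
    """Score one induced pairwise alignment column by column."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # assumption: a gap-gap column contributes 0
        if a == gap or b == gap:
            score += indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def sp_score(rows, **scoring):
    """Sum of the pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(rows, 2))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(rows))  # -12
```

The three pairwise scores are -3 (rows 1 and 2), -4 (rows 1 and 3), and -5 (rows 2 and 3).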

Multiple Alignment with SOP scores is NP-hard

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.

Input sequences:

SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA

Aligned sequences:

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Can we simply generalize the dynamic programming method? In theory yes, in practice no.

The running time and the memory grow as N^K, where N is the sequence length and K is the number of sequences.

In practice this is infeasible for K > 3.
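To get a feeling for the N^K growth, the sketch below counts the cells of the K-dimensional dynamic programming table and the number of neighboring cells each cell has to consult. The function names and the sequence length of 100 are assumptions chosen for illustration.

```python
def dp_table_cells(n, k):
    """Cells in the k-dimensional DP table for k sequences of length n."""
    return (n + 1) ** k


def predecessors_per_cell(k):
    """Each cell considers every non-empty subset of sequences advancing by one."""
    return 2 ** k - 1


for k in (2, 3, 4, 5):
    print(k, dp_table_cells(100, k), predecessors_per_cell(k))
```

Already at K = 4 and N = 100 the table has roughly 10^8 cells, each with 15 predecessors.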

Consensus MSA

• Score - sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

• Harder to find, because the consensus sequence itself is difficult to define

• Used mainly for computational proofs

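Scoring metrics - examples

Sum of pairs

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

Induced pairs and their number of identical positions:

-SCGPFIRV    -SCGPFIRV    MSCGPGLRA
MSCGPGLRA    -SCTPHL-A    -SCTPHL-A
    5            3            5         total: 13

Distance from consensus

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Homogeneity per column (count of the most frequent character): 4 6 4 2 6 1 4 5 3, total homogeneity 35

Distance per column (rows that differ from the most frequent character): 2 0 2 4 0 5 2 1 3, total distance 19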
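The per-column numbers of the consensus example can be recomputed as follows. This is a sketch under the assumption that a gap counts as an ordinary character when picking the most frequent one; the function names are made up for the illustration.

```python
from collections import Counter


def column_homogeneity(rows):
    """For each column, the count of its most frequent character."""
    return [Counter(col).most_common(1)[0][1] for col in zip(*rows)]


def distance_from_consensus(rows):
    """For each column, the number of rows that differ from the most frequent character."""
    return [len(rows) - h for h in column_homogeneity(rows)]


rows = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A", "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]
print(column_homogeneity(rows), sum(column_homogeneity(rows)))
print(distance_from_consensus(rows), sum(distance_from_consensus(rows)))
```

For the six-sequence alignment this gives a total homogeneity of 35 and a total distance of 19.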

Tree MSA

• Input: tree T, a string for each leaf
• Phylogenetic alignment for T: assignment of a string to each internal node
• Score - (weighted) sum of scores along edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star

[Figure: example tree whose nodes are labeled CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]

Profile Representation of MA

- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

     1    2    3    4    5    6    7    8    9    10   11   12   13   14
A    0    1    0    0    0    0    1    0    0    .8   0    0    0    0
C    .6   0    0    0    1    0    0    .4   1    0    .6   .2   0    0
G    0    0    1    .2   0    0    0    0    0    .2   0    0    .4   1
T    .2   0    0    0    0    1    0    .6   0    0    0    0    .2   0
-    .2   0    0    .8   0    0    0    0    0    0    .4   .8   .4   0

• Alternatively, use log odds:
  - p_i(a) = fraction of a's in column i
  - p(a) = fraction of a's overall
  - log( p_i(a) / p(a) )
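A profile and its log-odds variant can be computed directly from the rows of the alignment. The sketch below is one possible reading of the slide: it treats the gap as a regular character, uses the natural logarithm, and only reports characters that actually occur in a column (all three are assumptions, as are the function names).

```python
import math
from collections import Counter


def profile(rows):
    """Per-column character frequencies p_i(a)."""
    n = len(rows)
    return [{ch: c / n for ch, c in Counter(col).items()} for col in zip(*rows)]


def background(rows):
    """Overall character frequencies p(a) across the whole alignment."""
    counts = Counter("".join(rows))
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}


def log_odds(rows):
    """log(p_i(a) / p(a)) for every character observed in column i."""
    bg = background(rows)
    return [{ch: math.log(f / bg[ch]) for ch, f in col.items()} for col in profile(rows)]


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
print(profile(rows)[0])
print(log_odds(rows)[1])
```

In this alignment A makes up one fifth of all characters, so a column that is entirely A gets a log-odds value of log 5.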

Aligning a sequence to a profile

• Key in pairwise alignment is scoring two positions x, y: σ(x, y)

• For a letter x and a column y in a profile, σ(x, y) = value of x in column y

• Invent a score for σ(x, -)

• Run the DP algorithm for pairwise alignment
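The recipe above can be sketched as a standard global-alignment DP in which one of the two "sequences" is a list of profile columns. The gap score of -0.5 is an invented value (the slide only says to invent one), the same value is used for a letter against a gap and a column against a gap, and the three-column profile is a toy example.

```python
def align_to_profile(seq, columns, gap_score=-0.5):
    """Best global alignment score of seq against a list of profile columns.

    columns[j] maps a character to its value in column j.
    gap_score is a hypothetical choice for sigma(x, -) and sigma(-, column).
    """
    n, m = len(seq), len(columns)
    # dp[i][j] = best score of aligning seq[:i] with columns[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap_score
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap_score
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = columns[j - 1].get(seq[i - 1], 0.0)
            dp[i][j] = max(
                dp[i - 1][j - 1] + sigma,  # letter aligned to column
                dp[i - 1][j] + gap_score,  # letter aligned to a gap
                dp[i][j - 1] + gap_score,  # column aligned to a gap
            )
    return dp[n][m]


columns = [{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"G": 1.0}]
print(align_to_profile("ACG", columns))  # 2.5
print(align_to_profile("AG", columns))   # 1.5
```

For "AG" the best choice skips the middle column: A and G each score 1 against their columns and the skipped column costs 0.5.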

Aligning alignments

• Given two alignments, how can we align them?

• Hint: use DP on the corresponding profiles

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2:
w GGACGTACC--
v GGACCT-----

Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----

Multiple Alignment Greedy Heuristic

• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.

k sequences:

u1 = ACGTACGTACGT...
u2 = TTAATTAATTAA...
u3 = ACTACTACTACT...
...
uk = CCGGCCGGCCGG

k-1 sequences/profiles:

u1 = ACg/tTACg/tTACg/cT...
u2 = TTAATTAATTAA...
...
uk = CCGGCCGGCCGG...

ClustalW (Thompson, Higgins, Gibson '94)

• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build guide tree
  3) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

• Aligns each sequence against every other, giving a similarity matrix

• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)
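The similarity entries can be sketched as follows. The slide does not say which length goes into the denominator; this sketch divides by the length of the pairwise alignment, which is only one common convention. The sequence names and the toy pairwise alignments are hypothetical.

```python
from itertools import combinations


def percent_identity(row_a, row_b, gap="-"):
    """Exact matches divided by the aligned length (an assumed convention)."""
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return matches / len(row_a)


def similarity_matrix(names, pairwise):
    """pairwise[(a, b)] holds the two gapped rows of the alignment of a and b."""
    return {(a, b): percent_identity(*pairwise[(a, b)]) for a, b in combinations(names, 2)}


# Hypothetical pairwise alignments of three toy sequences.
pairwise = {
    ("v1", "v2"): ("ACGT", "ACGA"),
    ("v1", "v3"): ("ACGT", "A-GT"),
    ("v2", "v3"): ("ACGA", "A-GT"),
}
print(similarity_matrix(["v1", "v2", "v3"], pairwise))
```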

Step 2: Guide Tree

• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  - Selects the closest pair of sequences/subtrees
  - Combines them into a single subtree
  - Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont'd)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

[Guide tree figure: leaves v1, v3, v4, v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
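The merge order v1+v3, then v4, then v2 can be reproduced from the similarity matrix with a simple agglomerative loop. Note that this sketch uses average linkage (the mean similarity between two clusters), not the neighbor-joining criterion that ClustalW is described as using; it is a simplified stand-in, and the function name is an assumption.

```python
def guide_tree_order(names, sim):
    """Repeatedly merge the most similar pair of clusters (average linkage).

    sim maps frozenset({a, b}) to the similarity of sequences a and b.
    Returns the merges in the order they are made.
    """
    clusters = [(name,) for name in names]
    merges = []

    def cluster_sim(c1, c2):
        pairs = [sim[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        best = max(
            ((c1, c2) for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]),
            key=lambda pair: cluster_sim(*pair),
        )
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
        merges.append(best)
    return merges


sim = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33,
    frozenset(("v3", "v4")): 0.62,
}
for left, right in guide_tree_order(["v1", "v2", "v3", "v4"], sim):
    print(left, "+", right)
```

With the matrix from the slide, v1 and v3 are merged first (0.87), v4 joins them next (average 0.605), and v2 is added last.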

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Following the guide tree, add in the next most similar pair (seq-seq, seq-prof, or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores...

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences

2. Build a tree by joining similar sequences in order of their similarity

[Guide tree figure: leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]

3. Build the alignment in the order dictated by the tree

There are three cases:
- aligning a pair of sequences
- aligning two alignments
- adding a single sequence to an existing alignment

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each is reduced
• A special method sets the cost of inserting gaps

Disadvantages
• Because the construction is irreversible, the result depends heavily on the quality of the alignment of the first pairs
• Gaps never disappear: "Once a gap, always a gap"
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it

CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (projected pairwise alignment)

• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences

• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.

[Figure: best pairwise alignment (optimal) compared with the projected pairwise alignment]

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw - ClustalW at the SRS site at EBI

http://www.cs.tau.ac.il/~ulitskyi/cg/GBA/fasta.txt

ClustalW Output: Aln format

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast

MSA Editing: Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making an MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

[Figure: reaction scheme from ACV to isopenicillin N, labeled Fe+2, ascorbate, O2, and 2 H2O; active-site diagram showing Fe with His212, Asp214, His268, the sulfur of ACV, O2 (NO), and H2O]

Research: IPNS

• Goal: identify the Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation:
1. Obtain sequences (e.g., from NCBI): IPNS AND Bacteria[Organism]
2. MSA (ClustalW) and search for conserved residues in the MSA
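Searching an MSA for conserved residues can be as simple as listing the columns in which every sequence carries the same residue. The sketch below assumes strict conservation (no substitutions tolerated) and uses a made-up toy alignment, not real IPNS data; the function name is also an assumption.

```python
def conserved_columns(rows, gap="-"):
    """1-based indices of columns where every row has the same non-gap residue."""
    return [
        i
        for i, col in enumerate(zip(*rows), start=1)
        if len(set(col)) == 1 and col[0] != gap
    ]


# Hypothetical toy alignment, not real IPNS data.
rows = ["MHKD-A", "MHRDGA", "MHKDGS"]
print(conserved_columns(rows))  # [1, 2, 4]
```

When the aligned sequences are all very similar, almost every column passes this test, which is why the following slides look for more distant sequences.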

MSA - bacteria only

Not enough variation

MSA - bacteria & fungi

Not enough variation

[Figure labels: bacteria & fungi; bacteria]

Step 2

Goal: Add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW

• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to characterize the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (ClustalW)
2. Construct a consensus sequence and perform a new search, OR construct a profile and perform a new search
3. MSA (ClustalW) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column

• Consensus: each position reflects the most common character found at that position

Alignment:
A T C T T G T
A A C T T G T
A A C T T C T

Consensus:
A A C T T G T
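Deducing the consensus is a one-liner per column. A minimal sketch, assuming ties are broken in favor of the character that appears first in the column (the slide does not say how ties are handled):

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column (ties broken by first occurrence)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(rows))  # AACTTGT
```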

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters of the alignment at each column

• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Profile (positions 1-4):

     1     2     3     4
A    1     0.67  0     0
T    0     0.33  0     1
C    0     0     1     0
G    0     0     0     0

Profile vs Consensus

• The following multiple alignments have the same consensus

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus (both alignments):
A A C T T G T

• But they have different profiles

Profile of alignment 1 (positions 1-4):

     1     2     3     4
A    0.67  1     0     0
T    0     0     0     1
C    0.33  0     0.67  0
G    0     0     0.33  0

Profile of alignment 2 (positions 1-4):

     1     2     3     4
A    1     1     0     0
T    0     0     0     1
C    0     0     1     0
G    0     0     0     0

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

[Figure: sequence logo of the alignment above]

http://weblogo.berkeley.edu

Psi Blast

• Position Specific Iterated - automatic profile-like search

Regular BLAST -> Construct profile from BLAST results -> BLAST profile search -> Final results

• Alignment with distantly related proteins

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214, and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

Page 5: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Example

CG copy Ron Shamir 066

S1=AGGTC

S2=GTTCG

S3=TGAACPossible alignment

A-T

GGG

G--

TTA

-TA

CCC

-G-

Possible alignment

AG-

GTT

GTG

T-A

--A

CCA

-GC

CG copy Ron Shamir 068

Example

CG copy Ron Shamir 069

Multiple sequence alignment of 7 neuroglobins using clustalx

Human-centric beta globin Multiple Alignment

CG copy Ron Shamir 0610 httpglobincsepsuedu

MSA applicationsbull Generate protein familiesbull Extrapolation ndash membership of uncharacterized sequence to a

protein familybull Understand evolution - preliminary step in molecular evolution

analysis for constructing phylogenetic trees eg is the duck evolutionary closer to a lion or to a fruit fly

bull Pattern identification ndash find the important (conserved) region in the protein - conserved positions may characterize a function

bull Domain identification ndash Build a consensusprofilemotif that describe the protein family help to describe new members of the family

bull DNA regulatory elementsbull Structure prediction (secondary and 3D model)bull Alignment of multiple sequences may reveal weak signals

11

Protein Phylogenies ndash Example

CG copy Ron Shamir 0612

Kinase domain

Scoring alignments

bull Given input seqs S1 S2 hellip Sk find a multiple alignment of optimal score

bull Scores previewndash Sum of pairsndash Consensusndash Tree

bull Varying methods (and controversy)

CG copy Ron Shamir 0615

Sum of Pairs scoreDef Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example

x AC-GCGG-C y AC-GC-GAG z GCCGC-GAG

Induces

x ACGCGG-C x AC-GCGG-C y AC-GCGAGy ACGC-GAC z GCCGC-GAG z GCCGCGAG

CG copy Ron Shamir 0616

S(M) = kltl (Srsquok Srsquol)

SOP Score Example

CG copy Ron Shamir 0617

Consider the following alignment

AC-CDB--C-ADBDA-BCDAD

Scoring scheme match - 0mismatchindel - -1

SP score -3 -5 -4 =-12

Multiple Alignment with SOP scores is NP-hard

18

בתיאוריה כן באופן האם אפשר פשוט להכליל את שיטת התיכנות הדינמי פרקטי לא

מדובר בזמן ריצהובגודל זיכרון הגדלים

כאשר NKכ Nהוא אורך הרצף Kמספר הרצפים

למעשה בלתי אפשריKgt3 עבור

הבעיה החישובית בהינתן מספר רצפים מצא את ההתאמה שלהם למסגרת תקבל ערך אופטימלי שפונקצית מרחקמשותפת )עי הוספת רווחים( כך

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

SCGPFIRVMSCGPGLRASCTPHLAMSCPKIRGMSLPLLRNMSHKPALRA

Consensus MSA

bull Score ndashsum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

bull More difficult to finddefine as the consensus sequence itself is difficult to define

bull Used mainly for computational proofs

CG copy Ron Shamir 0619

20

-SCGPFIRVMSCGPGLRA-SCTPHL-A

-SCGPFIRVMSCGPGLRA

-SCGPFIRV-SCTPHL-A

MSCGPGLRA-SCTPHL-A

5 3 5 13

Scoring metrics -examplesSum of pairs

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

3 5 4 1 6 2 4 6 354סהכ הומוגניות

3 1 2 5 0 4 2 0 42 19סהכ מרחק

Distance from concensus

Tree MSA

bull Input Tree T a string for each leafbull Phylogenetic alignment for T Assignment of a string

to each internal node

bull Score ndash (weighted) sum of scores along edgesbull Goal find phyl alignment of optimal scorebull Consensus = phyl Alignment where T is a star

CG copy Ron Shamir 0621

CTGG

CCGG

GTTC

CTTG

GTTG

GTTG

CTGG

Profile Representation of MA

bull Alternatively use log oddsbull pi(a) = fraction of arsquos in col i bull p(a) = fraction of arsquos overallbull log pi(a)p(a)

CG copy Ron Shamir 0623

- A G G C T A T C A C C T G T A G ndash C T A C C A - - - G C A G ndash C T A C C A - - - G C A G ndash C T A T C A C ndash G G C A G ndash C T A T C G C ndash G G

A 1 1 8 C 6 1 4 1 6 2G 1 2 2 4 1T 2 1 6 2- 2 8 4 8 4

Aligning a sequence to a profile

bull Key in pairwise alignment is scoring two positions xy (xy)

bull For a letter x and a column y in a profile (xy)=value of x in col Y

bull Invent a score for (x-)bull Run the DP alg for pairwise alignment

CG copy Ron Shamir 0625

Aligning alignments

bull Given two alignments how can we align them

bull Hint use DP on the corresponding profiles

CG copy Ron Shamir 0626

x GGGCACTGCATy GGTTACGTC-- Alignment 1 z GGGAACTGCAG

w GGACGTACC-- Alignment 2v GGACCT-----

x GGGCACTGCATy GGTTACGTC--z GGGAACTGCAG

w GGACGTACC-- v GGACCT-----

Multiple Alignment Greedy Heuristic

bull Choose most similar pair of sequences and combine into a profile thereby reducing alignment of k sequences to an alignment of of k-1 sequencesprofiles Repeat

CG copy Ron Shamir 0627

u1= ACGTACGTACGThellip

u2 = TTAATTAATTAAhellip

u3 = ACTACTACTACThellip

hellip

uk = CCGGCCGGCCGG

u1= ACgtTACgtTACgcThellip

u2 = TTAATTAATTAAhellip

hellip

uk = CCGGCCGGCCGGhellip

k

k-1

ClustalW Thompson Higgins Gibson 94

bull Popular multiple alignment tool todaybull lsquoWrsquo = lsquoweightedrsquo (different parts of alignment

are weighted differently)bull Three-step process

1) Construct pairwise alignments2) Build Guide Tree3) Progressive alignment guided by the tree

CG copy Ron Shamir 0628

Step 1: Pairwise Alignment

• Aligns each sequence against each other, giving a similarity matrix

• Similarity = exact matches / sequence length (percent identity)

      v1   v2   v3   v4
v1     -
v2   .17    -
v3   .87  .28    -
v4   .59  .33  .62    -

(.17 means 17% identical)
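A minimal sketch of the similarity computation, under two assumptions: the rows passed in are already aligned to equal length (ClustalW aligns every pair separately; rows x, y, z from the "Aligning alignments" slide are used here only as stand-ins), and the denominator is the aligned length. The function names are illustrative.

```python
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Exact matches divided by aligned length; gap-gap columns are not matches."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return matches / len(a)


def similarity_matrix(rows):
    """Map each unordered pair of names to its percent identity."""
    return {
        frozenset((p, q)): percent_identity(rows[p], rows[q])
        for p, q in combinations(rows, 2)
    }


ROWS = {"x": "GGGCACTGCAT", "y": "GGTTACGTC--", "z": "GGGAACTGCAG"}
sim = similarity_matrix(ROWS)
for pair, value in sim.items():
    print(sorted(pair), round(value, 2))
```

Keying the matrix by `frozenset` pairs stores each value once, which mirrors the lower-triangular layout of the table.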

Step 2: Guide Tree

• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  • Selects the closest pair of sequences/subtrees
  • Combines them into a single subtree
  • Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont'd)

      v1   v2   v3   v4
v1     -
v2   .17    -
v3   .87  .28    -
v4   .59  .33  .62    -

[Figure: guide tree with leaves v1, v3, v4, v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
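The join order listed on this slide can be reproduced from the matrix with a simple greedy clustering. The sketch below uses average linkage, which is a simplification chosen for brevity and is not the neighbor-joining method itself; the matrix entries are read as fractions (.17 for 17%), and `guide_order` is an illustrative name.

```python
def guide_order(names, sim):
    """Return the sequence of joins chosen greedily by highest average similarity."""
    clusters = [(name,) for name in names]

    def average(c1, c2):
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    joins = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: average(clusters[p[0]], clusters[p[1]]),
        )
        joins.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return joins


SIM = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33,
    frozenset(("v3", "v4")): 0.62,
}

joins = guide_order(["v1", "v2", "v3", "v4"], SIM)
for left, right in joins:
    print(left, "+", right)
```

On this matrix the joins are v1 with v3, then v4, then v2, which matches the three "Calculate" steps.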

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the next most similar pair (seq-seq, seq-prof, or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences.

2. Build a tree by joining similar sequences in order of their similarity.

[Figure: tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]

3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one is reduced
• A special method sets the cost of inserting gaps

Disadvantages
• Because the construction is irreversible, the result depends very heavily on the quality of the alignment of the first pairs
• Gaps never disappear ("once a gap, always a gap")
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it

CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)

• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences

• CLUSTALW is not an optimal algorithm; better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one

[Figure: best (optimal) pairwise alignment compared with the projected pairwise alignment]

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output: Aln format

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, PSI-BLAST

MSA Editing: Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making an MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins – electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS, the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

[Figure: reaction scheme from ACV to isopenicillin N, labeled Fe+2, ascorbate, O2, and 2H2O; and an active-site diagram showing Fe with His268, His212, Asp214, S-ACV, O2 (NO), and H2O]

Research: IPNS

• Goal: identify the Fe+2-binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation:
1. Obtain sequences (e.g., from NCBI): IPNS AND Bacteria[Organism]
2. MSA (ClustalW) and search for conserved residues in the MSA

MSA – bacteria only

Not enough variation

MSA – bacteria & fungi

Not enough variation

[Figure labels: bacteria & fungi; bacteria]

Step 2

Goal: add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over the entire length
• Export in FASTA format
• Run CLUSTALW

• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to characterize the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (ClustalW)
2. Construct a consensus sequence and perform a new search,
   OR construct a profile and perform a new search
3. MSA (ClustalW) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column

• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T
-------------
A A C T T G T   (consensus)
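The consensus row of this example can be computed column by column. A minimal sketch: the function name is illustrative, and ties are broken in favor of the character seen first in the column, which is a choice made here and not something the slide specifies.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character of each column, joined into one string."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


ROWS = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(ROWS))
```

For the three rows on the slide this gives AACTTGT: column 2 is decided by two A's against one T, and column 6 by two G's against one C.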

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column

• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1    2    3    4    5    6    7
A    1  .67    0    0    0    0    0
T    0  .33    0    1    1    0    1
C    0    0    1    0    0  .33    0
G    0    0    0    0    0  .67    0

Profile vs Consensus

• The following multiple alignments have the same consensus:

Alignment 1        Alignment 2
A A C T T G C      A A C T T G T
A A G T C G T      A A C T T G T
C A C T T C T      A A C T T C T

Consensus (both): A A C T T G T

• But they have different profiles:

Alignment 1
     1    2    3    4    5    6    7
A  .67    1    0    0    0    0    0
T    0    0    0    1  .67    0  .67
C  .33    0  .67    0  .33  .33  .33
G    0    0  .33    0    0  .67    0

Alignment 2
     1    2    3    4    5    6    7
A    1    1    0    0    0    0    0
T    0    0    0    1    1    0    1
C    0    0    1    0    0  .33    0
G    0    0    0    0    0  .67    0
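A short check of the point made on these two slides, using the two three-row alignments shown. The helper names are illustrative, and frequencies are rounded to two decimals only to make them easy to read.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character per column (first seen wins ties)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def profile(rows):
    """One {character: frequency} dict per column, rounded to two decimals."""
    n = len(rows)
    return [
        {ch: round(count / n, 2) for ch, count in Counter(col).items()}
        for col in zip(*rows)
    ]


ALN1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
ALN2 = ["AACTTGT", "AACTTGT", "AACTTCT"]

same_consensus = consensus(ALN1) == consensus(ALN2)
same_profile = profile(ALN1) == profile(ALN2)
print(consensus(ALN1), consensus(ALN2))
print(same_consensus, same_profile)
```

Both alignments collapse to AACTTGT, but the first profile still records the C in column 1 and the G in column 3. This is why a profile search can pick up distant relatives that a consensus search would miss.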

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

[Figure: sequence logo for this alignment; letters recovered from the image: A gc c g G C T]

http://weblogo.berkeley.edu

PSI-BLAST

• Position-Specific Iterated BLAST: an automatic profile-like search

Regular BLAST
→ Construct a profile from the BLAST results
→ BLAST profile search
→ Final results

• Alignment with distantly related proteins

Isopenicillin N Synthase – IPNS

• Experimental evidence supports the finding that His212, Asp214, and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

Example

[Figure: multiple sequence alignment of 7 neuroglobins using ClustalX]

Human-centric beta globin Multiple Alignment

http://globin.cse.psu.edu

MSA applications

• Generate protein families
• Extrapolation – membership of an uncharacterized sequence in a protein family
• Understand evolution – a preliminary step in molecular evolution analysis for constructing phylogenetic trees (e.g., is the duck evolutionarily closer to a lion or to a fruit fly?)
• Pattern identification – find the important (conserved) regions in the protein; conserved positions may characterize a function
• Domain identification – build a consensus/profile/motif that describes the protein family and helps to describe new members of the family
• DNA regulatory elements
• Structure prediction (secondary and 3D model)
• Alignment of multiple sequences may reveal weak signals

Protein Phylogenies – Example

[Figure: protein phylogeny example, labeled "Kinase domain"]

Scoring alignments

• Given input seqs S1, S2, …, Sk, find a multiple alignment of optimal score

• Scores preview:
  – Sum of pairs
  – Consensus
  – Tree

• Varying methods (and controversy)

Sum of Pairs score

Def: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

$S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$
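An induced pairwise alignment is obtained by keeping two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch using the x, y, z example above (the function name is illustrative):

```python
def induced_pairwise(msa, a, b, gap="-"):
    """Project the multiple alignment onto rows a and b, dropping gap-gap columns."""
    kept = [(x, y) for x, y in zip(msa[a], msa[b]) if not (x == gap and y == gap)]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


MSA = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}

for a, b in (("x", "y"), ("x", "z"), ("y", "z")):
    print(a, b, induced_pairwise(MSA, a, b))
```

The pair (x, z) keeps all nine columns because no column has a gap in both rows, while (x, y) loses column 3 and (y, z) loses column 6.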

SOP Score Example

Consider the following alignment:

AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match = 0; mismatch/indel = -1

SP score: -3 - 5 - 4 = -12
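The SP score of this example can be checked by summing the scores of all induced pairs. In the sketch below, a column where both rows have a gap contributes 0; that convention is an assumption, but it is the one that reproduces the total of -12. The function names are illustrative.

```python
from itertools import combinations


def pair_score(a, b, match=0, other=-1, gap="-"):
    """Score two aligned rows: match = 0, mismatch or indel = -1, gap-gap ignored."""
    score = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # assumption: a gap-gap column costs nothing
        score += match if x == y else other
    return score


def sp_score(rows):
    """Sum of pair scores over all unordered pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


ROWS = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print([pair_score(a, b) for a, b in combinations(ROWS, 2)])
print(sp_score(ROWS))
```

The three pair scores are -3 (rows 1 and 2), -4 (rows 1 and 3) and -5 (rows 2 and 3), which sum to -12.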

Multiple Alignment with SOP scores is NP-hard

Can we simply generalize the dynamic programming method? In theory yes; in practice no.

Running time and memory grow as roughly N^K, where N is the sequence length and K is the number of sequences.

In practice this is infeasible for K > 3.

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.

Input sequences:
SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA

Aligned:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Consensus MSA

• Score – the sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

• More difficult to find/define, as the consensus sequence itself is difficult to define

• Used mainly for computational proofs

Scoring metrics - examples

Sum of pairs

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

-SCGPFIRV     -SCGPFIRV     MSCGPGLRA
MSCGPGLRA     -SCTPHL-A     -SCTPHL-A
    5             3             5         total: 13

Distance from consensus

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA
464261453   homogeneity per column (total 35)
202405213   distance per column (total 19)
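The per-column figures for the six-row example can be recomputed directly. The sketch assumes that "homogeneity" is the count of the most common character in a column and that the distance is the number of rows that differ from it; these definitions are inferred from the totals on the slide, and the function names are illustrative.

```python
from collections import Counter


def column_homogeneity(rows):
    """Count of the most common character in each column."""
    return [Counter(col).most_common(1)[0][1] for col in zip(*rows)]


def consensus_distance(rows):
    """Per-column number of rows that differ from the column's most common character."""
    return [len(rows) - h for h in column_homogeneity(rows)]


ROWS = [
    "-SCGPFIRV",
    "MSCGPGLRA",
    "-SCTPHL-A",
    "MSC-PKIRG",
    "MS-LPLLRN",
    "MSHKPALRA",
]

homogeneity = column_homogeneity(ROWS)
distance = consensus_distance(ROWS)
print(homogeneity, sum(homogeneity))
print(distance, sum(distance))
```

The two totals, 35 and 19, add up to 54, the number of cells in a 6 x 9 alignment.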

Tree MSA

• Input: a tree T and a string for each leaf
• Phylogenetic alignment for T: an assignment of a string to each internal node
• Score – the (weighted) sum of scores along edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star

[Figure: example tree with node strings CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]

Page 7: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Example

CG copy Ron Shamir 069

Multiple sequence alignment of 7 neuroglobins using clustalx

Human-centric beta globin Multiple Alignment

CG copy Ron Shamir 0610 httpglobincsepsuedu

MSA applicationsbull Generate protein familiesbull Extrapolation ndash membership of uncharacterized sequence to a

protein familybull Understand evolution - preliminary step in molecular evolution

analysis for constructing phylogenetic trees eg is the duck evolutionary closer to a lion or to a fruit fly

bull Pattern identification ndash find the important (conserved) region in the protein - conserved positions may characterize a function

bull Domain identification ndash Build a consensusprofilemotif that describe the protein family help to describe new members of the family

bull DNA regulatory elementsbull Structure prediction (secondary and 3D model)bull Alignment of multiple sequences may reveal weak signals

11

Protein Phylogenies ndash Example

CG copy Ron Shamir 0612

Kinase domain

Scoring alignments

bull Given input seqs S1 S2 hellip Sk find a multiple alignment of optimal score

bull Scores previewndash Sum of pairsndash Consensusndash Tree

bull Varying methods (and controversy)

CG copy Ron Shamir 0615

Sum of Pairs scoreDef Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example

x AC-GCGG-C y AC-GC-GAG z GCCGC-GAG

Induces

x ACGCGG-C x AC-GCGG-C y AC-GCGAGy ACGC-GAC z GCCGC-GAG z GCCGCGAG

CG copy Ron Shamir 0616

S(M) = kltl (Srsquok Srsquol)

SOP Score Example

CG copy Ron Shamir 0617

Consider the following alignment

AC-CDB--C-ADBDA-BCDAD

Scoring scheme match - 0mismatchindel - -1

SP score -3 -5 -4 =-12

Multiple Alignment with SOP scores is NP-hard

18

בתיאוריה כן באופן האם אפשר פשוט להכליל את שיטת התיכנות הדינמי פרקטי לא

מדובר בזמן ריצהובגודל זיכרון הגדלים

כאשר NKכ Nהוא אורך הרצף Kמספר הרצפים

למעשה בלתי אפשריKgt3 עבור

הבעיה החישובית בהינתן מספר רצפים מצא את ההתאמה שלהם למסגרת תקבל ערך אופטימלי שפונקצית מרחקמשותפת )עי הוספת רווחים( כך

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

SCGPFIRVMSCGPGLRASCTPHLAMSCPKIRGMSLPLLRNMSHKPALRA

Consensus MSA

bull Score ndashsum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

bull More difficult to finddefine as the consensus sequence itself is difficult to define

bull Used mainly for computational proofs

CG copy Ron Shamir 0619

20

-SCGPFIRVMSCGPGLRA-SCTPHL-A

-SCGPFIRVMSCGPGLRA

-SCGPFIRV-SCTPHL-A

MSCGPGLRA-SCTPHL-A

5 3 5 13

Scoring metrics -examplesSum of pairs

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

3 5 4 1 6 2 4 6 354סהכ הומוגניות

3 1 2 5 0 4 2 0 42 19סהכ מרחק

Distance from concensus

Tree MSA

bull Input Tree T a string for each leafbull Phylogenetic alignment for T Assignment of a string

to each internal node

bull Score ndash (weighted) sum of scores along edgesbull Goal find phyl alignment of optimal scorebull Consensus = phyl Alignment where T is a star

CG copy Ron Shamir 0621

CTGG

CCGG

GTTC

CTTG

GTTG

GTTG

CTGG

Profile Representation of MA

bull Alternatively use log oddsbull pi(a) = fraction of arsquos in col i bull p(a) = fraction of arsquos overallbull log pi(a)p(a)

CG copy Ron Shamir 0623

- A G G C T A T C A C C T G T A G ndash C T A C C A - - - G C A G ndash C T A C C A - - - G C A G ndash C T A T C A C ndash G G C A G ndash C T A T C G C ndash G G

A 1 1 8 C 6 1 4 1 6 2G 1 2 2 4 1T 2 1 6 2- 2 8 4 8 4

Aligning a sequence to a profile

bull Key in pairwise alignment is scoring two positions xy (xy)

bull For a letter x and a column y in a profile (xy)=value of x in col Y

bull Invent a score for (x-)bull Run the DP alg for pairwise alignment

CG copy Ron Shamir 0625

Aligning alignments

bull Given two alignments how can we align them

bull Hint use DP on the corresponding profiles

CG copy Ron Shamir 0626

x GGGCACTGCATy GGTTACGTC-- Alignment 1 z GGGAACTGCAG

w GGACGTACC-- Alignment 2v GGACCT-----

x GGGCACTGCATy GGTTACGTC--z GGGAACTGCAG

w GGACGTACC-- v GGACCT-----

Multiple Alignment Greedy Heuristic

bull Choose most similar pair of sequences and combine into a profile thereby reducing alignment of k sequences to an alignment of of k-1 sequencesprofiles Repeat

CG copy Ron Shamir 0627

u1= ACGTACGTACGThellip

u2 = TTAATTAATTAAhellip

u3 = ACTACTACTACThellip

hellip

uk = CCGGCCGGCCGG

u1= ACgtTACgtTACgcThellip

u2 = TTAATTAATTAAhellip

hellip

uk = CCGGCCGGCCGGhellip

k

k-1

ClustalW Thompson Higgins Gibson 94

bull Popular multiple alignment tool todaybull lsquoWrsquo = lsquoweightedrsquo (different parts of alignment

are weighted differently)bull Three-step process

1) Construct pairwise alignments2) Build Guide Tree3) Progressive alignment guided by the tree

CG copy Ron Shamir 0628

Step 1 Pairwise Alignment

bull Aligns each sequence against each other giving a similarity matrix

bull Similarity = exact matches sequence length (percent identity)

CG copy Ron Shamir 0629

v1 v2 v3 v4

v1 -v2 17 -v3 87 28 -v4 59 33 62 -

)17 means 17 identical(

Step 2 Guide Tree

bull Use the similarity method to create a Guide Tree by applying some clustering methodbull Guide tree roughly reflects evolutionary relationsbull ClustalW uses the neighbor-joining method which

iterativelybull Selects the closest pair of sequencessubtreesbull Combines them into a single subtreebull Re-computes the distances from the new subtree to all the other

sequencessubtrees

CG copy Ron Shamir 0630

Step 2 Guide Tree (contrsquod)

CG copy Ron Shamir 0631

v1

v3

v4 v2

Calculatev13 = alignment (v1 v3)v134 = alignment((v13)v4)v1234 = alignment((v134)v2)

v1 v2 v3 v4

v1 -v2 17 -v3 87 28 -v4 59 33 62 -

Step 3 Progressive Alignmentbull Start by aligning the two most similar sequencesbull Using the guide tree add in the most similar pair

(seq-seq seq-prof or prof-prof)bull Insert gaps as necessarybull Many ad-hoc rules weighting different matrices

special gap scoreshellip

CG copy Ron Shamir 0632

FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDFOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFDFOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFDFOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQFOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

33

בונים עץ עי חיבור הרצפים הדומים לפי סדר התאמתם2

Hbb Hbb Hba Hba Myg Human Horse Human Horse Whale

בונים את ההתאמה לפי הסדר המוכתב3 עי העץ

ישנם שלושה מצביםהתאמת זוג רצפים - התאמה בין התאמות-הוספת רצף בודד להתאמה קיימת-

בונים טבלה של מרחק עריכה בין כל שני רצפים1CLUSTALW algorithm

34

נקודות עדינותנותנים משקלות שונים לרצפים כך שאם יש מספר רצפים מאד דומים bull

המשקל היחסי של כל אחד יקטןשיטה מיוחדת לקביעת מחיר להכנסת רווחיםbull

חסרונותמאחר שהבניה היא בלתי הפיכה יש תלות גדולה מאד באיכות ההתאמה bull

של הזוגות הראשוניםOnce a Gap Always a gapרווחים אינם נעלמים bullרגישות רבה לקנס הרווחיםbull

יתרונותמהירbullסבירbullאפשר לקבוע הרבה פרמטריםbullכולם משתמשיםbull

CLUSTALW algorithm

CLUSTALW algorithmbull We can deduce a pairwise alignment for each two

sequences in the multiple alignment (projected pairwise alignment)

bull The projected pairwise alignment is NOT the best pairwise alignment for the two sequences

bull CLUTALW is not an optimal algorithm Better alignments might exist The algorithm yields a possible alignment but not necessarily the best one35

Best Pairwise alignment (optimal)

Projected Pairwise alignment

ClustalW at EMBL

36

httpwwwebiacukclustalw Clustalw at the SRS site at EBI

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output Aln format

37

MSA algorithms

bull Progressive methods (CLUSTALWT-Coffee) bull Iterative methods (Dialign)bull Direct optimization (Monte Carlo genetic

algorithms)bull Local methods eMotifs Blocks Psi-blast

38

MSA Editing Jalview

39

Conservation

wwwesembnetorgServicesMolBiojalviewindexhtml

MSA formats - fasta

40

MSA formats - Aln

41

MSA formats - MSF

42

Example 1a a good MSA

43

44

Example 1b making MSA of distantly related proteins

45

Example 1c including more distant relatives in the MSA

Example 2 Isopenicillin N Synthasebull Mononuclear iron proteins ndash electron carrier proteins

Iron atoms are bound to amino acid side chainsbull In IPNS the metal ion is coordinated by three protein

residues bull IPNS is involved in biosynthesis of penicillin

46

N

SN

OCOOHO

H

H

Me

Me

COOH

NH2

N

SN

OCOOH

NH2 Me

Me

COOHO

ACV

Isopenicillin N

Fe+2

Ascorbate

O2

2H2O Fe

N

NHis268

H

N

NHis212

H

SACV

O2 (NO)

H2O

OAsp214

O

Research IPNS

bull Goal Identify Fe+2 binding residuesbull Possible solutions

1 In the lab2 Bioinformatic approach (comparing different

IPNS sequences)

47

Step 1

Multiple alignment of known IPNS

Implementation1 Obtain sequence (eg for NCBI) IPNS AND Bacteria[Organism] 2 MSA (clustalw) and search for conserved residues in the MSA

48

MSA ndash bacteria only

49

Not enough variation

50

MSA ndash bacteria amp fungi

Not enough variation

bacteria amp fungi

bacteria

Step 2Goal Add more enzymes similar to IPNS

Implementationbull Search in httpwwwexpasyorgtoolsblastbull Blast IPNS_CEPAC as querybull Select sequences similar to the query in the entire lengthbull Export in FASTA formatbull Run CLUSTALW

51


• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• Closely related enzyme sequences are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (clustalw)
2. - Construct a consensus sequence and perform a new search
   OR
   - Construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.
• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T
-------------
A A C T T G T   (consensus)
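The consensus rule above can be sketched in a few lines of Python. This is a minimal illustration, not a tool from the slides: the function name `consensus` is made up here, and ties are broken by whichever character appears first in the column.

```python
from collections import Counter


def consensus(alignment):
    """Return the most frequent character of each column.

    `alignment` is a list of equal-length strings (rows of the MSA).
    Ties are resolved in favour of the character seen first in the column.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


# The three aligned sequences from the slide.
rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(rows))
```

For these three rows the result matches the consensus line shown on the slide.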

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column.
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6
A    1     0.67  0     0
T    0     0.33  0     1
C    0     0     1     0
G    0     0     0     0

Profile vs Consensus

• The following multiple alignments will have the same consensus

A A C T T G C      A A C T T G T
A A G T C G T      A A C T T G T
C A C T T C T      A A C T T C T

Consensus: A A C T T G T

Profile vs Consensus

• But have a different profile

A A C T T G C
A A G T C G T
C A C T T C T

     1     2     3     4     5     6
A    0.66  1     0     0
T    0     0     0     1
C    0.33  0     0.66  0
G    0     0     0.33  0

A A C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6
A    1     1     0     0
T    0     0     0     1
C    0     0     1     0
G    0     0     0     0
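To make the contrast concrete, the sketch below computes both the consensus and the per-column frequency profile for the two alignments on these slides. The helper names `consensus` and `profile` are illustrative only, and the profile here is a plain frequency table with no pseudocounts or weighting.

```python
from collections import Counter


def consensus(alignment):
    """Most frequent character in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


def profile(alignment, alphabet="ATCG"):
    """Per-column frequency of each character, one dict per column."""
    n_rows = len(alignment)
    columns = []
    for col in zip(*alignment):
        counts = Counter(col)
        columns.append({a: counts[a] / n_rows for a in alphabet})
    return columns


msa1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
msa2 = ["AACTTGT", "AACTTGT", "AACTTCT"]

same_consensus = consensus(msa1) == consensus(msa2)
same_profile = profile(msa1) == profile(msa2)
print("consensus:", consensus(msa1), consensus(msa2), "equal:", same_consensus)
print("profiles equal:", same_profile)
for a in "ATCG":
    # First four columns of the first alignment's profile.
    print(a, [round(col[a], 2) for col in profile(msa1)[:4]])
```

The profile keeps the minority characters (for example the C in column 1 of the first alignment) that the consensus throws away, which is why a profile search can be more sensitive than a consensus search.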

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

[Logo: A gc c g G C T]

http://weblogo.berkeley.edu

Psi Blast

• Position Specific Iterated - automatic profile-like search

Regular blast
  -> Construct profile from blast results
  -> Blast profile search
  -> Final results

• Alignment with distantly related proteins
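The loop in the diagram above (search, build a profile from the hits, search again with the profile) can be illustrated with a toy example. This is only a sketch of the iteration idea and not how PSI-BLAST is implemented: it assumes ungapped sequences of equal length, scores by average column frequency instead of BLAST statistics, and every name, threshold and pseudocount here is a hypothetical choice.

```python
from collections import Counter


def build_profile(seqs, alphabet, pseudocount=0.1):
    """Column frequencies with a small pseudocount so unseen letters score > 0."""
    denom = len(seqs) + pseudocount * len(alphabet)
    return [{a: (Counter(col)[a] + pseudocount) / denom for a in alphabet}
            for col in zip(*seqs)]


def profile_score(seq, prof):
    """Average frequency, under the profile, of the letters of `seq`."""
    return sum(col[c] for c, col in zip(seq, prof)) / len(prof)


def iterated_search(query, database, threshold=0.6, max_rounds=5):
    """Repeat: profile from current hits -> rescore database -> new hits."""
    alphabet = sorted(set(query).union(*database))
    hits = [query]
    for _ in range(max_rounds):
        prof = build_profile(hits, alphabet)
        new_hits = [query] + [s for s in database
                              if s != query and profile_score(s, prof) >= threshold]
        if new_hits == hits:  # converged: no change in the hit list
            break
        hits = new_hits
    return hits


database = ["ACGTAA", "ACGAAA", "TTTTTT"]
print(iterated_search("ACGTAC", database, max_rounds=1))
print(iterated_search("ACGTAC", database))
```

In this toy database the more distant sequence ACGAAA is missed by the first round and only picked up once the profile has been widened by the first hit, which is the effect the slide attributes to the iterated search.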

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Isopenicillin N Synthase

Enzyme      Relative Activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd


  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies - Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (cont'd)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA - bacteria only
  • MSA - bacteria & fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 8: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Human-centric beta globin Multiple Alignment

http://globin.cse.psu.edu

MSA applications

• Generate protein families
• Extrapolation - assign an uncharacterized sequence to a protein family
• Understand evolution - preliminary step in molecular evolution analysis for constructing phylogenetic trees, e.g. is the duck evolutionarily closer to a lion or to a fruit fly?
• Pattern identification - find the important (conserved) region in the protein; conserved positions may characterize a function
• Domain identification - build a consensus/profile/motif that describes the protein family and helps to describe new members of the family
• DNA regulatory elements
• Structure prediction (secondary and 3D model)
• Alignment of multiple sequences may reveal weak signals

Protein Phylogenies - Example

Kinase domain

Scoring alignments

• Given input seqs. S1, S2, …, Sk, find a multiple alignment of optimal score
• Scores preview:
  - Sum of pairs
  - Consensus
  - Tree
• Varying methods (and controversy)

Sum of Pairs score

Def: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

$S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$

where $\sigma(S'_k, S'_l)$ is the score of the pairwise alignment induced on rows $k$ and $l$.
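The induced (projected) pairwise alignment is obtained by taking two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch, with the function name `induced_pairwise` chosen here for illustration:

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of an MSA onto a pairwise alignment.

    Columns where both rows contain a gap carry no information for this
    pair, so they are removed; all other columns are kept unchanged.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


# The x, y, z example from the slide.
msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
for first, second in [("x", "y"), ("x", "z"), ("y", "z")]:
    a, b = induced_pairwise(msa[first], msa[second])
    print(first, a)
    print(second, b)
    print()
```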

SOP Score Example

Consider the following alignment:

AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match: 0; mismatch/indel: -1

SP score: -3 - 5 - 4 = -12
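The arithmetic of this example can be checked with a short script. It is a sketch under one stated assumption: a column where both rows have a gap contributes nothing, since that column does not appear in the induced pairwise alignment. The function names are illustrative.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=0, mismatch=-1, indel=-1, gap="-"):
    """Score of the pairwise alignment induced by two rows of an MSA."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # column is absent from the induced alignment
        if a == gap or b == gap:
            score += indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def sp_score(alignment, **scoring):
    """Sum-of-pairs score: sum of induced pairwise scores over all row pairs."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(alignment, 2))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
pair_scores = [pair_score(a, b) for a, b in combinations(rows, 2)]
print(pair_scores, sp_score(rows))
```

The pairs are visited in the order (1,2), (1,3), (2,3), so the individual scores come out in a different order than on the slide, but the total is the same.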

Multiple Alignment with SOP scores is NP-hard

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains an optimal value.

SCGPFIRV          -SCGPFIRV
MSCGPGLRA         MSCGPGLRA
SCTPHLA      ->   -SCTPHL-A
MSCPKIRG          MSC-PKIRG
MSLPLLRN          MS-LPLLRN
MSHKPALRA         MSHKPALRA

Can we simply generalize the dynamic programming method? In theory yes, in practice no.
Running time and memory grow roughly as N^K, where N is the sequence length and K is the number of sequences.
In practice this is infeasible for K > 3.
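A quick calculation shows why the generalized dynamic program is impractical. The sketch below only counts table cells, assuming the usual (N+1)-per-dimension table; the sequence length of 100 is an arbitrary example value.

```python
def dp_cells(n, k):
    """Cells in a k-dimensional DP table for k sequences of length n."""
    return (n + 1) ** k


for k in (2, 3, 4, 6):
    print(f"K={k}: {dp_cells(100, k):,} cells")
```

On top of the table size, each cell of a K-dimensional table has up to 2^K - 1 predecessor cells to examine, so the running time grows even faster than the memory.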

Consensus MSA

• Score - sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
• More difficult to find/define, as the consensus sequence itself is difficult to define
• Used mainly for computational proofs

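Scoring metrics - examples

Sum of pairs

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

-SCGPFIRV    -SCGPFIRV    MSCGPGLRA
MSCGPGLRA    -SCTPHL-A    -SCTPHL-A
    5            3            5         (matching positions per pair; total 13)

Distance from consensus

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

4 6 4 2 6 1 4 5 3    total homogeneity: 35 (rows sharing the most common character, per column)
2 0 2 4 0 5 2 1 3    total distance: 19 (rows differing from it, per column)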
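Both toy metrics from this slide are easy to recompute. The sketch below assumes that a gap counts as an ordinary character when finding the most common character of a column, and that a gap aligned with a gap is not counted as a match; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def pair_matches(row_a, row_b, gap="-"):
    """Columns where both rows carry the same residue (gap-gap not counted)."""
    return sum(a == b and a != gap for a, b in zip(row_a, row_b))


def column_homogeneity(alignment):
    """Count of the most common character in each column."""
    return [Counter(col).most_common(1)[0][1] for col in zip(*alignment)]


def consensus_distance(alignment):
    """Rows per column that differ from the column's most common character."""
    n_rows = len(alignment)
    return [n_rows - h for h in column_homogeneity(alignment)]


three = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
six = three + ["MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]

matches = [pair_matches(a, b) for a, b in combinations(three, 2)]
print("matches per pair:", matches, "total:", sum(matches))
print("homogeneity:", column_homogeneity(six), "total:", sum(column_homogeneity(six)))
print("distance:   ", consensus_distance(six), "total:", sum(consensus_distance(six)))
```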

Tree MSA

• Input: Tree T, a string for each leaf
• Phylogenetic alignment for T: assignment of a string to each internal node
• Score - (weighted) sum of scores along edges
• Goal: find phyl. alignment of optimal score
• Consensus = phyl. alignment where T is a star

[Figure: tree with node strings CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]

Profile Representation of MA

     -   A   G   G   C   T   A   T   C   A   C   C   T   G
     T   A   G   -   C   T   A   C   C   A   -   -   -   G
     C   A   G   -   C   T   A   C   C   A   -   -   -   G
     C   A   G   -   C   T   A   T   C   A   C   -   G   G
     C   A   G   -   C   T   A   T   C   G   C   -   G   G

     1   2   3   4   5   6   7   8   9   10  11  12  13  14
A    0   1   0   0   0   0   1   0   0   .8  0   0   0   0
C    .6  0   0   0   1   0   0   .4  1   0   .6  .2  0   0
G    0   0   1   .2  0   0   0   0   0   .2  0   0   .4  1
T    .2  0   0   0   0   1   0   .6  0   0   0   0   .2  0
-    .2  0   0   .8  0   0   0   0   0   0   .4  .8  .4  0

• Alternatively, use log odds:
  • p_i(a) = fraction of a's in col. i
  • p(a) = fraction of a's overall
  • log p_i(a)/p(a)
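The frequency table and the log-odds variant can both be computed directly from the alignment. In this sketch "fraction of a's overall" is taken over all cells of the alignment, gaps included, which is one possible reading of the slide; entries with p_i(a) = 0 are returned as None because the logarithm is undefined without pseudocounts. Function names are illustrative.

```python
import math
from collections import Counter

rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]


def column_profile(alignment, alphabet="ACGT-"):
    """p_i(a): fraction of character a in column i, one dict per column."""
    n_rows = len(alignment)
    prof = []
    for col in zip(*alignment):
        counts = Counter(col)
        prof.append({a: counts[a] / n_rows for a in alphabet})
    return prof


def log_odds(alignment, alphabet="ACGT"):
    """log(p_i(a) / p(a)) per column; None where p_i(a) is zero."""
    prof = column_profile(alignment, alphabet)
    totals = Counter("".join(alignment))
    n_cells = sum(len(row) for row in alignment)
    background = {a: totals[a] / n_cells for a in alphabet}
    return [
        {a: (math.log(col[a] / background[a]) if col[a] > 0 else None)
         for a in alphabet}
        for col in prof
    ]


prof = column_profile(rows)
for a in "ACGT-":
    print(a, [round(col[a], 1) for col in prof])
print(log_odds(rows)[1])
```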

Aligning a sequence to a profile

• Key in pairwise alignment is scoring two positions x, y: σ(x, y)
• For a letter x and a column y in a profile, σ(x, y) = value of x in col. y
• Invent a score for σ(x, -)
• Run the DP alg. for pairwise alignment
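A minimal sketch of this recipe follows: a global dynamic program in which a letter aligned to a profile column scores that letter's frequency in the column. The remaining scores are assumptions made for illustration: skipping a profile column scores the column's gap frequency, and a letter placed between columns costs a fixed `indel` of -0.5. Function names are hypothetical.

```python
from collections import Counter


def column_profile(alignment, alphabet="ACGT-"):
    """Per-column character frequencies, one dict per column."""
    n_rows = len(alignment)
    return [{a: Counter(col)[a] / n_rows for a in alphabet}
            for col in zip(*alignment)]


def align_to_profile(seq, profile, indel=-0.5):
    """Globally align `seq` to `profile`; returns (score, gapped sequence).

    Lowercase letters in the result mark residues inserted between columns.
    """
    n, m = len(seq), len(profile)
    # S[i][j]: best score for seq[:i] against the first j profile columns.
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + indel
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + profile[j - 1].get("-", 0.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0),  # letter on column
                S[i - 1][j] + indel,                                    # letter off-column
                S[i][j - 1] + profile[j - 1].get("-", 0.0),             # gap on column
            )
    # Traceback, recomputing the same expressions used in the fill step.
    i, j, out = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0):
            out.append(seq[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and S[i][j] == S[i][j - 1] + profile[j - 1].get("-", 0.0):
            out.append("-")
            j -= 1
        else:
            out.append(seq[i - 1].lower())
            i -= 1
    return S[n][m], "".join(reversed(out))


prof = column_profile(["ACG-T", "ACGGT", "AC--T"])
score, aligned = align_to_profile("ACGT", prof)
print(aligned, round(score, 3))
```

Because a profile column is just a vector of frequencies, the same recurrence also covers aligning two alignments: replace the single letter by a second profile column and score the pair of columns.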

Aligning alignments

• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles

x GGGCACTGCAT
y GGTTACGTC--    Alignment 1
z GGGAACTGCAG

w GGACGTACC--    Alignment 2
v GGACCT-----

Combined alignment:

x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----

Multiple Alignment Greedy Heuristic

• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.

k sequences:

u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG

k-1 sequences/profiles:

u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

ClustalW (Thompson, Higgins, Gibson '94)

• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build Guide Tree
  3) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

• Aligns each sequence against each other, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)
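The similarity measure can be sketched as follows. The example assumes the pairwise alignments have already been computed and are given as pairs of equal-length gapped strings; it divides by the alignment length, which is one of several conventions for "sequence length". The sequences and names below are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical non-gap positions divided by the alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(a == b and a != gap for a, b in zip(aligned_a, aligned_b))
    return matches / len(aligned_a)


def similarity_matrix(aligned_pairs):
    """Map {(name_a, name_b): (row_a, row_b)} to {(name_a, name_b): identity}."""
    return {names: percent_identity(*rows) for names, rows in aligned_pairs.items()}


# Hypothetical pairwise alignments, not the sequences behind the slide's matrix.
pairs = {
    ("v1", "v2"): ("ACGT-A", "ACCTTA"),
    ("v1", "v3"): ("ACGTA", "ACGTA"),
}
sim = similarity_matrix(pairs)
for names, value in sim.items():
    print(names, round(value, 2))
```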

Step 2: Guide Tree

• Use the similarity matrix to create a Guide Tree by applying some clustering method
• Guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  • Selects the closest pair of sequences/subtrees
  • Combines them into a single subtree
  • Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont'd)

[Figure: guide tree joining v1 and v3 first, then v4, then v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
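The join order on this slide can be reproduced with a small clustering sketch. Note the substitution: this uses simple average-linkage clustering on the similarity values rather than the neighbor-joining method that ClustalW uses, so it illustrates the select, combine, re-compute loop and nothing more. The function name and data layout are illustrative.

```python
def guide_order(names, sim):
    """Greedy average-linkage clustering on a similarity table.

    `sim` maps frozenset({a, b}) to the similarity of sequences a and b.
    Returns the list of merges, each a pair of clusters (tuples of names).
    """
    def avg_sim(c1, c2):
        return sum(sim[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Select the most similar pair of clusters.
        _, i, j = max(
            (avg_sim(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Combine them; similarities to the rest are re-computed on demand.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Similarity matrix from the slide.
sim = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33,
    frozenset(("v3", "v4")): 0.62,
}
merges = guide_order(["v1", "v2", "v3", "v4"], sim)
for left, right in merges:
    print(left, "+", right)
```

With the slide's matrix this joins v1 with v3, then adds v4, then v2, matching the order of the "Calculate" steps.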

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is
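The star markers can be illustrated with a short function that flags fully conserved columns. This sketch covers only the '*' case (every row has the same residue); the ':' and '.' markers in Clustal output depend on residue similarity groups, which are not modelled here. The function name and the toy rows are illustrative.

```python
def conservation_line(alignment, gap="-"):
    """Return '*' under columns where all rows share one residue, ' ' otherwise."""
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != gap else " "
        for col in zip(*alignment)
    )


rows = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
for row in rows:
    print(row)
print(conservation_line(rows))
```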

CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences
2. Build a tree by joining similar sequences in order of their similarity

   [Tree leaves: Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]

3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that if several sequences are very similar, the relative weight of each one decreases
• A special method sets the cost of inserting gaps

Disadvantages
• Since the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs
• "Once a gap, always a gap": gaps never disappear
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it

CLUSTALW algorithm

• We can deduce a pairwise alignment for each two sequences in the multiple alignment (projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.

[Figure: best pairwise alignment (optimal) vs. projected pairwise alignment]

Page 10: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Protein Phylogenies ndash Example

CG copy Ron Shamir 0612

Kinase domain

Scoring alignments

bull Given input seqs S1 S2 hellip Sk find a multiple alignment of optimal score

bull Scores previewndash Sum of pairsndash Consensusndash Tree

bull Varying methods (and controversy)

CG copy Ron Shamir 0615

Sum of Pairs scoreDef Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example

x AC-GCGG-C y AC-GC-GAG z GCCGC-GAG

Induces

x ACGCGG-C x AC-GCGG-C y AC-GCGAGy ACGC-GAC z GCCGC-GAG z GCCGCGAG

CG copy Ron Shamir 0616

S(M) = kltl (Srsquok Srsquol)

SOP Score Example

Consider the following alignment

AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match = 0; mismatch/indel = -1

SP score: -3 - 5 - 4 = -12
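The arithmetic of this example can be checked with a short sketch. The helper names `pair_score` and `sp_score` are illustrative, and the code assumes that a column with a gap in both rows scores 0, which is the convention that reproduces the slide's total.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=0, mismatch=-1, indel=-1):
    """Score one induced pairwise alignment column by column."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # assumption: a gap-gap column contributes nothing
        if a == "-" or b == "-":
            score += indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def sp_score(alignment, **scheme):
    """Sum-of-pairs score: add the scores of all pairs of rows."""
    return sum(pair_score(a, b, **scheme) for a, b in combinations(alignment, 2))


alignment = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(alignment))  # -12
```

The three pairwise terms are -3 (rows 1 and 2), -4 (rows 1 and 3), and -5 (rows 2 and 3).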

Multiple Alignment with SOP scores is NP-hard

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains an optimal value

SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Can we simply generalize the dynamic programming method? In theory yes, in practice no

The running time and the memory size grow as N^K, where N is the sequence length and K is the number of sequences

In practice this is infeasible for K > 3

Consensus MSA

• Score – sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

• More difficult to find/define, as the consensus sequence itself is difficult to define

• Used mainly for computational proofs

Scoring metrics – examples

Sum of pairs

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

-SCGPFIRV    -SCGPFIRV    MSCGPGLRA
MSCGPGLRA    -SCTPHL-A    -SCTPHL-A
    5            3            5         Total: 13

Distance from consensus

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

4 6 4 2 6 1 4 5 3    Total homogeneity: 35
2 0 2 4 0 5 2 1 3    Total distance: 19
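The per-column numbers of the distance-from-consensus example can be recomputed as follows. This is a sketch under stated assumptions: "homogeneity" is taken to be the count of the most common character in a column (gaps counted like any other character), and "distance" is the number of rows that differ from it. The function name `column_stats` is hypothetical.

```python
from collections import Counter


def column_stats(alignment):
    """Return per-column (homogeneity, distance) lists for an alignment."""
    homogeneity, distance = [], []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        homogeneity.append(top_count)
        distance.append(len(column) - top_count)
    return homogeneity, distance


rows = [
    "-SCGPFIRV",
    "MSCGPGLRA",
    "-SCTPHL-A",
    "MSC-PKIRG",
    "MS-LPLLRN",
    "MSHKPALRA",
]
hom, dist = column_stats(rows)
print(hom, sum(hom))    # [4, 6, 4, 2, 6, 1, 4, 5, 3] 35
print(dist, sum(dist))  # [2, 0, 2, 4, 0, 5, 2, 1, 3] 19
```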

Tree MSA

• Input: tree T, a string for each leaf
• Phylogenetic alignment for T: assignment of a string to each internal node
• Score – (weighted) sum of scores along edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star

[Figure: example tree with node strings CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]

Profile Representation of MA

-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G

     1    2    3    4    5    6    7    8    9    10   11   12   13   14
A    0    1    0    0    0    0    1    0    0    0.8  0    0    0    0
C    0.6  0    0    0    1    0    0    0.4  1    0    0.6  0.2  0    0
G    0    0    1    0.2  0    0    0    0    0    0.2  0    0    0.4  1
T    0.2  0    0    0    0    1    0    0.6  0    0    0    0    0.2  0
-    0.2  0    0    0.8  0    0    0    0    0    0    0.4  0.8  0.4  0

• Alternatively, use log odds:
• p_i(a) = fraction of a's in column i
• p(a) = fraction of a's overall
• log(p_i(a) / p(a))
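Both representations, column frequencies and log odds, can be derived from the alignment with a short sketch. The function names `profile` and `log_odds` are illustrative. Two choices are assumptions rather than something the slides specify: the gap character is treated as a fifth symbol, and a symbol absent from a column gets a log-odds value of minus infinity.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"


def profile(alignment):
    """p_i(a): fraction of character a in column i, for every column."""
    n_rows = len(alignment)
    columns = [Counter(col) for col in zip(*alignment)]
    return [{a: counts[a] / n_rows for a in ALPHABET} for counts in columns]


def log_odds(alignment):
    """log(p_i(a) / p(a)), where p(a) is the overall fraction of a."""
    overall = Counter("".join(alignment))
    total = sum(overall.values())
    background = {a: overall[a] / total for a in ALPHABET}
    return [
        {a: math.log(col[a] / background[a]) if col[a] > 0 else float("-inf")
         for a in ALPHABET}
        for col in profile(alignment)
    ]


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
prof = profile(rows)
print(prof[0]["C"], prof[3]["-"], prof[9]["A"])  # 0.6 0.8 0.8
```

A positive log-odds value marks a character that is enriched in its column relative to the whole alignment; for example, A in column 2 has p_i(a) = 1 against an overall fraction of 0.2.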

Aligning a sequence to a profile

• Key in pairwise alignment is scoring two positions x, y: σ(x, y)

• For a letter x and a column y in a profile, σ(x, y) = value of x in column y

• Invent a score for σ(x, -)
• Run the DP algorithm for pairwise alignment
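The recipe above can be sketched as a global (Needleman-Wunsch style) dynamic program in which one side is a profile. This is an illustration under assumptions, not the slides' implementation: the function name `align_to_profile` is hypothetical, the profile is a list of per-column dictionaries, and the "invented" gap score is one constant `gap` used both for a letter against a gap and for a profile column against a gap.

```python
def align_to_profile(seq, prof, gap=-1.0):
    """Best global alignment score of a sequence against a profile.

    sigma(x, y) is the value stored for letter x in profile column y;
    letters missing from a column score 0.
    """
    n, m = len(seq), len(prof)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # sequence letters against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # profile columns against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + prof[j - 1].get(seq[i - 1], 0.0),  # letter vs column
                dp[i - 1][j] + gap,                                   # letter vs gap
                dp[i][j - 1] + gap,                                   # column vs gap
            )
    return dp[n][m]


# Toy profile with three fully conserved columns (hypothetical data).
toy_profile = [{"A": 1.0}, {"C": 1.0}, {"G": 1.0}]
print(align_to_profile("ACG", toy_profile))  # 3.0
print(align_to_profile("AG", toy_profile))   # 1.0
```

In the second call the best alignment matches A and G to the first and third columns and pays one gap for the middle column.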

Aligning alignments

• Given two alignments, how can we align them?

• Hint: use DP on the corresponding profiles

Alignment 1
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2
w GGACGTACC--
v GGACCT-----

Combined alignment
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----

Multiple Alignment Greedy Heuristic

• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat

k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG…

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

ClustalW (Thompson, Higgins, Gibson '94)

• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build Guide Tree
  3) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

• Aligns each sequence against each other, giving a similarity matrix

• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)
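Percent identity and the resulting similarity matrix can be sketched as below. This is a simplified illustration: it assumes each pair is already aligned to equal length (ClustalW computes a pairwise alignment first), it divides by the aligned length, and it never counts a gap as a match. The names `percent_identity`, `similarity_matrix`, and the toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Fraction of aligned positions that hold the same residue."""
    if len(a) != len(b):
        raise ValueError("both rows must have the same aligned length")
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return same / len(a)


def similarity_matrix(aligned):
    """Map every unordered pair of names to its percent identity."""
    return {
        (p, q): percent_identity(aligned[p], aligned[q])
        for p, q in combinations(aligned, 2)
    }


# Toy, pre-aligned sequences (not taken from the slides).
toy = {"v1": "ACGTACGT", "v2": "ACGTTCGA", "v3": "ACG-ACGT"}
for pair, value in similarity_matrix(toy).items():
    print(pair, value)
# ('v1', 'v2') 0.75
# ('v1', 'v3') 0.875
# ('v2', 'v3') 0.625
```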

Step 2: Guide Tree

• Use the similarity matrix to create a Guide Tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  • Selects the closest pair of sequences/subtrees
  • Combines them into a single subtree
  • Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont'd)

[Figure: guide tree with leaves v1, v3, v4, v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -
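The merge order implied by this matrix can be reproduced with a small clustering sketch. Note the substitution: this is a simplified average-linkage (UPGMA-like) procedure working directly on similarities, not the neighbor-joining method that ClustalW uses. The function name `merge_order` is hypothetical.

```python
from itertools import combinations


def merge_order(names, sim):
    """Repeatedly merge the two most similar clusters; return the merges."""
    clusters = [(name,) for name in names]
    merges = []

    def average_similarity(a, b):
        # Mean similarity over all cross-cluster pairs of sequences.
        return sum(sim[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: average_similarity(*pair))
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append(merged)
    return merges


similarity = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33,
    frozenset(("v3", "v4")): 0.62,
}
order = merge_order(["v1", "v2", "v3", "v4"], similarity)
print(order)
# [('v1', 'v3'), ('v1', 'v3', 'v4'), ('v1', 'v2', 'v3', 'v4')]
```

On this matrix the simplified procedure yields the same order as the slide: v1 with v3 first, then v4, then v2.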

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof, or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

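CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences

2. Build a tree by joining the most similar sequences, in order of their similarity

[Figure: guide tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]

3. Build the alignment in the order dictated by the tree

There are three cases:
- aligning a pair of sequences
- aligning two alignments
- adding a single sequence to an existing alignment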

CLUSTALW algorithm

Subtle points
• Sequences are given different weights, so that if several sequences are very similar, the relative weight of each one is reduced
• A special method sets the cost of inserting gaps

Disadvantages
• Because the construction is irreversible, the result depends heavily on the quality of the alignment of the first pairs
• Gaps never disappear: "Once a gap, always a gap"
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everybody uses it

CLUSTALW algorithm

• We can deduce a pairwise alignment for each two sequences in the multiple alignment (projected pairwise alignment)

• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences

• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one

Best Pairwise alignment (optimal)

Projected Pairwise alignment

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw – ClustalW at the SRS site at EBI

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output Aln format

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast

MSA Editing Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins – electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in biosynthesis of penicillin

[Figure: reaction scheme from ACV to Isopenicillin N, labeled Fe+2, Ascorbate, O2, 2H2O; active-site Fe shown with His268, His212, Asp214, the ACV sulfur, O2 (NO), and H2O]

Research IPNS

• Goal: identify Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (ClustalW) and search for conserved residues in the MSA
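Once an MSA is available, the search for conserved residues can be as simple as listing the columns in which every sequence carries the same residue. The sketch below is illustrative only: the function name `conserved_columns` and the toy alignment are hypothetical, positions are reported 1-based, and a column made only of gaps is not counted as conserved.

```python
def conserved_columns(alignment):
    """Return (position, residue) for every fully conserved column."""
    hits = []
    for position, column in enumerate(zip(*alignment), start=1):
        if column[0] != "-" and len(set(column)) == 1:
            hits.append((position, column[0]))
    return hits


# Toy alignment (not real IPNS data).
toy_msa = ["MHGD-", "MHAD-", "MHCDK"]
print(conserved_columns(toy_msa))  # [(1, 'M'), (2, 'H'), (4, 'D')]
```

When the sequences are too similar, nearly every column passes this test, which is why more distant homologs are needed to narrow down the candidates.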

MSA – bacteria only

Not enough variation

MSA – bacteria & fungi

Not enough variation

bacteria & fungi

bacteria

Step 2

Goal: Add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as query
• Select sequences similar to the query along the entire length
• Export in FASTA format
• Run CLUSTALW

• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• Close enzymes' sequences are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (ClustalW)
2. Construct a consensus sequence and perform a new search
   OR construct a profile and perform a new search
3. MSA (ClustalW) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column

A T C T T G T
A A C T T G T
A A C T T C T

• Consensus: each position reflects the most common character found at a position

A A C T T G T

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters of the alignment at each column

• Profile: each position reflects the frequency of the character found at a position

A T C T T G T
A A C T T G T
A A C T T C T

    1     2     3     4     5     6
A   1     0.67  0     0
T   0     0.33  0     1
C   0     0     1     0
G   0     0     0     0

Profile vs Consensus

• The following multiple alignments will have the same consensus

A A C T T G C
A A G T C G T
C A C T T C T

A A C T T G T
A A C T T G T
A A C T T C T

Consensus: A A C T T G T

• But have a different profile

First alignment:
    1     2     3     4     5     6
A   0.66  1     0     0
T   0     0     0     1
C   0.33  0     0.66  0
G   0     0     0.33  0

Second alignment:
    1     2     3     4     5     6
A   1     1     0     0
T   0     0     0     1
C   0     0     1     0
G   0     0     0     0
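The difference between the two summaries can be checked directly: the two alignments collapse to one consensus string but keep distinct column frequencies. This is a minimal sketch with illustrative function names (`consensus`, `profile`); ties between equally frequent characters are broken by first appearance in the column, which is an arbitrary choice.

```python
from collections import Counter


def consensus(alignment):
    """Most frequent character of each column, joined into one string."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


def profile(alignment):
    """Per-column frequencies of the characters that occur in that column."""
    n_rows = len(alignment)
    return [
        {char: count / n_rows for char, count in Counter(col).items()}
        for col in zip(*alignment)
    ]


first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(first), consensus(second))   # AACTTGT AACTTGT
print(profile(first) == profile(second))     # False
print(profile(second)[0])                    # {'A': 1.0}
```

The first column shows the loss of information: the consensus records A for both alignments, while the profile of the first alignment still records that one sequence in three has C there.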

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

[Sequence logo image; letters read from the slide: A gc c g G C T]

http://weblogo.berkeley.edu

Psi Blast

• Position Specific Iterated – automatic profile-like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

• Alignment with distantly related proteins

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214, and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative Activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies ndash Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (contrsquod)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA ndash bacteria only
  • MSA ndash bacteria amp fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 12: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Sum of Pairs scoreDef Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example

x AC-GCGG-C y AC-GC-GAG z GCCGC-GAG

Induces

x ACGCGG-C x AC-GCGG-C y AC-GCGAGy ACGC-GAC z GCCGC-GAG z GCCGCGAG

CG copy Ron Shamir 0616

S(M) = kltl (Srsquok Srsquol)

SOP Score Example

CG copy Ron Shamir 0617

Consider the following alignment

AC-CDB--C-ADBDA-BCDAD

Scoring scheme match - 0mismatchindel - -1

SP score -3 -5 -4 =-12

Multiple Alignment with SOP scores is NP-hard

18

בתיאוריה כן באופן האם אפשר פשוט להכליל את שיטת התיכנות הדינמי פרקטי לא

מדובר בזמן ריצהובגודל זיכרון הגדלים

כאשר NKכ Nהוא אורך הרצף Kמספר הרצפים

למעשה בלתי אפשריKgt3 עבור

הבעיה החישובית בהינתן מספר רצפים מצא את ההתאמה שלהם למסגרת תקבל ערך אופטימלי שפונקצית מרחקמשותפת )עי הוספת רווחים( כך

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

SCGPFIRVMSCGPGLRASCTPHLAMSCPKIRGMSLPLLRNMSHKPALRA

Consensus MSA

bull Score ndashsum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

bull More difficult to finddefine as the consensus sequence itself is difficult to define

bull Used mainly for computational proofs

CG copy Ron Shamir 0619

20

-SCGPFIRVMSCGPGLRA-SCTPHL-A

-SCGPFIRVMSCGPGLRA

-SCGPFIRV-SCTPHL-A

MSCGPGLRA-SCTPHL-A

5 3 5 13

Scoring metrics -examplesSum of pairs

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

3 5 4 1 6 2 4 6 354סהכ הומוגניות

3 1 2 5 0 4 2 0 42 19סהכ מרחק

Distance from concensus

Tree MSA

bull Input Tree T a string for each leafbull Phylogenetic alignment for T Assignment of a string

to each internal node

bull Score ndash (weighted) sum of scores along edgesbull Goal find phyl alignment of optimal scorebull Consensus = phyl Alignment where T is a star

CG copy Ron Shamir 0621

CTGG

CCGG

GTTC

CTTG

GTTG

GTTG

CTGG

Profile Representation of MA

bull Alternatively use log oddsbull pi(a) = fraction of arsquos in col i bull p(a) = fraction of arsquos overallbull log pi(a)p(a)

CG copy Ron Shamir 0623

- A G G C T A T C A C C T G T A G ndash C T A C C A - - - G C A G ndash C T A C C A - - - G C A G ndash C T A T C A C ndash G G C A G ndash C T A T C G C ndash G G

A 1 1 8 C 6 1 4 1 6 2G 1 2 2 4 1T 2 1 6 2- 2 8 4 8 4

Aligning a sequence to a profile

bull Key in pairwise alignment is scoring two positions xy (xy)

bull For a letter x and a column y in a profile (xy)=value of x in col Y

bull Invent a score for (x-)bull Run the DP alg for pairwise alignment

CG copy Ron Shamir 0625

Aligning alignments

bull Given two alignments how can we align them

bull Hint use DP on the corresponding profiles

CG copy Ron Shamir 0626

x GGGCACTGCATy GGTTACGTC-- Alignment 1 z GGGAACTGCAG

w GGACGTACC-- Alignment 2v GGACCT-----

x GGGCACTGCATy GGTTACGTC--z GGGAACTGCAG

w GGACGTACC-- v GGACCT-----

Multiple Alignment Greedy Heuristic

bull Choose most similar pair of sequences and combine into a profile thereby reducing alignment of k sequences to an alignment of of k-1 sequencesprofiles Repeat

CG copy Ron Shamir 0627

u1= ACGTACGTACGThellip

u2 = TTAATTAATTAAhellip

u3 = ACTACTACTACThellip

hellip

uk = CCGGCCGGCCGG

u1= ACgtTACgtTACgcThellip

u2 = TTAATTAATTAAhellip

hellip

uk = CCGGCCGGCCGGhellip

k

k-1

ClustalW Thompson Higgins Gibson 94

bull Popular multiple alignment tool todaybull lsquoWrsquo = lsquoweightedrsquo (different parts of alignment

are weighted differently)bull Three-step process

1) Construct pairwise alignments2) Build Guide Tree3) Progressive alignment guided by the tree

CG copy Ron Shamir 0628

Step 1 Pairwise Alignment

bull Aligns each sequence against each other giving a similarity matrix

bull Similarity = exact matches sequence length (percent identity)

CG copy Ron Shamir 0629

v1 v2 v3 v4

v1 -v2 17 -v3 87 28 -v4 59 33 62 -

)17 means 17 identical(

Step 2 Guide Tree

bull Use the similarity method to create a Guide Tree by applying some clustering methodbull Guide tree roughly reflects evolutionary relationsbull ClustalW uses the neighbor-joining method which

iterativelybull Selects the closest pair of sequencessubtreesbull Combines them into a single subtreebull Re-computes the distances from the new subtree to all the other

sequencessubtrees

CG copy Ron Shamir 0630

Step 2 Guide Tree (contrsquod)

CG copy Ron Shamir 0631

v1

v3

v4 v2

Calculatev13 = alignment (v1 v3)v134 = alignment((v13)v4)v1234 = alignment((v134)v2)

v1 v2 v3 v4

v1 -v2 17 -v3 87 28 -v4 59 33 62 -

Step 3 Progressive Alignmentbull Start by aligning the two most similar sequencesbull Using the guide tree add in the most similar pair

(seq-seq seq-prof or prof-prof)bull Insert gaps as necessarybull Many ad-hoc rules weighting different matrices

special gap scoreshellip

CG copy Ron Shamir 0632

FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDFOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFDFOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFDFOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQFOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

33

בונים עץ עי חיבור הרצפים הדומים לפי סדר התאמתם2

Hbb Hbb Hba Hba Myg Human Horse Human Horse Whale

בונים את ההתאמה לפי הסדר המוכתב3 עי העץ

ישנם שלושה מצביםהתאמת זוג רצפים - התאמה בין התאמות-הוספת רצף בודד להתאמה קיימת-

בונים טבלה של מרחק עריכה בין כל שני רצפים1CLUSTALW algorithm

34

נקודות עדינותנותנים משקלות שונים לרצפים כך שאם יש מספר רצפים מאד דומים bull

המשקל היחסי של כל אחד יקטןשיטה מיוחדת לקביעת מחיר להכנסת רווחיםbull

חסרונותמאחר שהבניה היא בלתי הפיכה יש תלות גדולה מאד באיכות ההתאמה bull

של הזוגות הראשוניםOnce a Gap Always a gapרווחים אינם נעלמים bullרגישות רבה לקנס הרווחיםbull

יתרונותמהירbullסבירbullאפשר לקבוע הרבה פרמטריםbullכולם משתמשיםbull

CLUSTALW algorithm

CLUSTALW algorithmbull We can deduce a pairwise alignment for each two

sequences in the multiple alignment (projected pairwise alignment)

bull The projected pairwise alignment is NOT the best pairwise alignment for the two sequences

bull CLUTALW is not an optimal algorithm Better alignments might exist The algorithm yields a possible alignment but not necessarily the best one35

Best Pairwise alignment (optimal)

Projected Pairwise alignment

ClustalW at EMBL

36

httpwwwebiacukclustalw Clustalw at the SRS site at EBI

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output Aln format

37

MSA algorithms

bull Progressive methods (CLUSTALWT-Coffee) bull Iterative methods (Dialign)bull Direct optimization (Monte Carlo genetic

algorithms)bull Local methods eMotifs Blocks Psi-blast

38

MSA Editing Jalview

39

Conservation

wwwesembnetorgServicesMolBiojalviewindexhtml

MSA formats - fasta

40

MSA formats - Aln

41

MSA formats - MSF

42

Example 1a a good MSA

43

44

Example 1b making MSA of distantly related proteins

45

Example 1c including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins – electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in biosynthesis of penicillin

[Figure: reaction scheme, ACV → Isopenicillin N (Fe+2, ascorbate, O2, 2H2O), and the Fe coordination site showing His212, Asp214, His268, S-ACV, O2 (NO) and H2O]

Research IPNS

• Goal: Identify Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA

MSA – bacteria only

Not enough variation

MSA – bacteria & fungi

Not enough variation

bacteria & fungi

bacteria

Step 2

Goal: Add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW


• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (clustalw)
2. Construct a consensus sequence and perform a new search,
   OR construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column

• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Consensus:
A A C T T G T

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column

• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

    1     2     3     4     5     6
A   1     0.67  0     0
T   0     0.33  0     1
C   0     0     1     0
G   0     0     0     0

Profile vs Consensus

• The following multiple alignments will have the same consensus

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus of both:
A A C T T G T

Profile vs Consensus

• But have a different profile

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

    1     2     3     4     5     6
A   0.66  1     0     0
T   0     0     0     1
C   0.33  0     0.66  0
G   0     0     0.33  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

    1     2     3     4     5     6
A   1     1     0     0
T   0     0     0     1
C   0     0     1     0
G   0     0     0     0
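The difference between a consensus and a profile can be checked directly on the two alignments of this slide. The sketch below is one straightforward way to compute both; the function names are hypothetical, and ties in a column are broken by whichever character appears first (an assumption, since the slides do not say how ties are resolved).

```python
from collections import Counter


def consensus(alignment: list[str]) -> str:
    """Most frequent character in each column (first-seen character wins ties)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


def profile(alignment: list[str]) -> list[dict[str, float]]:
    """Per-column character frequencies; absent characters have frequency 0."""
    n = len(alignment)
    return [{ch: count / n for ch, count in Counter(col).items()}
            for col in zip(*alignment)]


aln1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
aln2 = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(aln1), consensus(aln2))  # AACTTGT AACTTGT
print(profile(aln1)[0])                  # column 1: A in 2 of 3 rows, C in 1 of 3
print(profile(aln2)[0])                  # column 1: A in every row
```

Both alignments collapse to the same consensus string, while their profiles differ in columns 1, 3, 5 and 7, which is exactly the information a consensus throws away.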

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

Logo: A gc c g G C T

http://weblogo.berkeley.edu

Psi Blast

• Position Specific Iterated - automatic profile-like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

• Alignment with distantly related proteins

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative Activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

– IPNS

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies – Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (cont'd)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA – bacteria only
  • MSA – bacteria & fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 13: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

SOP Score Example

Consider the following alignment:

AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match = 0, mismatch/indel = -1

SP score: -3 - 5 - 4 = -12
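The sum-of-pairs score of this example can be reproduced with a few lines of code. The sketch below uses the slide's scoring scheme (match 0, mismatch and indel -1) and assumes that a column with a gap in both rows of a pair contributes 0, which is the convention that yields the slide's total; `pair_score` and `sp_score` are hypothetical names.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 0, mismatch: int = -1, indel: int = -1) -> int:
    """Score one projected pair of rows, column by column."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue                  # assumed: gap against gap costs nothing
        if x == "-" or y == "-":
            score += indel
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def sp_score(alignment: list[str]) -> int:
    """Sum of the pair scores over all unordered pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))


alignment = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
for a, b in combinations(alignment, 2):
    print(a, b, pair_score(a, b))
print(sp_score(alignment))  # -12
```

The three pairs score -3, -4 and -5, which add up to the -12 shown on the slide.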

Multiple Alignment with SOP scores is NP-hard

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.

SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Can we simply generalize the dynamic programming method? In theory yes, in practice no.

Running time and memory both grow as roughly N^K, where N is the sequence length and K is the number of sequences.

In practice this is infeasible for K > 3.

Consensus MSA

• Score – sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

• More difficult to find/define, as the consensus sequence itself is difficult to define

• Used mainly for computational proofs

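Scoring metrics - examples

Sum of pairs

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A

-SCGPFIRV
MSCGPGLRA    5 identical positions

-SCGPFIRV
-SCTPHL-A    3 identical positions

MSCGPGLRA
-SCTPHL-A    5 identical positions

Total: 5 + 3 + 5 = 13

Distance from consensus

-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Homogeneity per column: 4 6 4 2 6 1 4 5 3   Total homogeneity: 35
Distance per column:    2 0 2 4 0 5 2 1 3   Total distance: 19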
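Both scoring metrics of this slide can be recomputed from the alignments shown. The sketch below assumes that the sum-of-pairs variant here counts identical non-gap positions per pair, and that homogeneity is the count of the most frequent symbol in a column, with distance being the number of rows that differ from it. These readings reproduce the slide's totals but are inferred from the numbers rather than stated in the slides; all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def identical_positions(a: str, b: str) -> int:
    """Number of columns where both rows carry the same non-gap character."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def sum_of_pairs_identity(alignment: list[str]) -> int:
    return sum(identical_positions(a, b) for a, b in combinations(alignment, 2))


def column_homogeneity(alignment: list[str]) -> list[int]:
    """Count of the most frequent symbol in each column."""
    return [Counter(col).most_common(1)[0][1] for col in zip(*alignment)]


def distance_from_consensus(alignment: list[str]) -> list[int]:
    """Rows per column that differ from the most frequent symbol."""
    n = len(alignment)
    return [n - h for h in column_homogeneity(alignment)]


three = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
six = three + ["MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]

print(sum_of_pairs_identity(three))       # 13
print(sum(column_homogeneity(six)))       # 35
print(sum(distance_from_consensus(six)))  # 19
```

Homogeneity and distance always add up to the number of rows in each column, so the two totals here sum to 6 × 9 = 54.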

Tree MSA

• Input: Tree T, a string for each leaf
• Phylogenetic alignment for T: assignment of a string to each internal node
• Score – (weighted) sum of scores along edges
• Goal: find phyl. alignment of optimal score
• Consensus = phyl. alignment where T is a star

[Figure: example tree with node strings CTGG, CCGG, GTTC, CTTG, GTTG, GTTG, CTGG]

Profile Representation of MA

-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G

     1    2    3    4    5    6    7    8    9    10   11   12   13   14
A    0    1    0    0    0    0    1    0    0    0.8  0    0    0    0
C    0.6  0    0    0    1    0    0    0.4  1    0    0.6  0.2  0    0
G    0    0    1    0.2  0    0    0    0    0    0.2  0    0    0.4  1
T    0.2  0    0    0    0    1    0    0.6  0    0    0    0    0.2  0
-    0.2  0    0    0.8  0    0    0    0    0    0    0.4  0.8  0.4  0

• Alternatively, use log odds:
• pi(a) = fraction of a's in col. i
• p(a) = fraction of a's overall
• log( pi(a) / p(a) )
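The frequency profile and its log-odds variant can be computed directly from the five-row alignment of this slide. The sketch below assumes the gap symbol is counted like any other character, both per column and in the overall background p(a), and uses the natural logarithm; the slides do not fix either choice. Function names are hypothetical.

```python
import math
from collections import Counter

alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]


def frequency_profile(aln: list[str]) -> list[dict[str, float]]:
    """p_i(a): fraction of each symbol in column i (gaps counted as a symbol)."""
    n = len(aln)
    return [{ch: count / n for ch, count in Counter(col).items()}
            for col in zip(*aln)]


def log_odds_profile(aln: list[str]) -> list[dict[str, float]]:
    """log(p_i(a) / p(a)) for every symbol that occurs in column i."""
    totals = Counter("".join(aln))
    size = sum(totals.values())
    background = {ch: count / size for ch, count in totals.items()}  # p(a)
    return [{ch: math.log(p / background[ch]) for ch, p in column.items()}
            for column in frequency_profile(aln)]


freqs = frequency_profile(alignment)
odds = log_odds_profile(alignment)
print(freqs[0])   # column 1: C 0.6, T 0.2, gap 0.2
print(odds[1])    # column 2: A only, log(1 / p(A))
```

Symbols that never occur in a column are simply absent from that column's dictionary; a practical scorer would usually add pseudocounts so that their log odds stay finite.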

Aligning a sequence to a profile

• Key in pairwise alignment is scoring two positions x, y: σ(x,y)

• For a letter x and a column y in a profile, σ(x,y) = value of x in col. y

• Invent a score for σ(x,-)
• Run the DP alg. for pairwise alignment
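The recipe above amounts to ordinary global alignment by dynamic programming, with the profile column's value for a letter taking the place of the substitution score. The sketch below is a minimal version under stated assumptions: the profile is a list of per-column frequency dictionaries, a single gap penalty of -1 is used both for skipping a letter and for skipping a column, and only the best score is returned. The name `align_to_profile` and the penalty value are hypothetical.

```python
def align_to_profile(seq: str, profile: list[dict[str, float]], gap: float = -1.0) -> float:
    """Best global alignment score of seq against the profile columns."""
    n, m = len(seq), len(profile)
    # dp[i][j]: best score aligning seq[:i] with the first j profile columns
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # score of letter x against column y = value of x in column y
            letter_vs_column = profile[j - 1].get(seq[i - 1], 0.0)
            dp[i][j] = max(
                dp[i - 1][j - 1] + letter_vs_column,  # align letter with column
                dp[i - 1][j] + gap,                   # letter against a gap
                dp[i][j - 1] + gap,                   # column against a gap
            )
    return dp[n][m]


# Toy profile with four columns over the DNA alphabet.
toy_profile = [{"A": 1.0}, {"A": 0.67, "T": 0.33}, {"C": 1.0}, {"T": 1.0}]
print(align_to_profile("AACT", toy_profile))
print(align_to_profile("ACT", toy_profile))
```

Aligning two alignments works the same way once both are summarized as profiles; only the column-against-column score needs to be defined.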

Aligning alignments

• Given two alignments, how can we align them?

• Hint: use DP on the corresponding profiles

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2:
w GGACGTACC--
v GGACCT-----

Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----

Multiple Alignment: Greedy Heuristic

• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat

k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

ClustalW (Thompson, Higgins, Gibson '94)

• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build Guide Tree
  3) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

• Aligns each sequence against each other, giving a similarity matrix

• Similarity = exact matches / sequence length (percent identity)

     v1    v2    v3    v4
v1   -
v2   .17   -
v3   .87   .28   -
v4   .59   .33   .62   -

(.17 means 17% identical)
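A similarity matrix of this kind can be sketched in a few lines. For brevity the example assumes the sequences have already been pairwise aligned to equal length, and divides the number of exact matches by the aligned length; ClustalW itself performs the pairwise alignments first and has its own definition of the denominator. The sequences `v1`..`v3` below are made-up toy data, not the ones behind the slide's matrix.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Exact non-gap matches divided by the aligned length (an assumed convention)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def similarity_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Percent identity for every unordered pair of named sequences."""
    return {(x, y): percent_identity(seqs[x], seqs[y])
            for x, y in combinations(seqs, 2)}


# Hypothetical, already-aligned toy sequences.
toy = {"v1": "ACGTACGT", "v2": "ACGTTCGT", "v3": "TTGTACGA"}
for pair, identity in similarity_matrix(toy).items():
    print(pair, round(identity, 3))
```

Only the lower (or upper) triangle needs to be stored, since the similarity of a pair does not depend on the order of its members.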

Step 2: Guide Tree

• Use the similarity matrix to create a Guide Tree by applying some clustering method
• Guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  • Selects the closest pair of sequences/subtrees
  • Combines them into a single subtree
  • Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont'd)

Tree leaves: v1, v3, v4, v2

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

     v1    v2    v3    v4
v1   -
v2   .17   -
v3   .87   .28   -
v4   .59   .33   .62   -
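The merge order on this slide can be reproduced with a simple clustering loop over the similarity matrix. The sketch below is not neighbor joining: it swaps in plain average linkage (repeatedly merge the two clusters with the highest mean pairwise similarity) because it is short and happens to give the same order for this matrix. The function name and the data layout are hypothetical.

```python
def guide_tree_order(names: list[str], sim: dict[frozenset, float]) -> list[tuple[tuple, tuple]]:
    """Return the sequence of cluster merges under average-linkage clustering."""
    clusters = [(name,) for name in names]
    merges = []

    def avg_similarity(c1: tuple, c2: tuple) -> float:
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # pick the pair of clusters with the highest average similarity
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Similarity values from the matrix above, written as fractions.
pairs = {("v1", "v2"): 0.17, ("v1", "v3"): 0.87, ("v2", "v3"): 0.28,
         ("v1", "v4"): 0.59, ("v2", "v4"): 0.33, ("v3", "v4"): 0.62}
sim = {frozenset(p): s for p, s in pairs.items()}

merges = guide_tree_order(["v1", "v2", "v3", "v4"], sim)
for left, right in merges:
    print(left, "+", right)
```

For this matrix the loop joins v1 with v3 first, then adds v4, and finally v2, which is the order of the three alignments listed on the slide.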

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof, or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

Page 15: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Consensus MSA

bull Score ndashsum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence

bull More difficult to finddefine as the consensus sequence itself is difficult to define

bull Used mainly for computational proofs

CG copy Ron Shamir 0619

20

-SCGPFIRVMSCGPGLRA-SCTPHL-A

-SCGPFIRVMSCGPGLRA

-SCGPFIRV-SCTPHL-A

MSCGPGLRA-SCTPHL-A

5 3 5 13

Scoring metrics -examplesSum of pairs

-SCGPFIRVMSCGPGLRA-SCTPHL-AMSC-PKIRGMS-LPLLRNMSHKPALRA

3 5 4 1 6 2 4 6 354סהכ הומוגניות

3 1 2 5 0 4 2 0 42 19סהכ מרחק

Distance from concensus

Tree MSA

bull Input Tree T a string for each leafbull Phylogenetic alignment for T Assignment of a string

to each internal node

bull Score ndash (weighted) sum of scores along edgesbull Goal find phyl alignment of optimal scorebull Consensus = phyl Alignment where T is a star

CG copy Ron Shamir 0621

CTGG

CCGG

GTTC

CTTG

GTTG

GTTG

CTGG

Profile Representation of MA

bull Alternatively use log oddsbull pi(a) = fraction of arsquos in col i bull p(a) = fraction of arsquos overallbull log pi(a)p(a)

CG copy Ron Shamir 0623

- A G G C T A T C A C C T G T A G ndash C T A C C A - - - G C A G ndash C T A C C A - - - G C A G ndash C T A T C A C ndash G G C A G ndash C T A T C G C ndash G G

A 1 1 8 C 6 1 4 1 6 2G 1 2 2 4 1T 2 1 6 2- 2 8 4 8 4

Aligning a sequence to a profile

bull Key in pairwise alignment is scoring two positions xy (xy)

bull For a letter x and a column y in a profile (xy)=value of x in col Y

bull Invent a score for (x-)bull Run the DP alg for pairwise alignment

CG copy Ron Shamir 0625

Aligning alignments

bull Given two alignments how can we align them

bull Hint use DP on the corresponding profiles

CG copy Ron Shamir 0626

x GGGCACTGCATy GGTTACGTC-- Alignment 1 z GGGAACTGCAG

w GGACGTACC-- Alignment 2v GGACCT-----

x GGGCACTGCATy GGTTACGTC--z GGGAACTGCAG

w GGACGTACC-- v GGACCT-----

Multiple Alignment Greedy Heuristic

bull Choose most similar pair of sequences and combine into a profile thereby reducing alignment of k sequences to an alignment of of k-1 sequencesprofiles Repeat

CG copy Ron Shamir 0627

u1= ACGTACGTACGThellip

u2 = TTAATTAATTAAhellip

u3 = ACTACTACTACThellip

hellip

uk = CCGGCCGGCCGG

u1= ACgtTACgtTACgcThellip

u2 = TTAATTAATTAAhellip

hellip

uk = CCGGCCGGCCGGhellip

k

k-1

ClustalW (Thompson, Higgins, Gibson '94)

• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:
  1) Construct pairwise alignments
  2) Build a guide tree
  3) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

• Aligns each sequence against every other, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)
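Percent identity and the lower-triangular similarity matrix can be sketched as below. The sketch assumes the two sequences are already aligned to the same length, that the denominator is the aligned length, and that a gap never counts as a match; ClustalW's own definition may differ in these details. The three rows are illustrative, not the v1..v4 of the matrix.

```python
def percent_identity(a, b):
    """Identical aligned residues divided by the aligned length."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return matches / len(a)

def similarity_matrix(rows):
    """Lower-triangular matrix: entry [i][j] (j < i) is the identity of rows i and j."""
    return [[percent_identity(rows[i], rows[j]) for j in range(i)]
            for i in range(len(rows))]

rows = ["GGGCACTGCAT", "GGTTACGTC--", "GGGAACTGCAG"]
matrix = similarity_matrix(rows)
for line in matrix:
    print([round(v, 2) for v in line])
```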

Step 2: Guide Tree

• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  • Selects the closest pair of sequences/subtrees
  • Combines them into a single subtree
  • Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont'd)

[Figure: guide tree joining v1 and v3 first, then v4, then v2]

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
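The joining order can be reproduced from the similarity matrix with a simple agglomerative sketch. This is not neighbor-joining: it is a deliberately simpler average-linkage clustering, chosen only to show the select, combine, recompute loop, and the similarities are read as .17, .87 and so on.

```python
from itertools import combinations

def guide_tree(names, sim):
    """Repeatedly join the most similar pair of clusters; return the nested tree."""
    def s(a, b):
        return sim.get((a, b), sim.get((b, a)))

    clusters = [(n, [n]) for n in names]          # (subtree, its leaves)
    while len(clusters) > 1:
        def avg(pair):
            (_, la), (_, lb) = pair
            # Average similarity over all leaf pairs of the two clusters
            return sum(s(a, b) for a in la for b in lb) / (len(la) * len(lb))
        x, y = max(combinations(clusters, 2), key=avg)
        clusters = [c for c in clusters if c is not x and c is not y]
        clusters.append(((x[0], y[0]), x[1] + y[1]))
    return clusters[0][0]

sim = {("v1", "v2"): .17, ("v1", "v3"): .87, ("v2", "v3"): .28,
       ("v1", "v4"): .59, ("v2", "v4"): .33, ("v3", "v4"): .62}
tree = guide_tree(["v1", "v2", "v3", "v4"], sim)
print(tree)
```

Reading the nested tuple from the inside out gives the alignment order: v1 with v3, then v4, then v2.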

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the next most similar pair (seq-seq, seq-prof, or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

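CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences.
2. Build a tree by joining similar sequences, in order of their similarity.
   [Figure: guide tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment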

CLUSTALW algorithm

Fine points
• Sequences receive different weights, so that when several sequences are very similar the relative weight of each one is reduced
• A special method sets the cost of inserting gaps

Disadvantages
• Because the construction is irreversible, the result depends heavily on the quality of the first pairwise alignments
• Gaps never disappear ("once a gap, always a gap")
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it

CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm: better alignments might exist. It yields a possible alignment, but not necessarily the best one

[Figure: best (optimal) pairwise alignment compared with the projected pairwise alignment]
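A projected pairwise alignment can be read off a multiple alignment by keeping two rows and dropping the columns in which both have a gap. A minimal sketch; the function name is illustrative and the five rows are the x, y, z, w, v example used earlier in these slides.

```python
def project(msa, i, j):
    """Rows i and j of an MSA, with the columns that are gaps in both rows removed."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j]) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

msa = ["GGGCACTGCAT",   # x
       "GGTTACGTC--",   # y
       "GGGAACTGCAG",   # z
       "GGACGTACC--",   # w
       "GGACCT-----"]   # v

w_proj, v_proj = project(msa, 3, 4)
print(w_proj)
print(v_proj)
```

The projection is a valid alignment of the two sequences, but nothing guarantees it scores as well as aligning them directly.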

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output: Aln format

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, PSI-BLAST

MSA Editing: Jalview

[Figure: Jalview alignment view with a conservation track]

www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - FASTA

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making an MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

[Figure: reaction scheme ACV → isopenicillin N, with Fe+2, ascorbate and O2 (giving 2 H2O); active-site sketch showing Fe bound to His212, Asp214, His268, the sulfur of ACV, O2 (NO) and H2O]

Research: IPNS

• Goal: identify the Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (ClustalW) and search for conserved residues in the MSA
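Searching an MSA for conserved residues can be sketched as a scan for columns in which every sequence carries the same residue. The strict rule (all rows identical, no gaps) is an assumption; in practice a tolerance for near-conserved columns is often used. The toy rows are the six-sequence alignment from the scoring example earlier in these slides, not IPNS data.

```python
def conserved_columns(rows):
    """0-based indices of columns where every row has the same non-gap residue."""
    return [i for i, col in enumerate(zip(*rows))
            if len(set(col)) == 1 and col[0] != "-"]

rows = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A",
        "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]

hits = conserved_columns(rows)
print([(i + 1, rows[0][i]) for i in hits])   # 1-based column and its residue
```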

MSA - bacteria only

Not enough variation

MSA - bacteria & fungi

Not enough variation

[Figure labels: bacteria & fungi; bacteria]

Step 2

Goal: add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW


• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to identify the active-site residues
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (ClustalW)
2. Construct a consensus sequence and perform a new search, OR construct a profile and perform a new search
3. MSA (ClustalW) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column
• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Consensus:
A A C T T G T
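The consensus of the three rows can be computed column by column. A minimal sketch; how ties are broken is not specified on the slide, so here the symbol seen first in the column wins, which is simply what Counter.most_common does.

```python
from collections import Counter

def consensus(rows):
    """Most frequent symbol of each column (first-seen symbol wins a tie)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
result = consensus(rows)
print(result)
```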

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters of the alignment at each column
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Profile (first four columns):
Column  1     2     3     4
A       1     0.67  0     0
T       0     0.33  0     1
C       0     0     1     0
G       0     0     0     0

Profile vs Consensus

• The following multiple alignments will have the same consensus

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus of both:
A A C T T G T

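Profile vs Consensus

• But they have different profiles

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Profile of alignment 1 (first four columns):
Column  1     2     3     4
A       0.66  1     0     0
T       0     0     0     1
C       0.33  0     0.66  0
G       0     0     0.33  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Profile of alignment 2 (first four columns):
Column  1     2     3     4
A       1     1     0     0
T       0     0     0     1
C       0     0     1     0
G       0     0     0     0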
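The point of the two slides, same consensus but different profiles, can be checked directly. A minimal sketch with illustrative function names; frequencies are rounded to two decimals, so 2/3 prints as 0.67 where the slide shows 0.66.

```python
from collections import Counter

def consensus(rows):
    """Most frequent symbol of each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def profile(rows):
    """Per-column symbol frequencies, rounded to two decimals."""
    n = len(rows)
    return [{s: round(c / n, 2) for s, c in Counter(col).items()}
            for col in zip(*rows)]

first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(first), consensus(second))
print(profile(first)[0], profile(second)[0])
```

A search with the consensus treats both alignments identically, while a search with the profile can still tell that column 1 of the first alignment tolerates a C.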

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

[Figure: sequence logo of the alignment above]

http://weblogo.berkeley.edu
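The letter-stack heights in a logo are commonly derived from the information content of each column. The sketch below uses the standard definition (log2 of the alphabet size minus the Shannon entropy) and makes two assumptions: gaps are ignored, and no small-sample correction is applied, so the numbers need not match what WebLogo draws. A column consisting only of gaps is not handled.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4):
    """Bits of information in one column: log2(alphabet size) minus Shannon entropy."""
    residues = [c for c in column if c != "-"]   # assumption: gaps are ignored
    n = len(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())
    return math.log2(alphabet_size) - entropy

rows = ["AACCGCTCTT", "AGCCGCGC-T", "A-CAGAGCCT",
        "AAGCACGC-T", "ACGGGTGCTT", "ATGC-CGC-T"]

ic = [information_content(col) for col in zip(*rows)]
print([round(v, 2) for v in ic])
```

Fully conserved columns (1, 8 and 10) reach the 2-bit maximum for DNA, which is why they appear as the tallest letters.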

PSI-BLAST

• Position Specific Iterated: automatic profile-like search

1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST profile search
4. Final results

• Alignment with distantly related proteins

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies - Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (cont'd)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA - bacteria only
  • MSA - bacteria & fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 18: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Profile Representation of MA

bull Alternatively use log oddsbull pi(a) = fraction of arsquos in col i bull p(a) = fraction of arsquos overallbull log pi(a)p(a)

CG copy Ron Shamir 0623

- A G G C T A T C A C C T G T A G ndash C T A C C A - - - G C A G ndash C T A C C A - - - G C A G ndash C T A T C A C ndash G G C A G ndash C T A T C G C ndash G G

A 1 1 8 C 6 1 4 1 6 2G 1 2 2 4 1T 2 1 6 2- 2 8 4 8 4

Aligning a sequence to a profile

• The key in pairwise alignment is scoring two positions x, y: σ(x, y)
• For a letter x and a column y in a profile, σ(x, y) = value of x in column y
• Invent a score for σ(x, -)
• Run the DP algorithm for pairwise alignment
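
The recipe above can be sketched as an ordinary global DP in which one side is a profile. This is a minimal sketch under stated assumptions: the function name `align_to_profile`, the default `gap_score=-1.0` for a letter placed against a new gap column, and the use of the column's own '-' frequency when a column is left unmatched are all choices made for illustration, not taken from the slides.

```python
def align_to_profile(seq, profile, gap_score=-1.0):
    """Globally align a plain sequence to a profile.

    profile: list of dicts, one per column, mapping characters to values.
    sigma(x, column) is the profile value of x in that column.
    Returns (score, pairs), where each pair is (character or '-',
    column index or None for a letter placed against a new gap column).
    """
    n, m = len(seq), len(profile)
    # dp[i][j]: best score for seq[:i] against the first j columns.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap_score
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + profile[j - 1].get("-", 0.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            col = profile[j - 1]
            dp[i][j] = max(
                dp[i - 1][j - 1] + col.get(seq[i - 1], 0.0),  # letter vs column
                dp[i - 1][j] + gap_score,                      # letter vs new gap column
                dp[i][j - 1] + col.get("-", 0.0),              # gap vs column
            )
    # Trace back one optimal path.
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        col = profile[j - 1] if j > 0 else None
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + col.get(seq[i - 1], 0.0):
            pairs.append((seq[i - 1], j - 1))
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + col.get("-", 0.0):
            pairs.append(("-", j - 1))
            j -= 1
        else:
            pairs.append((seq[i - 1], None))
            i -= 1
    pairs.reverse()
    return dp[n][m], pairs


# Hypothetical three-column profile used only for this example.
toy_profile = [{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"T": 1.0}]
print(align_to_profile("AGT", toy_profile))
print(align_to_profile("AT", toy_profile))
```

Replacing the raw frequencies with log-odds values changes only the numbers stored in the profile; the recurrence stays the same.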

Aligning alignments

• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles

x GGGCACTGCAT
y GGTTACGTC--    Alignment 1
z GGGAACTGCAG

w GGACGTACC--    Alignment 2
v GGACCT-----

Combined alignment:

x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
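
To run DP on two profiles, one needs a score for placing a column of one alignment against a column of the other. One possible choice, shown here purely as an assumption, is the average pairwise score over all cross pairs of characters; the function name `column_score` and the match/mismatch/gap values are illustrative and not from the slides.

```python
def column_score(col_a, col_b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Average score over all cross pairs of characters from two columns."""
    total = 0.0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue  # gap against gap contributes 0 in this sketch
            if a == "-" or b == "-":
                total += gap
            else:
                total += match if a == b else mismatch
    return total / (len(col_a) * len(col_b))


alignment_1 = ["GGGCACTGCAT", "GGTTACGTC--", "GGGAACTGCAG"]  # x, y, z
alignment_2 = ["GGACGTACC--", "GGACCT-----"]                 # w, v
cols_1 = list(zip(*alignment_1))
cols_2 = list(zip(*alignment_2))

print(column_score(cols_1[0], cols_2[0]))  # all G against all G
print(column_score(cols_1[2], cols_2[2]))  # G/T/G against A/A
```

With such a column score in hand, the usual pairwise recurrence applies unchanged, with columns taking the place of single letters.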

Multiple Alignment Greedy Heuristic

• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.

k sequences:

u1 = ACGTACGTACGT...
u2 = TTAATTAATTAA...
u3 = ACTACTACTACT...
...
uk = CCGGCCGGCCGG

k-1 sequences/profiles:

u1 = ACg/tTACg/tTACg/cT...
u2 = TTAATTAATTAA...
...
uk = CCGGCCGGCCGG...

ClustalW (Thompson, Higgins, Gibson '94)

• Popular multiple alignment tool today
• 'W' = 'weighted' (different parts of the alignment are weighted differently)
• Three-step process:

1) Construct pairwise alignments
2) Build guide tree
3) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

• Aligns each sequence against each other, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)
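
The similarity measure above can be computed directly from a pairwise alignment. The sketch below assumes the two sequences are already aligned (equal-length gapped strings) and divides by the length of the shorter ungapped sequence; the slide says only "sequence length", so that choice of denominator, like the function name `percent_identity`, is an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Exact matches divided by the length of the shorter ungapped sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both rows carry the same non-gap character.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    return matches / shorter if shorter else 0.0


# Hypothetical aligned pair: 4 identical columns, both sequences 5 letters long.
print(percent_identity("ACGT-A", "AC-TTA"))
```

Applying this to every pair of input sequences fills the lower triangle of a similarity matrix like the one on the slide.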

Step 2: Guide Tree

• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  • Selects the closest pair of sequences/subtrees
  • Combines them into a single subtree
  • Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont'd)

[Figure: guide tree with leaves v1, v3, v4, v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -
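
The merge order on this slide can be reproduced from the similarity matrix with a simple clustering loop. Note that this sketch uses average linkage (a UPGMA-like rule), not the neighbor-joining method ClustalW itself uses; it is swapped in here only because it is short. The names `SIM`, `cluster_similarity` and `build_guide_tree` are illustrative.

```python
from itertools import combinations

# Pairwise similarities from the matrix on the slide.
SIM = {
    ("v1", "v2"): 0.17, ("v1", "v3"): 0.87, ("v1", "v4"): 0.59,
    ("v2", "v3"): 0.28, ("v2", "v4"): 0.33, ("v3", "v4"): 0.62,
}


def leaf_similarity(a, b):
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]


def leaves(tree):
    """Flatten a nested-tuple tree into its list of leaf names."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def cluster_similarity(t1, t2):
    """Average similarity over all leaf pairs (average linkage)."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    return sum(leaf_similarity(a, b) for a, b in pairs) / len(pairs)


def build_guide_tree(names):
    """Repeatedly merge the two most similar clusters; return tree and merge order."""
    clusters = list(names)
    merge_order = []
    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: cluster_similarity(*p))
        clusters = [c for c in clusters if c not in best]
        clusters.append(best)
        merge_order.append(best)
    return clusters[0], merge_order


tree, order = build_guide_tree(["v1", "v2", "v3", "v4"])
print(order)
```

The merges come out as (v1, v3), then v4 joined to that pair, then v2, which is the same order as the "Calculate" steps above.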

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores...

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

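CLUSTALW algorithm

1. Build a table of the edit distance between every two sequences
2. Build a tree by joining the similar sequences, in order of how well they match

[Figure: guide tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]

3. Build the alignment in the order dictated by the tree

There are three cases:
- Aligning a pair of sequences
- Aligning two alignments
- Adding a single sequence to an existing alignment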

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that if there are several very similar sequences, the relative weight of each one is reduced
• A special method is used to set the cost of inserting gaps

Disadvantages
• Since the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs
• Gaps do not disappear: "Once a gap, always a gap"
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it

CLUSTALW algorithm

• We can deduce a pairwise alignment for each pair of sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one

Best Pairwise alignment (optimal)

Projected Pairwise alignment
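
A projected pairwise alignment is obtained by taking two rows of the multiple alignment and dropping the columns in which both rows have a gap. A minimal sketch follows; the function name `project_pair` and the toy alignment are made up for illustration.

```python
def project_pair(msa, i, j):
    """Rows i and j of the MSA, with columns that are gaps in both rows removed."""
    kept = [(a, b) for a, b in zip(msa[i], msa[j]) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


# Hypothetical three-row alignment.
toy_msa = ["AC-GT", "A--GT", "ACTGT"]
print(project_pair(toy_msa, 0, 1))
```

Scoring this projection and comparing it with the score of an optimal pairwise alignment of the same two sequences shows how much the multiple alignment costs that particular pair.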

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw  ClustalW at the SRS site at EBI

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output: Aln format

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast

MSA Editing: Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS, the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

[Figure: reaction scheme from ACV to isopenicillin N, labelled Fe+2, ascorbate, O2 and 2 H2O; active-site diagram with Fe bound to His268, His212, Asp214, S(ACV), H2O and O2 (NO)]

Research IPNS

• Goal: Identify Fe+2 binding residues
• Possible solutions:

1. In the lab
2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA

MSA - bacteria only

Not enough variation

MSA - bacteria & fungi

Not enough variation

[Figure labels: bacteria & fungi; bacteria]

Step 2

Goal: Add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as query
• Select sequences similar to the query along its entire length
• Export in FASTA format
• Run CLUSTALW

• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• The sequences of closely related enzymes are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (clustalw)
2. - Construct a consensus sequence and perform a new search
   OR
   - Construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.
• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Consensus: A A C T T G T
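
The consensus rule (most frequent character per column) is short enough to write out. This is a minimal sketch; the function name `consensus` is illustrative, and ties are broken by whichever character appears first in the column, a detail the slide does not address.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


# The three aligned sequences from the slide.
print(consensus(["ATCTTGT", "AACTTGT", "AACTTCT"]))
```

The result can then be used as a single query sequence in a new database search.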

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column.
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6
A    1     0.67  0     0
T    0     0.33  0     1
C    0     0     1     0
G    0     0     0     0

Profile vs Consensus

• The following multiple alignments will have the same consensus

A A C T T G C
A A G T C G T
C A C T T C T

A A C T T G T
A A C T T G T
A A C T T C T

Consensus: A A C T T G T

Profile vs Consensus

• But have a different profile

A A C T T G C
A A G T C G T
C A C T T C T

     1     2     3     4     5     6
A    0.66  1     0     0
T    0     0     0     1
C    0.33  0     0.66  0
G    0     0     0.33  0

A A C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6
A    1     1     0     0
T    0     0     0     1
C    0     0     1     0
G    0     0     0     0
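
The point of these two slides can be checked directly: the two alignments collapse to the same consensus, yet their per-column frequencies differ. The helper names `consensus` and `frequency_profile` below are illustrative, not from the slides.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def frequency_profile(rows):
    """One dict per column mapping each observed character to its frequency."""
    n = len(rows)
    return [{ch: c / n for ch, c in Counter(col).items()} for col in zip(*rows)]


first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(first), consensus(second))                    # identical strings
print(frequency_profile(first) == frequency_profile(second))  # the profiles differ
print(frequency_profile(first)[0], frequency_profile(second)[0])
```

A search driven by the profile can therefore tell these alignments apart, while a search driven by the consensus cannot.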

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

[Figure: sequence logo of the alignment above]

http://weblogo.berkeley.edu
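
The height of each stack in a sequence logo is commonly taken to be the information content of that column. The sketch below computes it as log2(4) minus the Shannon entropy of the column, ignoring gaps and without any small-sample correction; both simplifications are assumptions and may differ from what the WebLogo server does.

```python
import math
from collections import Counter

# The six aligned sequences from the slide.
ROWS = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]


def information_content(column, alphabet_size=4):
    """log2(alphabet size) minus Shannon entropy, in bits; gaps are ignored."""
    letters = [ch for ch in column if ch != "-"]
    if not letters:
        return 0.0
    counts = Counter(letters)
    entropy = -sum(
        (c / len(letters)) * math.log2(c / len(letters)) for c in counts.values()
    )
    return math.log2(alphabet_size) - entropy


bits = [information_content(col) for col in zip(*ROWS)]
for position, value in enumerate(bits, start=1):
    print(position, round(value, 2))
```

Fully conserved columns (positions 1, 8 and 10 here) reach the maximum of 2 bits, which is why their letters appear tallest in the logo.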

Psi Blast

• Position Specific Iterated - automatic profile-like search

[Flowchart: Regular blast -> Construct profile from blast results -> Blast profile search -> Final results]

• Alignment with distantly related proteins

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative   Km      kcat      kcat/Km
            Activity   (mM)    (min-1)   (mM-1 min-1)
Wild type   100        0.4     388       969
His48Ala    16         0.56    75        134
His63Ala    31         1.0     142       142
His114Ala   28         0.85    125       147
His124Ala   48         0.84    321       381
His135Ala   22         0.59    117       198
His212Ala   <0.007     nd      nd
His268Ala   <0.003     nd      nd
Asp14Ala    5          0.86    0.56      0.7
Asp113Ala   63         0.45    238       528
Asp131Ala   68         0.48    363       755
Asp203Ala   32         0.91    123       135
Asp214Ala   <0.004     nd      nd

- IPNS

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies ndash Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (contrsquod)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA ndash bacteria only
  • MSA ndash bacteria amp fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 21: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Multiple Alignment Greedy Heuristic

bull Choose most similar pair of sequences and combine into a profile thereby reducing alignment of k sequences to an alignment of of k-1 sequencesprofiles Repeat

CG copy Ron Shamir 0627

u1= ACGTACGTACGThellip

u2 = TTAATTAATTAAhellip

u3 = ACTACTACTACThellip

hellip

uk = CCGGCCGGCCGG

u1= ACgtTACgtTACgcThellip

u2 = TTAATTAATTAAhellip

hellip

uk = CCGGCCGGCCGGhellip

k

k-1

ClustalW Thompson Higgins Gibson 94

bull Popular multiple alignment tool todaybull lsquoWrsquo = lsquoweightedrsquo (different parts of alignment

are weighted differently)bull Three-step process

1) Construct pairwise alignments2) Build Guide Tree3) Progressive alignment guided by the tree

CG copy Ron Shamir 0628

Step 1 Pairwise Alignment

bull Aligns each sequence against each other giving a similarity matrix

bull Similarity = exact matches sequence length (percent identity)

CG copy Ron Shamir 0629

v1 v2 v3 v4

v1 -v2 17 -v3 87 28 -v4 59 33 62 -

)17 means 17 identical(

Step 2 Guide Tree

bull Use the similarity method to create a Guide Tree by applying some clustering methodbull Guide tree roughly reflects evolutionary relationsbull ClustalW uses the neighbor-joining method which

iterativelybull Selects the closest pair of sequencessubtreesbull Combines them into a single subtreebull Re-computes the distances from the new subtree to all the other

sequencessubtrees

Step 2: Guide Tree (cont'd)

[Figure: guide tree over v1, v2, v3, v4]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

      v1    v2    v3    v4
v1    -
v2    0.17  -
v3    0.87  0.28  -
v4    0.59  0.33  0.62  -

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof, or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well conserved a column is.

CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences.
2. Build a tree by joining similar sequences in order of their similarity.

   [Figure: tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]

3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one is reduced
• A special method sets the cost of inserting gaps

Disadvantages
• Because the construction is irreversible, the result depends heavily on the quality of the alignment of the first pairs
• Gaps never disappear ("once a gap, always a gap")
• Very sensitive to the gap penalty

Advantages
• Fast
• Reasonable
• Many parameters can be set
• Everyone uses it

CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.

[Figure: best (optimal) pairwise alignment compared with the projected pairwise alignment]
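Projecting a multiple alignment onto two of its rows can be sketched as below: keep the two rows and drop every column in which both have a gap. The function name is illustrative. The toy alignment is the first possible alignment of S1=AGGTC, S2=GTTCG, S3=TGAAC from the definition slide, written row by row.

```python
def project_pair(msa: dict, a: str, b: str) -> tuple:
    """Project an MSA onto rows `a` and `b`, dropping columns that are gaps in both."""
    kept = [(x, y) for x, y in zip(msa[a], msa[b]) if not (x == "-" and y == "-")]
    row_a = "".join(x for x, _ in kept)
    row_b = "".join(y for _, y in kept)
    return row_a, row_b


# Rows of the toy multiple alignment (gaps written as '-').
msa = {
    "S1": "AGGT-C-",
    "S2": "-G-TTCG",
    "S3": "TG-AAC-",
}
projected = project_pair(msa, "S1", "S3")
print(projected[0])  # AGGT-C
print(projected[1])  # TG-AAC
```

The projected pair is a valid pairwise alignment of S1 and S3, but nothing guarantees that it scores as well as the alignment an optimal pairwise method would find for the same two sequences.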

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output: Aln format

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast

MSA Editing: Jalview

[Figure: Jalview alignment view, with the conservation track labeled]

www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - FASTA

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making an MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

[Figure: IPNS reaction scheme, ACV to isopenicillin N (labels: Fe+2, ascorbate, O2, 2 H2O), and the active-site iron with its ligands (labels: His212, His268, Asp214, S-ACV, H2O, O2 (NO))]

Research IPNS

• Goal: Identify Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA
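The "search for conserved residues" step can be sketched as a scan over the columns of the alignment. This is a minimal illustration, assuming the MSA is available as equal-length gapped strings; the function name and the toy rows are hypothetical and are not IPNS data.

```python
def conserved_columns(rows):
    """Return (1-based column, residue) for every column that is identical in all rows."""
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all rows of an alignment must have the same length")
    hits = []
    for i, column in enumerate(zip(*rows), start=1):
        residues = set(column)
        # Fully conserved: a single distinct character, and it is not a gap.
        if len(residues) == 1 and "-" not in residues:
            hits.append((i, column[0]))
    return hits


# Hypothetical toy alignment, for illustration only.
toy_msa = ["MHKD-LH", "MHRDALH", "MHKDGIH"]
hits = conserved_columns(toy_msa)
print(hits)  # [(1, 'M'), (2, 'H'), (4, 'D'), (7, 'H')]
```

When the aligned sequences are all very similar, almost every column passes this test, which is why the slides go on to add more distant sequences.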

MSA - bacteria only

Not enough variation

MSA - bacteria & fungi

Not enough variation

[Figure labels: bacteria & fungi; bacteria]

Step 2

Goal: Add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as query
• Select sequences similar to the query over the entire length
• Export in FASTA format
• Run CLUSTALW


• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to identify the active-site residues
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (clustalw)
2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.
• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T
-------------
A A C T T G T   (consensus)
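A minimal Python sketch of the consensus rule, using the three-row alignment from the slide. The function name is illustrative, and the tie-breaking rule (the first character seen in the column wins) is an assumption of this sketch, since the slides do not say how ties are resolved.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment."""
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties go to the character that appears first in the column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT"]
result = consensus(alignment)
print(result)  # AACTTGT
```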

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column.
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6
A    1     0.67  0     0
T    0     0.33  0     1
C    0     0     1     0
G    0     0     0     0
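The profile of the three-row alignment above can be computed directly from its columns. The sketch below is illustrative: the function name, the fixed alphabet "ATCG", and rounding to two decimals are choices made for this example.

```python
from collections import Counter


def profile(rows, alphabet="ATCG"):
    """Per-column relative frequency of each character in `alphabet`."""
    n = len(rows)
    columns = [Counter(col) for col in zip(*rows)]
    # A Counter returns 0 for characters that never occur in a column.
    return {ch: [round(c[ch] / n, 2) for c in columns] for ch in alphabet}


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT"]
prof = profile(alignment)
for ch, freqs in prof.items():
    print(ch, freqs)
```

Unlike the consensus, the profile keeps the minority characters: column 2 is recorded as A 0.67 / T 0.33 rather than just A.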

Profile vs Consensus

• The following multiple alignments will have the same consensus

Alignment 1      Alignment 2
A A C T T G C    A A C T T G T
A A G T C G T    A A C T T G T
C A C T T C T    A A C T T C T

Consensus (both): A A C T T G T

Profile vs Consensus

• But they have different profiles

Alignment 1:

A A C T T G C
A A G T C G T
C A C T T C T

     1     2     3     4     5     6
A    0.66  1     0     0
T    0     0     0     1
C    0.33  0     0.66  0
G    0     0     0.33  0

Alignment 2:

A A C T T G T
A A C T T G T
A A C T T C T

     1     2     3     4     5     6
A    1     1     0     0
T    0     0     0     1
C    0     0     1     0
G    0     0     0     0

Sequence Logo

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

[Figure: sequence logo of the alignment above]

http://weblogo.berkeley.edu

Psi Blast

• Position-Specific Iterated BLAST: an automatic profile-like search

1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST profile search
4. Final results

• Alignment with distantly related proteins

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative   Km      kcat      kcat/Km
            Activity   (mM)    (min-1)   (mM-1 min-1)
Wild type   100        0.4     388       969
His48Ala    16         0.56    75        134
His63Ala    31         1.0     142       142
His114Ala   28         0.85    125       147
His124Ala   48         0.84    321       381
His135Ala   22         0.59    117       198
His212Ala   <0.007     nd      nd
His268Ala   <0.003     nd      nd
Asp14Ala    5          0.86    0.56      0.7
Asp113Ala   63         0.45    238       528
Asp131Ala   68         0.48    363       755
Asp203Ala   32         0.91    123       135
Asp214Ala   <0.004     nd      nd

Isopenicillin N Synthase - IPNS

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies ndash Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (contrsquod)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA ndash bacteria only
  • MSA ndash bacteria amp fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 25: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Step 2: Guide Tree (cont'd)

      v1   v2   v3   v4
v1    -
v2    17   -
v3    87   28   -
v4    59   33   62   -

(Figure: guide tree with leaves v1, v3, v4, v2)

Calculate:
v13   = alignment(v1, v3)
v134  = alignment((v13), v4)
v1234 = alignment((v134), v2)
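The calculation order on this slide (v1 with v3, then v4, then v2) can be reproduced from the pairwise table with a short sketch. It assumes the table entries are similarity scores (higher means more alike) and uses a simple greedy rule: start from the best pair, then repeatedly add the sequence with the highest average score to those already joined. The function names are illustrative, and this is not ClustalW's actual tree-building method.

```python
from itertools import combinations

# Pairwise scores from the slide, read here as similarities (an assumption).
SCORES = {
    frozenset(("v1", "v2")): 17,
    frozenset(("v1", "v3")): 87,
    frozenset(("v2", "v3")): 28,
    frozenset(("v1", "v4")): 59,
    frozenset(("v2", "v4")): 33,
    frozenset(("v3", "v4")): 62,
}


def score(a, b):
    """Look up the symmetric pairwise score of two sequences."""
    return SCORES[frozenset((a, b))]


def merge_order(names):
    """Return the order in which sequences join the growing alignment."""
    # Start with the single most similar pair.
    first = max(combinations(names, 2), key=lambda pair: score(*pair))
    joined = list(first)
    remaining = [n for n in names if n not in joined]
    while remaining:
        # Pick the sequence with the highest average score to the joined set.
        best = max(
            remaining,
            key=lambda n: sum(score(n, j) for j in joined) / len(joined),
        )
        joined.append(best)
        remaining.remove(best)
    return joined


order = merge_order(["v1", "v2", "v3", "v4"])
print(order)
```

With the slide's table, v4 averages (59 + 62) / 2 = 60.5 against the v1/v3 pair and v2 averages (17 + 28) / 2 = 22.5, so v4 joins before v2.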

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is
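As a rough illustration of the conservation markers, the sketch below prints a `*` under every column in which all rows carry the same residue. It is a simplification: ClustalW output also uses `:` and `.` for strongly and weakly similar residue groups, which are not modelled here. The function name is hypothetical, and the input is only the first 13 columns of three of the FOS rows shown above.

```python
def conservation_line(rows):
    """Return a marker line with '*' under fully identical columns.

    Only complete identity is marked; columns containing a gap are
    never marked. Similarity groups (':' and '.') are not modelled.
    """
    marks = []
    for col in zip(*rows):
        identical = len(set(col)) == 1 and col[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


rows = [
    "PEEMSVTS-LDLT",  # FOS_RAT, first 13 columns
    "PEEMSVAS-LDLT",  # FOS_MOUSE, first 13 columns
    "SEELAAATALDLG",  # FOS_CHICK, first 13 columns
]

for row in rows:
    print(row)
print(conservation_line(rows))
```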

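CLUSTALW algorithm

1. Build a table of edit distances between every pair of sequences
2. Build a tree by joining the most similar sequences, in order of their similarity
   (Figure: example tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale)
3. Build the alignment in the order dictated by the tree. There are three cases:
   - aligning a pair of sequences
   - aligning two alignments
   - adding a single sequence to an existing alignment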

CLUSTALW algorithm

Fine points
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one is reduced
• A special method sets the cost of inserting gaps

Disadvantages
• The construction is irreversible, so the result depends heavily on the quality of the alignment of the first pairs
• Gaps never disappear: "Once a gap, always a gap"
• High sensitivity to the gap penalty

Advantages
• Fast
• Reasonable results
• Many parameters can be set
• Everyone uses it

CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm; better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one

(Figure: best (optimal) pairwise alignment compared with the projected pairwise alignment)
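A projected pairwise alignment can be read off an MSA by keeping two rows and dropping the columns in which both rows hold a gap. The sketch below shows this on a small three-sequence toy alignment of AGGTC, GTTCG and TGAAC; the function name is illustrative only.

```python
def project_pair(msa, i, j):
    """Project rows i and j of an MSA onto a pairwise alignment.

    Columns where both rows hold a gap carry no information about the
    pair, so they are dropped.
    """
    kept = [
        (a, b)
        for a, b in zip(msa[i], msa[j])
        if not (a == "-" and b == "-")
    ]
    row_i = "".join(a for a, _ in kept)
    row_j = "".join(b for _, b in kept)
    return row_i, row_j


# Toy MSA: each row, with gaps removed, gives AGGTC, GTTCG and TGAAC.
msa = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]

first, third = project_pair(msa, 0, 2)
print(first)
print(third)
```

The projection is whatever the MSA happens to imply for that pair, which is why it may score worse than an alignment computed for the two sequences alone.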

ClustalW at EMBL

http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output: Aln format

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast

MSA Editing: Jalview

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins – electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in biosynthesis of penicillin

(Figure: reaction scheme from ACV to isopenicillin N, labelled Fe+2, ascorbate, O2 and 2 H2O; active-site diagram showing Fe with His268, His212, Asp214, the ACV sulfur, O2 (NO) and H2O)

Research: IPNS

• Goal: identify Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (clustalw) and search for conserved residues in the MSA

MSA – bacteria only

Not enough variation

MSA – bacteria & fungi

Not enough variation

(Figure labels: bacteria & fungi; bacteria)

Step 2

Goal: add more enzymes similar to IPNS

Implementation
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as query
• Select sequences similar to the query over the entire length
• Export in FASTA format
• Run CLUSTALW

• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• Not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation
1. Obtain an MSA (clustalw)
2. Either construct a consensus sequence and perform a new search,
   or construct a profile and perform a new search
3. MSA (clustalw) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column
• Consensus: each position reflects the most common character found at that position

A T C T T G T
A A C T T G T
A A C T T C T
-------------
A A C T T G T   (consensus)
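The consensus rule described on this slide can be sketched in a few lines: take each column of the alignment and keep its most frequent character. The function name is illustrative, and the tie-breaking rule (first character seen in the column wins) is an arbitrary assumption, since the slides do not specify one.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment.

    Ties are broken by first appearance in the column, which is an
    arbitrary choice made for this sketch.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    return "".join(
        Counter(col).most_common(1)[0][0] for col in zip(*rows)
    )


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(alignment))
```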

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column
• Profile: each position reflects the frequency of each character found at that position

A T C T T G T
A A C T T G T
A A C T T C T

Columns 1–4 of the profile:

      1     2     3     4
A     1     0.67  0     0
T     0     0.33  0     1
C     0     0     1     0
G     0     0     0     0
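A frequency profile of this kind can be computed directly from the alignment columns. The sketch below is a minimal version under stated assumptions: the function name is hypothetical, the alphabet is fixed to the four nucleotides, and gap characters are not handled.

```python
from collections import Counter


def profile(rows, alphabet="ATCG"):
    """Per-column character frequencies of an alignment.

    Returns one dict per column, mapping each letter of `alphabet` to
    its relative frequency among the rows. Gaps are not handled here.
    """
    n = len(rows)
    columns = []
    for col in zip(*rows):
        counts = Counter(col)
        columns.append({ch: counts[ch] / n for ch in alphabet})
    return columns


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT"]
prof = profile(alignment)
for position, freqs in enumerate(prof, start=1):
    print(position, {ch: round(f, 2) for ch, f in freqs.items()})
```

For this alignment, column 2 has A at 2/3 and T at 1/3, and column 3 is entirely C.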

Profile vs Consensus

• The following multiple alignments have the same consensus

Alignment 1      Alignment 2
A A C T T G C    A A C T T G T
A A G T C G T    A A C T T G T
C A C T T C T    A A C T T C T

Consensus of both: A A C T T G T

Profile vs Consensus

• But they have different profiles

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Profile of alignment 1, columns 1–4:

      1     2     3     4
A     0.66  1     0     0
T     0     0     0     1
C     0.33  0     0.66  0
G     0     0     0.33  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Profile of alignment 2, columns 1–4:

      1     2     3     4
A     1     1     0     0
T     0     0     0     1
C     0     0     1     0
G     0     0     0     0

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
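In a sequence logo, the total height of each column is its information content, so conserved columns appear tall and variable ones short. The sketch below computes that height for a nucleotide column as log2(4) minus the Shannon entropy of the observed letters. It is a simplification: gaps are ignored and no small-sample correction is applied, and the function name is illustrative.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content of one alignment column, in bits.

    Computed as log2(alphabet_size) - H, where H is the Shannon entropy
    of the observed letters. Gap characters ('-') are ignored and no
    small-sample correction is applied (simplifying assumptions).
    """
    letters = [c for c in column if c != "-"]
    if not letters:
        return 0.0
    counts = Counter(letters)
    total = len(letters)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return math.log2(alphabet_size) - entropy


rows = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
bits = [column_information(col) for col in zip(*rows)]
for position, value in enumerate(bits, start=1):
    print(position, round(value, 2))
```

Fully conserved columns, such as the first and last, reach the 2-bit maximum; the third column, split evenly between C and G, carries 1 bit.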

Psi Blast

• Position Specific Iterated - automatic profile-like search

1. Regular blast
2. Construct profile from blast results
3. Blast profile search
4. Final results

• Alignment with distantly related proteins

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Isopenicillin N Synthase – IPNS

Enzyme      Relative   Km     kcat      kcat/Km
            activity   (mM)   (min-1)   (mM-1 min-1)
Wild type   100        0.4    388       969
His48Ala    16         0.56   75        134
His63Ala    31         1.0    142       142
His114Ala   28         0.85   125       147
His124Ala   48         0.84   321       381
His135Ala   22         0.59   117       198
His212Ala   <0.007     nd     nd
His268Ala   <0.003     nd     nd
Asp14Ala    5          0.86   0.56      0.7
Asp113Ala   63         0.45   238       528
Asp131Ala   68         0.48   363       755
Asp203Ala   32         0.91   123       135
Asp214Ala   <0.004     nd     nd

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies ndash Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (contrsquod)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA ndash bacteria only
  • MSA ndash bacteria amp fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 26: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

Step 3 Progressive Alignmentbull Start by aligning the two most similar sequencesbull Using the guide tree add in the most similar pair

(seq-seq seq-prof or prof-prof)bull Insert gaps as necessarybull Many ad-hoc rules weighting different matrices

special gap scoreshellip

CG copy Ron Shamir 0632

FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDFOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFDFOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFDFOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQFOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Dots and stars show how well-conserved a column is

33

בונים עץ עי חיבור הרצפים הדומים לפי סדר התאמתם2

Hbb Hbb Hba Hba Myg Human Horse Human Horse Whale

בונים את ההתאמה לפי הסדר המוכתב3 עי העץ

ישנם שלושה מצביםהתאמת זוג רצפים - התאמה בין התאמות-הוספת רצף בודד להתאמה קיימת-

בונים טבלה של מרחק עריכה בין כל שני רצפים1CLUSTALW algorithm

34

נקודות עדינותנותנים משקלות שונים לרצפים כך שאם יש מספר רצפים מאד דומים bull

המשקל היחסי של כל אחד יקטןשיטה מיוחדת לקביעת מחיר להכנסת רווחיםbull

חסרונותמאחר שהבניה היא בלתי הפיכה יש תלות גדולה מאד באיכות ההתאמה bull

של הזוגות הראשוניםOnce a Gap Always a gapרווחים אינם נעלמים bullרגישות רבה לקנס הרווחיםbull

יתרונותמהירbullסבירbullאפשר לקבוע הרבה פרמטריםbullכולם משתמשיםbull

CLUSTALW algorithm

CLUSTALW algorithmbull We can deduce a pairwise alignment for each two

sequences in the multiple alignment (projected pairwise alignment)

bull The projected pairwise alignment is NOT the best pairwise alignment for the two sequences

bull CLUTALW is not an optimal algorithm Better alignments might exist The algorithm yields a possible alignment but not necessarily the best one35

Best Pairwise alignment (optimal)

Projected Pairwise alignment

ClustalW at EMBL

36

httpwwwebiacukclustalw Clustalw at the SRS site at EBI

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output Aln format

37

MSA algorithms

bull Progressive methods (CLUSTALWT-Coffee) bull Iterative methods (Dialign)bull Direct optimization (Monte Carlo genetic

algorithms)bull Local methods eMotifs Blocks Psi-blast

38

MSA Editing Jalview

39

Conservation

wwwesembnetorgServicesMolBiojalviewindexhtml

MSA formats - fasta

40

MSA formats - Aln

41

MSA formats - MSF

42

Example 1a a good MSA

43

44

Example 1b making MSA of distantly related proteins

45

Example 1c including more distant relatives in the MSA

Example 2 Isopenicillin N Synthasebull Mononuclear iron proteins ndash electron carrier proteins

Iron atoms are bound to amino acid side chainsbull In IPNS the metal ion is coordinated by three protein

residues bull IPNS is involved in biosynthesis of penicillin

46

N

SN

OCOOHO

H

H

Me

Me

COOH

NH2

N

SN

OCOOH

NH2 Me

Me

COOHO

ACV

Isopenicillin N

Fe+2

Ascorbate

O2

2H2O Fe

N

NHis268

H

N

NHis212

H

SACV

O2 (NO)

H2O

OAsp214

O

Research IPNS

bull Goal Identify Fe+2 binding residuesbull Possible solutions

1 In the lab2 Bioinformatic approach (comparing different

IPNS sequences)

47

Step 1

Multiple alignment of known IPNS

Implementation1 Obtain sequence (eg for NCBI) IPNS AND Bacteria[Organism] 2 MSA (clustalw) and search for conserved residues in the MSA

48

MSA ndash bacteria only

49

Not enough variation

50

MSA ndash bacteria amp fungi

Not enough variation

bacteria amp fungi

bacteria

Step 2Goal Add more enzymes similar to IPNS

Implementationbull Search in httpwwwexpasyorgtoolsblastbull Blast IPNS_CEPAC as querybull Select sequences similar to the query in the entire lengthbull Export in FASTA formatbull Run CLUSTALW

51

52

Step 2Goal Add more enzymes similar to IPNS

ImplementationbullSearch in httpwwwexpasyorgtoolsblast

bullBlast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire length

bullExport in FASTA formatbullRun CLUSTALW

53

bull New multiple alignment narrowing down the possibilities

54

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite similarbull Not enough variability to categorize the active sitesbull We need to obtain even more distant sequences

(distant homologs)

55

Step 3Using the results of the MSA for further searches

Implementation1 Obtain an MSA (clustalw)2 - Construct a consensus sequence and perform a new search

OR - Construct a profile and perform a new search

3 MSA (clustalw) and search for conserved residues in the MSA

56

Consensus Sequence

bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column

bull Consensus each position reflects the most common character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

57

Profilebull We can deduce a statistical model describing the

multiple sequence alignment A Profile holds statistical information about characters in alignment at each column

bull Profile each position reflects the frequency of the character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 1 067 0 0

T 0 033 1 1

C 0 0 0 0

G 0 0 0 0

58

Profile vs ConsensusbullThe following multiple alignments will have

the same consensusA A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

59

Profile vs ConsensusbullBut have a different profile

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 066 1 0 0

T 0 0 0 1

C 033 0 066

0

G 0 0 033

0

1 2 3 4 5 6

A 1 1 0 0

T 0 0 0 1

C 0 0 1 0

G 0 0 0 0

60

Sequence LOGO A A C C G C T C T T

A G C C G C G C - T

A - C A G A G C C T

A A G C A C G C - T

A C G G G T G C T T

A T G C ndash C G C - T

A gc c g G C T

httpweblogoberkeleyedu

61

Psi BlastbullPosition Specific Iterated - automatic profile-

like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

62

bull Alignment with distantly related proteins

63

bull Experimental evidence supports the finding that His212 Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme Relative Km kcat kcatKmActivity (mM) (min-1) (mM-1

min-1)

Wild type 100 04 388 969His48Ala 16 056 75 134His63Ala 31 10 142 142His114Ala 28 085 125 147His124Ala 48 084 321 381His135Ala 22 059 117 198His212Ala lt0007 nd ndHis268Ala lt0003 nd nd Asp14Ala 5 086 056 07Asp113Ala 63 045 238 528Asp131Ala 68 048 363 755Asp203Ala 32 091 123 135Asp214Ala lt0004 nd nd

Isopenicillin N Synthase

64

ndash IPNS

65

66

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies ndash Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (contrsquod)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA ndash bacteria only
  • MSA ndash bacteria amp fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 27: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

33

בונים עץ עי חיבור הרצפים הדומים לפי סדר התאמתם2

Hbb Hbb Hba Hba Myg Human Horse Human Horse Whale

בונים את ההתאמה לפי הסדר המוכתב3 עי העץ

ישנם שלושה מצביםהתאמת זוג רצפים - התאמה בין התאמות-הוספת רצף בודד להתאמה קיימת-

בונים טבלה של מרחק עריכה בין כל שני רצפים1CLUSTALW algorithm

34

נקודות עדינותנותנים משקלות שונים לרצפים כך שאם יש מספר רצפים מאד דומים bull

המשקל היחסי של כל אחד יקטןשיטה מיוחדת לקביעת מחיר להכנסת רווחיםbull

חסרונותמאחר שהבניה היא בלתי הפיכה יש תלות גדולה מאד באיכות ההתאמה bull

של הזוגות הראשוניםOnce a Gap Always a gapרווחים אינם נעלמים bullרגישות רבה לקנס הרווחיםbull

יתרונותמהירbullסבירbullאפשר לקבוע הרבה פרמטריםbullכולם משתמשיםbull

CLUSTALW algorithm

CLUSTALW algorithmbull We can deduce a pairwise alignment for each two

sequences in the multiple alignment (projected pairwise alignment)

bull The projected pairwise alignment is NOT the best pairwise alignment for the two sequences

bull CLUTALW is not an optimal algorithm Better alignments might exist The algorithm yields a possible alignment but not necessarily the best one35

Best Pairwise alignment (optimal)

Projected Pairwise alignment

ClustalW at EMBL

36

httpwwwebiacukclustalw Clustalw at the SRS site at EBI

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output Aln format

37

MSA algorithms

bull Progressive methods (CLUSTALWT-Coffee) bull Iterative methods (Dialign)bull Direct optimization (Monte Carlo genetic

algorithms)bull Local methods eMotifs Blocks Psi-blast

38

MSA Editing Jalview

39

Conservation

wwwesembnetorgServicesMolBiojalviewindexhtml

MSA formats - fasta

40

MSA formats - Aln

41

MSA formats - MSF

42

Example 1a a good MSA

43

44

Example 1b making MSA of distantly related proteins

45

Example 1c including more distant relatives in the MSA

Example 2 Isopenicillin N Synthasebull Mononuclear iron proteins ndash electron carrier proteins

Iron atoms are bound to amino acid side chainsbull In IPNS the metal ion is coordinated by three protein

residues bull IPNS is involved in biosynthesis of penicillin

46

N

SN

OCOOHO

H

H

Me

Me

COOH

NH2

N

SN

OCOOH

NH2 Me

Me

COOHO

ACV

Isopenicillin N

Fe+2

Ascorbate

O2

2H2O Fe

N

NHis268

H

N

NHis212

H

SACV

O2 (NO)

H2O

OAsp214

O

Research IPNS

bull Goal Identify Fe+2 binding residuesbull Possible solutions

1 In the lab2 Bioinformatic approach (comparing different

IPNS sequences)

47

Step 1

Multiple alignment of known IPNS

Implementation1 Obtain sequence (eg for NCBI) IPNS AND Bacteria[Organism] 2 MSA (clustalw) and search for conserved residues in the MSA

48

MSA ndash bacteria only

49

Not enough variation

50

MSA ndash bacteria amp fungi

Not enough variation

bacteria amp fungi

bacteria

Step 2Goal Add more enzymes similar to IPNS

Implementationbull Search in httpwwwexpasyorgtoolsblastbull Blast IPNS_CEPAC as querybull Select sequences similar to the query in the entire lengthbull Export in FASTA formatbull Run CLUSTALW

51

52

Step 2Goal Add more enzymes similar to IPNS

ImplementationbullSearch in httpwwwexpasyorgtoolsblast

bullBlast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire length

bullExport in FASTA formatbullRun CLUSTALW

53

bull New multiple alignment narrowing down the possibilities

54

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite similarbull Not enough variability to categorize the active sitesbull We need to obtain even more distant sequences

(distant homologs)

55

Step 3Using the results of the MSA for further searches

Implementation1 Obtain an MSA (clustalw)2 - Construct a consensus sequence and perform a new search

OR - Construct a profile and perform a new search

3 MSA (clustalw) and search for conserved residues in the MSA

56

Consensus Sequence

bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column

bull Consensus each position reflects the most common character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

57

Profilebull We can deduce a statistical model describing the

multiple sequence alignment A Profile holds statistical information about characters in alignment at each column

bull Profile each position reflects the frequency of the character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 1 067 0 0

T 0 033 1 1

C 0 0 0 0

G 0 0 0 0

58

Profile vs ConsensusbullThe following multiple alignments will have

the same consensusA A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

59

Profile vs ConsensusbullBut have a different profile

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 066 1 0 0

T 0 0 0 1

C 033 0 066

0

G 0 0 033

0

1 2 3 4 5 6

A 1 1 0 0

T 0 0 0 1

C 0 0 1 0

G 0 0 0 0

60

Sequence LOGO A A C C G C T C T T

A G C C G C G C - T

A - C A G A G C C T

A A G C A C G C - T

A C G G G T G C T T

A T G C ndash C G C - T

A gc c g G C T

httpweblogoberkeleyedu

61

Psi BlastbullPosition Specific Iterated - automatic profile-

like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

62

bull Alignment with distantly related proteins

63

bull Experimental evidence supports the finding that His212 Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme Relative Km kcat kcatKmActivity (mM) (min-1) (mM-1

min-1)

Wild type 100 04 388 969His48Ala 16 056 75 134His63Ala 31 10 142 142His114Ala 28 085 125 147His124Ala 48 084 321 381His135Ala 22 059 117 198His212Ala lt0007 nd ndHis268Ala lt0003 nd nd Asp14Ala 5 086 056 07Asp113Ala 63 045 238 528Asp131Ala 68 048 363 755Asp203Ala 32 091 123 135Asp214Ala lt0004 nd nd

Isopenicillin N Synthase

64

ndash IPNS

65

66

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies ndash Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (contrsquod)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA ndash bacteria only
  • MSA ndash bacteria amp fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 28: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

34

נקודות עדינותנותנים משקלות שונים לרצפים כך שאם יש מספר רצפים מאד דומים bull

המשקל היחסי של כל אחד יקטןשיטה מיוחדת לקביעת מחיר להכנסת רווחיםbull

חסרונותמאחר שהבניה היא בלתי הפיכה יש תלות גדולה מאד באיכות ההתאמה bull

של הזוגות הראשוניםOnce a Gap Always a gapרווחים אינם נעלמים bullרגישות רבה לקנס הרווחיםbull

יתרונותמהירbullסבירbullאפשר לקבוע הרבה פרמטריםbullכולם משתמשיםbull

CLUSTALW algorithm

CLUSTALW algorithmbull We can deduce a pairwise alignment for each two

sequences in the multiple alignment (projected pairwise alignment)

bull The projected pairwise alignment is NOT the best pairwise alignment for the two sequences

bull CLUTALW is not an optimal algorithm Better alignments might exist The algorithm yields a possible alignment but not necessarily the best one35

Best Pairwise alignment (optimal)

Projected Pairwise alignment

ClustalW at EMBL

36

httpwwwebiacukclustalw Clustalw at the SRS site at EBI

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output Aln format

37

MSA algorithms

bull Progressive methods (CLUSTALWT-Coffee) bull Iterative methods (Dialign)bull Direct optimization (Monte Carlo genetic

algorithms)bull Local methods eMotifs Blocks Psi-blast

38

MSA Editing Jalview

39

Conservation

wwwesembnetorgServicesMolBiojalviewindexhtml

MSA formats - fasta

40

MSA formats - Aln

41

MSA formats - MSF

42

Example 1a a good MSA

43

44

Example 1b making MSA of distantly related proteins

45

Example 1c including more distant relatives in the MSA

Example 2 Isopenicillin N Synthasebull Mononuclear iron proteins ndash electron carrier proteins

Iron atoms are bound to amino acid side chainsbull In IPNS the metal ion is coordinated by three protein

residues bull IPNS is involved in biosynthesis of penicillin

46

N

SN

OCOOHO

H

H

Me

Me

COOH

NH2

N

SN

OCOOH

NH2 Me

Me

COOHO

ACV

Isopenicillin N

Fe+2

Ascorbate

O2

2H2O Fe

N

NHis268

H

N

NHis212

H

SACV

O2 (NO)

H2O

OAsp214

O

Research IPNS

bull Goal Identify Fe+2 binding residuesbull Possible solutions

1 In the lab2 Bioinformatic approach (comparing different

IPNS sequences)

47

Step 1

Multiple alignment of known IPNS

Implementation1 Obtain sequence (eg for NCBI) IPNS AND Bacteria[Organism] 2 MSA (clustalw) and search for conserved residues in the MSA

48

MSA ndash bacteria only

49

Not enough variation

50

MSA ndash bacteria amp fungi

Not enough variation

bacteria amp fungi

bacteria

Step 2Goal Add more enzymes similar to IPNS

Implementationbull Search in httpwwwexpasyorgtoolsblastbull Blast IPNS_CEPAC as querybull Select sequences similar to the query in the entire lengthbull Export in FASTA formatbull Run CLUSTALW

51

52

Step 2Goal Add more enzymes similar to IPNS

ImplementationbullSearch in httpwwwexpasyorgtoolsblast

bullBlast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire length

bullExport in FASTA formatbullRun CLUSTALW

53

bull New multiple alignment narrowing down the possibilities

54

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite similarbull Not enough variability to categorize the active sitesbull We need to obtain even more distant sequences

(distant homologs)

55

Step 3Using the results of the MSA for further searches

Implementation1 Obtain an MSA (clustalw)2 - Construct a consensus sequence and perform a new search

OR - Construct a profile and perform a new search

3 MSA (clustalw) and search for conserved residues in the MSA

56

Consensus Sequence

bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column

bull Consensus each position reflects the most common character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

57

Profilebull We can deduce a statistical model describing the

multiple sequence alignment A Profile holds statistical information about characters in alignment at each column

bull Profile each position reflects the frequency of the character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 1 067 0 0

T 0 033 1 1

C 0 0 0 0

G 0 0 0 0

58

Profile vs ConsensusbullThe following multiple alignments will have

the same consensusA A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

59

Profile vs ConsensusbullBut have a different profile

A A C T T G C

A A G T C G T

C A C T T C T

A A C T T G T

A A C T T G T

A A C T T C T

1 2 3 4 5 6

A 066 1 0 0

T 0 0 0 1

C 033 0 066

0

G 0 0 033

0

1 2 3 4 5 6

A 1 1 0 0

T 0 0 0 1

C 0 0 1 0

G 0 0 0 0

60

Sequence LOGO A A C C G C T C T T

A G C C G C G C - T

A - C A G A G C C T

A A G C A C G C - T

A C G G G T G C T T

A T G C ndash C G C - T

A gc c g G C T

httpweblogoberkeleyedu

61

Psi BlastbullPosition Specific Iterated - automatic profile-

like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

62

bull Alignment with distantly related proteins

63

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative Activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

Isopenicillin N Synthase – IPNS

64

65

66

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies – Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (cont'd)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA – bacteria only
  • MSA – bacteria & fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66
Page 29: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

CLUSTALW algorithm

• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment)
• The projected pairwise alignment is NOT necessarily the best pairwise alignment for the two sequences
• CLUSTALW is not an optimal algorithm. Better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.

35

Best Pairwise alignment (optimal)

Projected Pairwise alignment
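Projecting a multiple alignment onto two of its rows is mechanical: keep the two rows and drop every column in which both hold a gap. The sketch below shows this on the three-sequence example from the definition slides earlier in the deck (S1=AGGTC, S2=GTTCG, S3=TGAAC, first possible alignment); the function name `project_pair` is an assumption for illustration.

```python
def project_pair(msa, i, j):
    """Project rows i and j of an MSA onto a pairwise alignment.

    Columns in which both rows hold a gap are dropped.
    """
    kept = [(a, b) for a, b in zip(msa[i], msa[j]) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

# Rows of the first possible alignment of S1, S2, S3.
msa = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]

top, bottom = project_pair(msa, 1, 2)
print(top)     # -GTTCG
print(bottom)  # TGAAC-
```

The projected alignment of S2 and S3 is one valid pairwise alignment of the two sequences, but nothing guarantees that it scores as well as the alignment a dedicated pairwise algorithm would find.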

ClustalW at EMBL

36

http://www.ebi.ac.uk/clustalw (ClustalW at the SRS site at EBI)

httpwwwcstauacil~ulitskyicgGBAfastatxt

ClustalW Output Aln format

37

MSA algorithms

• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast

38

MSA Editing Jalview

39

Conservation

www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta

40

MSA formats - Aln

41

MSA formats - MSF

42

Example 1a: a good MSA

43

44

Example 1b: making an MSA of distantly related proteins

45

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins – electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

46

[Figure: IPNS reaction scheme, ACV → isopenicillin N (Fe+2, ascorbate, O2, 2 H2O), and the active-site iron with its ligands His212, Asp214 and His268, the ACV sulfur, O2 (NO) and H2O]

Research IPNS

• Goal: identify Fe+2 binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

47

Step 1

Multiple alignment of known IPNS

Implementation
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. Run an MSA (ClustalW) and search for conserved residues in the MSA
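Once an alignment is available, the search for conserved residues can be sketched as a scan for columns in which every row carries the same residue. The alignment below is a made-up toy example, not IPNS data, and the function name is hypothetical; a real analysis would more likely use a conservation score than strict identity.

```python
def conserved_columns(alignment):
    """Return (1-based column, residue) for columns where every row agrees."""
    hits = []
    for index, column in enumerate(zip(*alignment), start=1):
        residues = set(column)
        # A column counts as conserved only if it has one residue and no gap.
        if len(residues) == 1 and "-" not in residues:
            hits.append((index, column[0]))
    return hits

# Hypothetical toy alignment used only to exercise the function.
toy_alignment = ["MHAD-K", "MHGDEK", "MHSD-R"]
print(conserved_columns(toy_alignment))  # [(1, 'M'), (2, 'H'), (4, 'D')]
```

The reported positions are alignment columns, so they need to be mapped back to residue numbers in a specific sequence before comparing them with names such as His212.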

48

MSA – bacteria only

49

Not enough variation

50

MSA – bacteria & fungi

Not enough variation

bacteria & fungi

bacteria

Step 2

Goal: add more enzymes similar to IPNS

Implementation
• Search in http://www.expasy.org/tools/blast
• Blast with IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW
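Between the export and the alignment step it is often handy to read the FASTA file programmatically, for example to check how many sequences were selected and how long they are. The parser below is a minimal sketch under the usual FASTA convention (a `>` header line followed by one or more sequence lines); the record names and sequences in the example are invented, and the function name is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Invented records, only to show the expected input shape.
example = """>seq1 example
MKT
AYI
>seq2
MKS
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```

Comparing each sequence length with the query length is one simple way to approximate the "similar over the entire length" selection step before running the alignment.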

51

52


53

• New multiple alignment: narrowing down the possibilities

54

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite similarbull Not enough variability to categorize the active sitesbull We need to obtain even more distant sequences

(distant homologs)

55

Page 34: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

MSA formats - FASTA

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making an MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase

• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

[Figure: IPNS reaction scheme. ACV is converted to isopenicillin N in the presence of Fe+2, ascorbate and O2, releasing 2 H2O. In the active site the Fe is bound by His212, Asp214 and His268, together with the sulfur of ACV, O2 (NO) and H2O.]

Research IPNS

• Goal: identify the Fe+2-binding residues
• Possible solutions:
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

Step 1

Multiple alignment of known IPNS

Implementation:
1. Obtain sequences (e.g. from NCBI): IPNS AND Bacteria[Organism]
2. MSA (ClustalW) and search for conserved residues in the MSA
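The "search for conserved residues" part of Step 1 can be sketched in a few lines of Python. This is only a minimal illustration, assuming the alignment is already available as equal-length gapped strings (for example, read from ClustalW output). The function name `conserved_columns` and the toy sequences are hypothetical; they are not IPNS data.

```python
def conserved_columns(alignment):
    """Return (index, residue) for every column in which all rows
    carry the same residue and no row has a gap ('-')."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append((index, column[0]))
    return conserved


# Toy alignment (made up for illustration only).
toy_alignment = ["MHD-KH", "MHDAKH", "LHDGRH"]
print(conserved_columns(toy_alignment))
```

Column indices are 0-based positions in the alignment, not residue numbers in any single protein; mapping them back to a numbering such as His212 requires counting the non-gap characters of the sequence of interest.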

MSA - bacteria only

Not enough variation

MSA - bacteria & fungi

Not enough variation

[Figure: alignment with the "bacteria" and "bacteria & fungi" sequence groups marked]

Step 2

Goal: add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• BLAST with IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run ClustalW

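The selection and export steps of Step 2 can be illustrated with a small Python sketch. It assumes the BLAST hits have already been parsed into tuples of (name, query start, query end, hit sequence); that tuple layout, the function names, and the 0.9 coverage cutoff are all assumptions made for this example and are not part of any BLAST or ExPASy interface.

```python
def full_length_hits(hits, query_length, min_coverage=0.9):
    """Keep hits whose aligned region covers at least min_coverage
    of the query. Coordinates are 1-based and inclusive."""
    kept = []
    for name, q_start, q_end, sequence in hits:
        covered = q_end - q_start + 1
        if covered / query_length >= min_coverage:
            kept.append((name, sequence))
    return kept


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapping sequence lines."""
    lines = []
    for name, sequence in records:
        lines.append(">" + name)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical hits against a query of length 100.
example_hits = [("hitA", 1, 100, "ACDE"), ("hitB", 40, 100, "FGH")]
print(to_fasta(full_length_hits(example_hits, query_length=100)))
```

The resulting FASTA text is the kind of input a multiple alignment program such as ClustalW expects.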

• New multiple alignment: narrowing down the possibilities

Simple multiple alignment

• The known IPNS sequences are very similar
• The sequences of closely related enzymes are also quite similar
• There is not enough variability to characterize the active site
• We need to obtain even more distant sequences (distant homologs)

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (ClustalW)
2. Either construct a consensus sequence and perform a new search,
   or construct a profile and perform a new search
3. MSA (ClustalW) and search for conserved residues in the MSA

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.
• Consensus: each position reflects the most common character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T
-------------
A A C T T G T   (consensus)
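As a minimal sketch, the consensus of the three-sequence example can be computed column by column in Python. The function name `consensus` is hypothetical, and the alphabetical tie-break is an assumption: the slides do not say what to do when two characters are equally frequent.

```python
from collections import Counter


def consensus(alignment):
    """Return the most frequent character of each alignment column.
    Ties are broken alphabetically (an assumption for this sketch)."""
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        highest = max(counts.values())
        result.append(min(ch for ch, n in counts.items() if n == highest))
    return "".join(result)


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(alignment))  # AACTTGT
```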

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column.
• Profile: each position reflects the frequency of each character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T

Position  1     2     3     4
A         1     0.67  0     0
T         0     0.33  0     1
C         0     0     1     0
G         0     0     0     0
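A frequency profile for the full length of the three-sequence example can be computed as in the following sketch. The function name `profile` and the dictionary-of-lists layout are choices made for this illustration; real profile tools typically also add pseudocounts and sequence weights, which are left out here.

```python
from collections import Counter


def profile(alignment, alphabet="ATCG"):
    """Map each character to its per-column relative frequency."""
    n_rows = len(alignment)
    columns = list(zip(*alignment))
    return {
        ch: [Counter(column)[ch] / n_rows for column in columns]
        for ch in alphabet
    }


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for ch, frequencies in profile(alignment).items():
    print(ch, [round(f, 2) for f in frequencies])
```

Each column of the result sums to 1, since every row contributes exactly one character per column.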

Profile vs Consensus

• The following multiple alignments will have the same consensus:

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus (both):
A A C T T G T

Profile vs Consensus

• But they have different profiles:

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Position  1     2     3     4
A         0.67  1     0     0
T         0     0     0     1
C         0.33  0     0.67  0
G         0     0     0.33  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Position  1     2     3     4
A         1     1     0     0
T         0     0     0     1
C         0     0     1     0
G         0     0     0     0
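The point of this comparison can be checked with a short Python sketch: both alignments yield the same consensus, yet their column frequencies differ. The helper names are hypothetical, and ties in the consensus are broken alphabetically as an assumption.

```python
from collections import Counter


def column_frequencies(alignment):
    """One {character: relative frequency} dict per alignment column."""
    n_rows = len(alignment)
    return [
        {ch: count / n_rows for ch, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def consensus(alignment):
    """Most frequent character per column; ties broken alphabetically."""
    return "".join(
        max(sorted(freqs), key=freqs.get)
        for freqs in column_frequencies(alignment)
    )


alignment_1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
alignment_2 = ["AACTTGT", "AACTTGT", "AACTTCT"]

same_consensus = consensus(alignment_1) == consensus(alignment_2)
# 1-based positions at which the two profiles disagree.
differing = [
    position
    for position, (f1, f2) in enumerate(
        zip(column_frequencies(alignment_1), column_frequencies(alignment_2)),
        start=1,
    )
    if f1 != f2
]
print(same_consensus, differing)
```

A search with the consensus therefore treats the two families identically, while a profile-based search can still distinguish them.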

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
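A sequence logo draws each column with a total height equal to its information content, so fully conserved columns stand tallest. The sketch below computes that height in bits for the alignment above. It is a simplified illustration under stated assumptions: gap characters are ignored, a four-letter alphabet is used, and the small-sample correction applied by logo tools is omitted. The function name `column_information` is hypothetical.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=4):
    """Information content (bits) of each column: log2(alphabet size)
    minus the Shannon entropy of the non-gap characters."""
    max_bits = math.log2(alphabet_size)
    bits = []
    for column in zip(*alignment):
        symbols = [ch for ch in column if ch != "-"]
        if not symbols:  # column made of gaps only
            bits.append(0.0)
            continue
        counts = Counter(symbols)
        entropy = -sum(
            (n / len(symbols)) * math.log2(n / len(symbols))
            for n in counts.values()
        )
        bits.append(max_bits - entropy)
    return bits


alignment = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
for position, value in enumerate(column_information(alignment), start=1):
    print(position, round(value, 2))
```

Columns 1, 8 and 10 are identical in every sequence and reach the maximum of 2 bits.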

PSI-BLAST

• Position-Specific Iterated BLAST: an automatic profile-like search

1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST profile search
4. Final results
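The search, build profile, search again loop can be illustrated with a toy Python sketch. It is not PSI-BLAST itself: the real program uses gapped local alignments, sequence weighting and E-value thresholds. Here every sequence is assumed to be ungapped DNA of the same length, the profile is a log-odds matrix with a pseudocount of 1 against a uniform background, and the score threshold of 2.0 is arbitrary. All names and sequences are made up for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pssm(sequences, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*sequences):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            ch: math.log2(((counts[ch] + pseudocount) / total) / background)
            for ch in ALPHABET
        })
    return pssm


def score(sequence, pssm):
    """Sum of the column scores of an ungapped, same-length sequence."""
    return sum(column[ch] for ch, column in zip(sequence, pssm))


def iterated_search(query, database, threshold, max_rounds=5):
    """Repeat: build a profile from the current members, search the
    database with it, and stop once the set of members is stable."""
    members = [query]
    for _ in range(max_rounds):
        pssm = build_pssm(members)
        hits = [s for s in database if score(s, pssm) >= threshold]
        new_members = [query] + [s for s in hits if s != query]
        if new_members == members:
            break
        members = new_members
    return members


database = ["AACTTCT", "AAGTTCT", "CAGTTCA", "GGGGGGG"]
print(iterated_search("AACTTGT", database, threshold=2.0))
```

In this toy run the third database sequence scores below the threshold against the query alone and is only picked up in the second round, after the profile has absorbed the closer hits. That is the motivation for iterating when looking for distant homologs.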

• Alignment with distantly related proteins

Isopenicillin N Synthase (IPNS)

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd
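The conclusion drawn from the mutant table can be reproduced with a short Python sketch. The relative activities below are the table's values read with their decimal points restored (an assumption about the flattened numbers), with each "less than" entry stored as its upper bound. The cutoff of 1.0 and the function names are choices made for this illustration only.

```python
# Relative activity of each alanine mutant; "<x" entries stored as x.
relative_activity = {
    "His48Ala": 16, "His63Ala": 31, "His114Ala": 28, "His124Ala": 48,
    "His135Ala": 22, "His212Ala": 0.007, "His268Ala": 0.003,
    "Asp14Ala": 5, "Asp113Ala": 63, "Asp131Ala": 68, "Asp203Ala": 32,
    "Asp214Ala": 0.004,
}


def essential_residues(activities, cutoff=1.0):
    """Residues whose replacement by Ala drops activity below the cutoff."""
    return sorted(
        name[:-3]  # strip the trailing "Ala"
        for name, activity in activities.items()
        if activity < cutoff
    )


def catalytic_efficiency(kcat, km):
    """kcat/Km in mM-1 min-1, given kcat in min-1 and Km in mM."""
    return kcat / km


print(essential_residues(relative_activity))
print(round(catalytic_efficiency(388, 0.4)))  # wild type, close to the listed 969
```

Only the three mutants with essentially no activity are returned, which matches the residues named as Fe+2 ligands.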

Page 39: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

45

Example 1c including more distant relatives in the MSA

Example 2 Isopenicillin N Synthasebull Mononuclear iron proteins ndash electron carrier proteins

Iron atoms are bound to amino acid side chainsbull In IPNS the metal ion is coordinated by three protein

residues bull IPNS is involved in biosynthesis of penicillin

46

N

SN

OCOOHO

H

H

Me

Me

COOH

NH2

N

SN

OCOOH

NH2 Me

Me

COOHO

ACV

Isopenicillin N

Fe+2

Ascorbate

O2

2H2O Fe

N

NHis268

H

N

NHis212

H

SACV

O2 (NO)

H2O

OAsp214

O

Research IPNS

bull Goal Identify Fe+2 binding residuesbull Possible solutions

1 In the lab2 Bioinformatic approach (comparing different

IPNS sequences)

47

Step 1

Multiple alignment of known IPNS

Implementation1 Obtain sequence (eg for NCBI) IPNS AND Bacteria[Organism] 2 MSA (clustalw) and search for conserved residues in the MSA

48

MSA ndash bacteria only

49

Not enough variation

50

MSA ndash bacteria amp fungi

Not enough variation

bacteria amp fungi

bacteria

Step 2Goal Add more enzymes similar to IPNS

Implementationbull Search in httpwwwexpasyorgtoolsblastbull Blast IPNS_CEPAC as querybull Select sequences similar to the query in the entire lengthbull Export in FASTA formatbull Run CLUSTALW

51

52

Step 2Goal Add more enzymes similar to IPNS

ImplementationbullSearch in httpwwwexpasyorgtoolsblast

bullBlast IPNS_CEPAC as querybullSelect sequences similar to the query in the entire length

bullExport in FASTA formatbullRun CLUSTALW

53

bull New multiple alignment narrowing down the possibilities

54

Simple multiple alignment

bull The known IPNS sequences are very similarbull Close enzymes sequences are also quite similarbull Not enough variability to categorize the active sitesbull We need to obtain even more distant sequences

(distant homologs)

55

Step 3Using the results of the MSA for further searches

Implementation1 Obtain an MSA (clustalw)2 - Construct a consensus sequence and perform a new search

OR - Construct a profile and perform a new search

3 MSA (clustalw) and search for conserved residues in the MSA

56

Consensus Sequence

bull We can deduce a consensus sequence from the multiple sequence alignment The consensus sequence holds the most frequent character of the alignment at each column

bull Consensus each position reflects the most common character found at a position

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

57

Profile
• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column.
• Profile: each position reflects the frequency of each character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T

Position  1     2     3     4     5     6     7
A         1     0.67  0     0     0     0     0
T         0     0.33  0     1     1     0     1
C         0     0     1     0     0     0.33  0
G         0     0     0     0     0     0.67  0
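The frequency profile of the three example sequences can be computed directly, as in the sketch below. The function name `profile` and the dictionary-per-column layout are choices made for this illustration. Real profile tools usually add pseudocounts and sequence weighting, which are left out here.

```python
from collections import Counter


def profile(alignment, alphabet="ATCG"):
    """Return one {character: frequency} dict per alignment column."""
    n = len(alignment)
    columns = []
    for column in zip(*alignment):
        counts = Counter(column)
        columns.append({ch: counts[ch] / n for ch in alphabet})
    return columns


rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for pos, freqs in enumerate(profile(rows), start=1):
    print(pos, {ch: round(f, 2) for ch, f in freqs.items()})
```

Position 2 comes out as A = 0.67 and T = 0.33, and position 6 as G = 0.67 and C = 0.33. All other positions are fully conserved.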

58

Profile vs Consensus
• The following multiple alignments will have the same consensus

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Consensus of both:
A A C T T G T

59

Profile vs Consensus
• But have a different profile

Alignment 1:
A A C T T G C
A A G T C G T
C A C T T C T

Position  1     2     3     4     5     6     7
A         0.67  1     0     0     0     0     0
T         0     0     0     1     0.67  0     0.67
C         0.33  0     0.67  0     0.33  0.33  0.33
G         0     0     0.33  0     0     0.67  0

Alignment 2:
A A C T T G T
A A C T T G T
A A C T T C T

Position  1     2     3     4     5     6     7
A         1     1     0     0     0     0     0
T         0     0     0     1     1     0     1
C         0     0     1     0     0     0.33  0
G         0     0     0     0     0     0.67  0
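One way to see why the difference matters is to score a query against each profile. The sketch below multiplies the per-column frequencies of the query's characters. This naive product rule, the helper names and the query `CACTTCT` are assumptions for illustration. Practical profile searches use log-odds scores with pseudocounts instead.

```python
from collections import Counter


def profile(alignment, alphabet="ATCG"):
    """Return one {character: frequency} dict per alignment column."""
    n = len(alignment)
    return [{ch: Counter(col)[ch] / n for ch in alphabet} for col in zip(*alignment)]


def consensus(alignment):
    """Most frequent character per column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


def match_probability(sequence, prof):
    """Product of the profile frequencies of the sequence's characters."""
    p = 1.0
    for ch, freqs in zip(sequence, prof):
        p *= freqs.get(ch, 0.0)
    return p


aln1 = ["AACTTGC", "AAGTCGT", "CACTTCT"]
aln2 = ["AACTTGT", "AACTTGT", "AACTTCT"]
query = "CACTTCT"

print(consensus(aln1), consensus(aln2))            # same consensus for both
print(match_probability(query, profile(aln1)))     # small but non-zero
print(match_probability(query, profile(aln2)))     # zero: C never occurs at position 1
```

Both alignments share the consensus AACTTGT, yet only the first profile gives the query a non-zero score. A consensus-based search cannot make that distinction.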

60

Sequence LOGO

A A C C G C T C T T
A G C C G C G C - T
A - C A G A G C C T
A A G C A C G C - T
A C G G G T G C T T
A T G C - C G C - T

A gc c g G C T

http://weblogo.berkeley.edu
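The height of a column in a sequence logo is commonly taken as its information content: the maximum entropy (2 bits for DNA) minus the observed entropy of the column. The sketch below computes that quantity for the example alignment. It ignores gaps and omits the small-sample correction that logo tools typically apply, so the numbers are only approximate. The function name `column_information` is an assumption made here.

```python
import math
from collections import Counter


def column_information(alignment, alphabet="ACGT"):
    """Information content in bits per column; gap characters are skipped."""
    max_bits = math.log2(len(alphabet))
    bits = []
    for column in zip(*alignment):
        residues = [ch for ch in column if ch in alphabet]
        if not residues:
            bits.append(0.0)
            continue
        total = len(residues)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in Counter(residues).values())
        bits.append(max_bits - entropy)
    return bits


rows = [
    "AACCGCTCTT",
    "AGCCGCGC-T",
    "A-CAGAGCCT",
    "AAGCACGC-T",
    "ACGGGTGCTT",
    "ATGC-CGC-T",
]
for pos, b in enumerate(column_information(rows), start=1):
    print(pos, round(b, 2))
```

Fully conserved columns (1, 8 and 10) reach the maximum of 2 bits, while column 3, split evenly between C and G, carries 1 bit.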

61

PSI-BLAST
• Position-Specific Iterated BLAST: an automatic profile-like search

1. Regular BLAST
2. Construct profile from BLAST results
3. BLAST profile search
4. Final results
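The iteration can be illustrated with a toy model. The sketch below is not PSI-BLAST itself. It assumes ungapped DNA sequences of equal length, a uniform background, a pseudocount of 1 and a fixed score threshold, whereas the real program works with gapped local alignments, sequence weighting and E-value cut-offs. All names and sequences are invented for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_profile(sequences, pseudocount=1.0):
    """Per-column log-odds scores (bits) against a uniform background."""
    background = 1 / len(ALPHABET)
    prof = []
    for column in zip(*sequences):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        prof.append({ch: math.log2(((counts[ch] + pseudocount) / total) / background)
                     for ch in ALPHABET})
    return prof


def score(sequence, prof):
    """Sum of the column scores of the sequence's characters."""
    return sum(col[ch] for ch, col in zip(sequence, prof))


def iterated_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from the current hits and search again until the hit list is stable."""
    hits = []
    for _ in range(max_rounds):
        prof = build_profile([query] + hits)  # round 1 uses the query alone
        new_hits = [s for s in database if score(s, prof) >= threshold]
        if new_hits == hits:
            break
        hits = new_hits
    return hits


query = "AACTTGT"
database = ["AACTTCT", "AAGTTCT", "AAGTCCT", "GGTCAAC"]
print(iterated_search(query, database, threshold=3.0, max_rounds=1))
print(iterated_search(query, database, threshold=3.0))
```

With a single round only the closest sequence passes the threshold. Once that hit is folded into the profile, the two more distant sequences pass as well, and the unrelated sequence is never picked up.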

62

• Alignment with distantly related proteins

63

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd

64

- IPNS

65

66

  • Multiple Sequence Alignment
  • Example
  • Why multiple sequence alignment
  • Multiple Alignment Definition
  • Slide 6
  • Slide 8
  • Slide 9
  • Human-centric beta globin Multiple Alignment
  • MSA applications
  • Protein Phylogenies - Example
  • Scoring alignments
  • Sum of Pairs score
  • SOP Score Example
  • Slide 18
  • Consensus MSA
  • Slide 20
  • Tree MSA
  • Profile Representation of MA
  • Aligning a sequence to a profile
  • Aligning alignments
  • Multiple Alignment Greedy Heuristic
  • ClustalW Thompson Higgins Gibson 94
  • Step 1 Pairwise Alignment
  • Step 2 Guide Tree
  • Step 2 Guide Tree (cont'd)
  • Step 3 Progressive Alignment
  • Slide 33
  • Slide 34
  • CLUSTALW algorithm
  • ClustalW at EMBL
  • ClustalW Output Aln format
  • MSA algorithms
  • MSA Editing Jalview
  • MSA formats - fasta
  • MSA formats - Aln
  • MSA formats - MSF
  • Example 1a a good MSA
  • Slide 44
  • Slide 45
  • Example 2 Isopenicillin N Synthase
  • Research IPNS
  • Step 1
  • MSA - bacteria only
  • MSA - bacteria & fungi
  • Step 2
  • Slide 52
  • Slide 53
  • Simple multiple alignment
  • Step 3
  • Consensus Sequence
  • Profile
  • Profile vs Consensus
  • Slide 59
  • Slide 60
  • Psi Blast
  • Slide 62
  • Isopenicillin N Synthase
  • Slide 64
  • Slide 65
  • Slide 66

Example 2: Isopenicillin N Synthase
• Mononuclear iron proteins - electron carrier proteins. Iron atoms are bound to amino acid side chains
• In IPNS the metal ion is coordinated by three protein residues
• IPNS is involved in the biosynthesis of penicillin

46

[Figure: reaction scheme, ACV to Isopenicillin N (Fe+2, ascorbate, O2, 2 H2O); active-site Fe shown with His268, His212, Asp214, the ACV sulfur, O2 (NO) and H2O]

Research IPNS

• Goal: Identify Fe+2 binding residues
• Possible solutions
  1. In the lab
  2. Bioinformatic approach (comparing different IPNS sequences)

47


52

Step 2

Goal: Add more enzymes similar to IPNS

Implementation:
• Search in http://www.expasy.org/tools/blast
• Blast IPNS_CEPAC as the query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW
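The selection and export steps can be sketched as follows, assuming the search results are available as tuples giving the span of the query that each hit aligns to. The helper names `covers_query` and `to_fasta`, the 0.9 coverage cutoff, and the placeholder hits are all hypothetical and are not taken from the slides or from any BLAST tool.

```python
def covers_query(q_start, q_end, query_length, min_coverage=0.9):
    """True if the aligned query span [q_start, q_end] (1-based, inclusive)
    covers at least min_coverage of the whole query."""
    return (q_end - q_start + 1) / query_length >= min_coverage

def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"

# Hypothetical hits: (identifier, query start, query end, placeholder sequence).
hits = [
    ("hit_full", 3, 98, "MKV" * 32),
    ("hit_partial", 40, 75, "MKV" * 12),
]
query_length = 100

selected = [
    (name, seq)
    for name, start, end, seq in hits
    if covers_query(start, end, query_length)
]
print(to_fasta(selected))
```

Only the hit that aligns along nearly the whole query is kept; the resulting FASTA text is the kind of input a multiple alignment program such as CLUSTALW expects.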

53

• New multiple alignment, narrowing down the possibilities

54

Simple multiple alignment

• The known IPNS sequences are very similar
• Sequences of closely related enzymes are also quite similar
• There is not enough variability to categorize the active sites
• We need to obtain even more distant sequences (distant homologs)

55

Step 3

Using the results of the MSA for further searches

Implementation:
1. Obtain an MSA (ClustalW)
2. Either construct a consensus sequence and perform a new search, OR construct a profile and perform a new search
3. Run an MSA (ClustalW) and search for conserved residues in the MSA

56

Consensus Sequence

• We can deduce a consensus sequence from the multiple sequence alignment. The consensus sequence holds the most frequent character of the alignment at each column.
• Consensus: each position reflects the most common character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T

A A C T T G T   (consensus)
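A minimal sketch of the consensus rule, taking the most frequent character in each column. The function name `consensus` and the alphabetical tie-break are assumptions made here; the slides do not say how ties are resolved.

```python
from collections import Counter

def consensus(alignment):
    """Return the most frequent character of each column of an alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        best = max(counts.values())
        # Ties are broken alphabetically; this is an arbitrary choice.
        result.append(min(ch for ch, n in counts.items() if n == best))
    return "".join(result)

rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
print(consensus(rows))  # prints AACTTGT
```

For the three sequences on the slide this reproduces the consensus row A A C T T G T.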

57

Profile

• We can deduce a statistical model describing the multiple sequence alignment. A profile holds statistical information about the characters in the alignment at each column.
• Profile: each position reflects the frequency of each character found at that position.

A T C T T G T
A A C T T G T
A A C T T C T

Profile (only positions 1-4 are filled in):

   1     2     3     4     5     6
A  1     0.67  0     0
T  0     0.33  0     1
C  0     0     1     0
G  0     0     0     0
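A minimal sketch of how such a frequency profile can be computed from an alignment. The function name `profile` and the default alphabet order are assumptions for illustration; gaps and pseudocounts are not handled.

```python
from collections import Counter

def profile(alignment, alphabet="ATCG"):
    """Map each character to its list of per-column relative frequencies."""
    n_rows = len(alignment)
    table = {ch: [] for ch in alphabet}
    for column in zip(*alignment):
        counts = Counter(column)
        for ch in alphabet:
            table[ch].append(counts[ch] / n_rows)
    return table

rows = ["ATCTTGT", "AACTTGT", "AACTTCT"]
for ch, freqs in profile(rows).items():
    print(ch, " ".join(f"{f:.2f}" for f in freqs))
```

For the three sequences on the slide, position 2 gives A = 0.67 and T = 0.33, and position 3, where every row has C, gives C = 1.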

58

Profile vs Consensus

• The following multiple alignments will have the same consensus

First alignment:

A A C T T G C
A A G T C G T
C A C T T C T

Second alignment:

A A C T T G T
A A C T T G T
A A C T T C T

Shared consensus:

A A C T T G T
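The point of this comparison can be checked directly: the two alignments give the same consensus string even though their column frequencies differ. This is a small sketch; the helper names and the alphabetical tie-break are assumptions made here.

```python
from collections import Counter

def column_frequencies(alignment):
    """One dict per column: character -> relative frequency."""
    n = len(alignment)
    return [
        {ch: c / n for ch, c in Counter(col).items()}
        for col in zip(*alignment)
    ]

def consensus(alignment):
    """Most frequent character per column (ties broken alphabetically)."""
    return "".join(
        min(freqs, key=lambda ch: (-freqs[ch], ch))
        for freqs in column_frequencies(alignment)
    )

first = ["AACTTGC", "AAGTCGT", "CACTTCT"]
second = ["AACTTGT", "AACTTGT", "AACTTCT"]

print(consensus(first), consensus(second))
print(column_frequencies(first) == column_frequencies(second))
```

Both calls to `consensus` return the same string, while the frequency comparison is false: the consensus discards the per-column variability that the profile keeps.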

Page 55: Multiple Sequence Alignment Based on slides by Irit Gat-Viks 1

61

Psi BlastbullPosition Specific Iterated - automatic profile-

like search

Regular blast

Construct profile from blast results

Blast profile search

Final results

62

bull Alignment with distantly related proteins

Isopenicillin N Synthase

• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe+2 in IPNS

Enzyme      Relative activity   Km (mM)   kcat (min-1)   kcat/Km (mM-1 min-1)
Wild type   100                 0.4       388            969
His48Ala    16                  0.56      75             134
His63Ala    31                  1.0       142            142
His114Ala   28                  0.85      125            147
His124Ala   48                  0.84      321            381
His135Ala   22                  0.59      117            198
His212Ala   <0.007              nd        nd
His268Ala   <0.003              nd        nd
Asp14Ala    5                   0.86      0.56           0.7
Asp113Ala   63                  0.45      238            528
Asp131Ala   68                  0.48      363            755
Asp203Ala   32                  0.91      123            135
Asp214Ala   <0.004              nd        nd
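The last column of the table is the ratio of the two before it, kcat/Km. A minimal Python check on a few rows, assuming the Km values read as 0.4, 0.56 and so on, and taking the kcat and kcat/Km digits literally as printed (the ratio check does not depend on their common scale); `catalytic_efficiency` is a hypothetical helper name:

```python
# (Km in mM, kcat in min-1, reported kcat/Km in mM-1 min-1), as read from the table
rows = {
    "Wild type": (0.4, 388, 969),
    "His48Ala": (0.56, 75, 134),
    "His63Ala": (1.0, 142, 142),
    "Asp14Ala": (0.86, 0.56, 0.7),
}


def catalytic_efficiency(km, kcat):
    """Catalytic efficiency kcat/Km."""
    return kcat / km


for name, (km, kcat, reported) in rows.items():
    ratio = catalytic_efficiency(km, kcat)
    print(f"{name:10s} computed {ratio:8.2f}   reported {reported}")
```

The computed ratios agree with the reported column to within rounding, which supports this reading of the columns. The His212Ala, His268Ala and Asp214Ala rows are left out because their kinetic constants are listed as nd.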
