tutorial 5 multiple sequence alignments and motif...
• Multiple sequence alignment
– ClustalW
– Muscle
• Motif discovery
– MEME
– Jaspar
Multiple sequence alignments and motif discovery
• More than two sequences
– DNA
– Protein
• Evolutionary relation
– Homology, phylogenetic tree
– Detect motif
Multiple Sequence Alignment
Aligned:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
(Tree diagram with leaves A, B, C, D)
Unaligned:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
• Dynamic Programming
– Optimal alignment
– Exponential in #Sequences
• Progressive
– Efficient
– Heuristic
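To make the dynamic-programming bullets concrete, here is a minimal sketch of the pairwise case (Needleman-Wunsch global alignment score) together with a helper that counts the cells a full N-sequence table would need. The scoring values (match +1, mismatch -1, gap -1) and the function names `nw_score` and `dp_cells` are illustrative assumptions, not taken from the slides.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two sequences (linear gap cost)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def dp_cells(lengths):
    """Number of cells in the full DP table for sequences of these lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


print(nw_score("ACGT", "AGT"))
print(dp_cells([300, 300]), dp_cells([300] * 5))
```

The table has one axis per sequence, so its size is the product of the sequence lengths: two sequences of length 300 need about 9 x 10^4 cells, five need about 2.5 x 10^12. That growth is why progressive heuristics are used for more than a handful of sequences.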
Multiple Sequence Alignment
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
(Tree diagram with leaves A, B, C, D)
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
ClustalW
“CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”, J D Thompson et al
• Pairwise alignment – calculate distance matrix
• Guide tree
• Progressive alignment using the guide tree
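The first two stages (distance matrix, then guide tree) can be sketched as follows. This is a simplified stand-in rather than ClustalW's own procedure: it assumes a normalised edit distance in place of ClustalW's pairwise alignment scores, and plain average-linkage clustering in place of the neighbour-joining tree that ClustalW builds. The sequence names `s1` to `s4` and all function names are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance, computed by pairwise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Map each unordered pair of names to a distance in [0, 1]."""
    return {
        frozenset((x, y)): edit_distance(seqs[x], seqs[y]) / max(len(seqs[x]), len(seqs[y]))
        for x, y in combinations(seqs, 2)
    }


def guide_tree(seqs):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    d = distance_matrix(seqs)
    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)

    def avg(i, j):
        mi, mj = clusters[i][1], clusters[j][1]
        return sum(d[frozenset((x, y))] for x in mi for y in mj) / (len(mi) * len(mj))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2), key=lambda p: avg(*p))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


toy = {"s1": "AAAA", "s2": "AAAT", "s3": "GGGG", "s4": "GGGC"}
print(guide_tree(toy))
```

The nested tuples give the order of the progressive stage: the most similar pair is aligned first, and each inner node joins two existing alignments or sequences.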
ClustalW
• Progressive
– At each step align two existing alignments or sequences
– Gaps present in older alignments remain fixed
-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC
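One way to read the example above, offered as an interpretation rather than something the slide states: the first three rows were aligned in earlier steps, and a later step opened a new leading gap column in all of them to make room for the initial A of the last two rows. The hypothetical helper below shows that such a step only adds gap columns and never removes or moves the gaps already placed.

```python
def add_gap_column(alignment, pos):
    """Insert a gap at column `pos` in every row of an existing alignment."""
    return [row[:pos] + "-" + row[pos:] for row in alignment]


def degap(row):
    """Recover the original sequence from an aligned row."""
    return row.replace("-", "")


older = ["TGTTAAC", "TGT-AAC", "TGT--AC"]  # alignment from earlier steps
widened = add_gap_column(older, 0)          # room for a leading residue
for row in widened + ["ATGT---C", "ATGT-GGC"]:
    print(row)

# Earlier gaps are still there, and no residues were changed.
assert [degap(r) for r in widened] == [degap(r) for r in older]
```

Because earlier gaps stay fixed, a misplaced gap from an early step carries through to the final alignment, which is the main cost of the progressive heuristic.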
ClustalW - Input
http://www.ebi.ac.uk/Tools/clustalw2/index.html
Input sequences
Gap scoring
Scoring matrix
Email address
Output format
Can we find motifs using multiple sequence alignment?
Pos  1    2    3    4    5    6    7    8    9    10
A    0    0    0    0    0    1/2  1/6  1/3  0    0
D    0    1/2  1/3  0    0    1/6  5/6  1/6  0    1/6
E    0    0    2/3  1    0    0    0    0    1    5/6
G    0    1/6  0    0    1    1/3  0    0    0    0
H    0    1/6  0    0    0    0    0    0    0    0
N    0    1/6  0    0    0    0    0    0    0    0
Y    1    0    0    0    0    0    0    1/2  0    0
  1 3 5 7 9
..YDEEGGDAEE..
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
  * :**   *:
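The per-position frequencies can be recomputed directly from the six aligned rows above. The sketch below is a minimal illustration; the names `SITES`, `frequency_matrix` and `consensus` are assumptions introduced here, and exact fractions are used so the values can be compared with the table by eye.

```python
from collections import Counter
from fractions import Fraction

# The ten aligned motif columns of the six sequences (flanking dots removed).
SITES = [
    "YDEEGGDAEE",
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]


def frequency_matrix(sites):
    """One dict per alignment column: residue -> fraction of rows showing it."""
    n = len(sites)
    return [
        {res: Fraction(count, n) for res, count in Counter(column).items()}
        for column in zip(*sites)
    ]


def consensus(matrix):
    """Most frequent residue at each position."""
    return "".join(max(col, key=col.get) for col in matrix)


matrix = frequency_matrix(SITES)
for pos, col in enumerate(matrix, 1):
    print(pos, {res: str(freq) for res, freq in sorted(col.items())})
print(consensus(matrix))
```

Residues that never occur in a column are simply absent from that column's dict, which corresponds to a 0 entry in the table.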
Motif
A widespread pattern with a biological significance
MEME – Multiple EM* for Motif Elicitation
• http://meme.sdsc.edu/
• Motif discovery from unaligned sequences
– Genomic or protein sequences
• Flexible model of motif presence (Motif can be absent in some sequences or appear several times in one sequence)
*Expectation-maximization
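To show what expectation-maximization means for motif finding, here is a deliberately small sketch. It is not MEME's implementation: it assumes exactly one motif site per sequence (the simplest of the presence models listed above), a uniform background, a single deterministic starting point seeded from the first window of the first sequence, and a fixed motif width. The function names, the pseudocount and the toy sequences are hypothetical.

```python
def em_motif(seqs, width, iterations=50, alphabet="ACGT", pseudo=0.1):
    """Alternate E and M steps; return (motif model, site posteriors)."""
    background = 1.0 / len(alphabet)
    seed = seqs[0][:width]
    # Start from a soft version of the seed word (0.7 + 3 * 0.1 = 1 per column).
    model = [{b: (0.7 if b == seed[k] else 0.1) for b in alphabet} for k in range(width)]
    posts = []
    for _ in range(iterations):
        # E-step: posterior probability of each start position in each sequence.
        posts = []
        for s in seqs:
            weights = []
            for i in range(len(s) - width + 1):
                p = 1.0
                for k in range(width):
                    p *= model[k][s[i + k]] / background
                weights.append(p)
            total = sum(weights)
            posts.append([w / total for w in weights])
        # M-step: re-estimate the motif columns from posterior-weighted counts.
        counts = [{b: pseudo for b in alphabet} for _ in range(width)]
        for s, post in zip(seqs, posts):
            for i, p in enumerate(post):
                for k in range(width):
                    counts[k][s[i + k]] += p
        model = [{b: c[b] / sum(c.values()) for b in alphabet} for c in counts]
    return model, posts


def best_sites(posts):
    """Most probable start position in each sequence."""
    return [max(range(len(p)), key=p.__getitem__) for p in posts]


# Toy input: the same 7-mer planted at different offsets (illustration only).
seqs = ["GATTACAGGC", "CCGATTACAT", "TGGATTACAC", "AGATTACATT"]
model, posts = em_motif(seqs, width=7)
print(best_sites(posts))
print("".join(max(col, key=col.get) for col in model))
```

EM only converges to a local optimum, so the result depends on the starting point. A real tool needs a strategy for choosing starting points and for handling sequences with zero or several sites.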
MEME - Input
Email address
Input file (fasta file)
How many times in each sequence?
How many motifs?
How many sites?
Range of motif lengths
MAST
• Searches for motifs (one or more) in sequence databases:
– Like BLAST, but takes motifs as input
– Similar to iterations of PSI-BLAST
• Profile defines strength of match
– Multiple motif matches per sequence
– Combined E value for all motifs
• MEME uses MAST to summarize results:
– Each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
JASPAR
• Profiles
– Transcription factor binding sites
– Multicellular eukaryotes
– Derived from published collections of experiments
• Open data access
JASPAR
• Profiles
– Modeled as matrices
– Can be converted into PSSMs for scanning genomic sequences
Pos  1    2    3    4    5    6    7    8    9    10
A    0    0    0    0    0    1/2  1/6  1/3  0    0
D    0    1/2  1/3  0    0    1/6  5/6  1/6  0    1/6
E    0    0    2/3  1    0    0    0    0    1    5/6
G    0    1/6  0    0    1    1/3  0    0    0    0
H    0    1/6  0    0    0    0    0    0    0    0
N    0    1/6  0    0    0    0    0    0    0    0
Y    1    0    0    0    0    0    0    1/2  0    0
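The conversion from a profile matrix to a PSSM, and its use for scanning, can be sketched as below. The count matrix is a made-up DNA toy, not a real JASPAR profile, and the pseudocount (0.5) and uniform background (0.25 per base) are assumptions chosen for illustration. Each PSSM entry is the log2 ratio of the motif probability to the background probability, and a window's score is the sum over its positions.

```python
import math

# Hypothetical count matrix: one row per base, one column per motif position.
# Toy values for illustration only.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 8, 1],
    "G": [0, 8, 0, 1],
    "T": [0, 0, 0, 5],
}


def to_pssm(counts, pseudocount=0.5, background=0.25):
    """Convert counts to log-odds scores, one dict per motif position."""
    width = len(next(iter(counts.values())))
    pssm = []
    for i in range(width):
        total = sum(counts[b][i] for b in counts) + pseudocount * len(counts)
        pssm.append({
            b: math.log2((counts[b][i] + pseudocount) / total / background)
            for b in counts
        })
    return pssm


def scan(seq, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (i, sum(col[seq[i + k]] for k, col in enumerate(pssm)))
        for i in range(len(seq) - width + 1)
    ]


pssm = to_pssm(COUNTS)
hits = scan("TTAGCTAAGCA", pssm)
best = max(hits, key=lambda hit: hit[1])
print(best)
```

The pseudocount keeps unseen bases from producing a score of minus infinity. In practice a score threshold decides which windows are reported as candidate binding sites.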