-translation biological sequence analysisbio.lundberg.gu.se/courses/ht08/bio1/seq_2008.pdf ·...
Biological sequence analysis
Tore Samuelsson, 26 Sep 2008

Concepts in sequence analysis - Translation

DNA      G C A   G T A   A G C   T T G   A T G
Protein    A       V       S       L       M

Concepts in sequence analysis - Pattern matching
(example: identification of restriction sites)

            HindIII
G C A G T A A G C T T G A T G
          ^^^^^^^^^^^
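The two concepts above can be tried out on the slide's own 15-nucleotide example. The following is a minimal Python sketch, not the course's own code: the function names `translate` and `find_sites` are made up for illustration, the codon table is the standard genetic code, and AAGCTT is the HindIII recognition sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate DNA in reading frame 1; a trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def find_sites(dna: str, site: str) -> list[int]:
    """Return the 1-based start position of every exact match of `site`."""
    dna, site = dna.upper(), site.upper()
    return [i + 1 for i in range(len(dna) - len(site) + 1)
            if dna[i:i + len(site)] == site]


dna = "GCAGTAAGCTTGATG"
print(translate(dna))             # AVSLM
print(find_sites(dna, "AAGCTT"))  # [6]
```

Exact string matching is enough for a fixed recognition sequence such as this one; sites with ambiguous positions would need a pattern language such as regular expressions.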
Concepts in sequence analysis - Probabilistic models

P(A) = 0.23
P(T) = 0.24
P(C) = 0.26
P(G) = 0.27

Modeling of splicing signals

     3' end of exon | 5' end of intron
       A      G     |   G      U      A
G    0.11   0.74    |  1.00   0.00   0.29
A    0.64   0.09    |  0.00   0.00   0.61
U    0.13   0.12    |  0.00   1.00   0.07
C    0.11   0.06    |  0.00   0.00   0.02

Concepts in sequence analysis - Long range dependencies

[Figure: tRNA; bases shown: A A G G G U U C G A U U C C C U U]
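The splice-signal table can be read as a position-specific probability model. As a rough sketch (assumptions: positions are treated as independent, the base composition P(A), P(T), P(C), P(G) from the slide is reused as the background model with U standing in for T, and the function names are invented here), a candidate site can be scored like this:

```python
import math

# Position-specific frequencies from the slide: the last two exon positions
# followed by the first three intron positions.
SPLICE_MODEL = {
    "G": [0.11, 0.74, 1.00, 0.00, 0.29],
    "A": [0.64, 0.09, 0.00, 0.00, 0.61],
    "U": [0.13, 0.12, 0.00, 1.00, 0.07],
    "C": [0.11, 0.06, 0.00, 0.00, 0.02],
}
# Assumption: the composition model from the slide serves as background (U = T).
BACKGROUND = {"A": 0.23, "U": 0.24, "C": 0.26, "G": 0.27}


def site_probability(site: str) -> float:
    """Probability of a 5-base site under the position-specific model."""
    p = 1.0
    for position, base in enumerate(site):
        p *= SPLICE_MODEL[base][position]
    return p


def log_odds(site: str) -> float:
    """log2 of (model probability / background probability)."""
    p_model = site_probability(site)
    if p_model == 0.0:
        return float("-inf")
    p_background = math.prod(BACKGROUND[base] for base in site)
    return math.log2(p_model / p_background)


print(round(site_probability("AGGUA"), 6))  # consensus site
print(log_odds("AGCUA"))                    # impossible under the model
```

A site that violates the invariant GU at the start of the intron gets probability zero, which is why real implementations usually add pseudocounts before taking logarithms.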
Concepts in sequence analysis - Alignments

G C A G T A A G C T T G A T G
G C A G T A A - C T T T A T G
* * * * * * *   * * *   * * *

Why are sequence alignments important?

• Sequence assembly
• Prediction of function
• Protein family analysis
• Comparative genomics
• Phylogeny / Evolutionary history
Sequence assembly
Prediction of function

We have a ‘new’ sequence. Is it similar to a previously known sequence?

Align it to all previously known sequences. (Many of these have annotation, such as a description of function.)

[Figure: the new sequence, of unknown function (?), is compared with known sequences; the two outcomes shown are "similarity" and "no similarity"]
Prediction of function
Predicting the molecular basis of disease

BRCA1 gene - genetic factor in breast cancer

Cloning and sequencing of the gene revealed a protein remotely related to a yeast protein (Rad9) involved in cell cycle control.

RAD9_YEAST : GNVFDKCIFVLTS-LFENReELRQTIESQGGTVIeSGfstlfnfthplakslvnkgntdn
BRC1_HUMAN : ERVNKRMSMVVSGLTPEEFmLVYKFARKHHITLTnLI-----------------------

RAD9_YEAST : irelalklawkphslfaDCRFACLITKRHLrSLKYLET------LALGWPTLHWKFISAC
BRC1_HUMAN : -----------------TEETTHVVMKTDA-EFVCERTLKyflGIAGGKWVVSYFWVTQS

RAD9_YEAST : IEKKRIVPHLIYQY
BRC1_HUMAN : IKERKMLNEHDFEV

Protein family analysis

Analysis of the human genome reveals a large number of olfactory receptor proteins.
Comparative genomics - reveals biologically significant regions of the genome
Phylogeny / Evolutionary history

Human evolution
• Origin of man
• Closest primate relatives of man?
• Did modern humans originate in Africa?
• Relationship between human populations
• How are genes of medical interest distributed among human populations?

Evolution of viruses / microorganisms that cause human disease
• HIV (AIDS)
• H5N1 (Bird Flu)
Early days of sequence analysis
Margaret Dayhoff

Substitution matrix
Scoring of alignments

A   G   L   C   E
|   |   |   |   |
A   A   L   C   D
4 + 0 + 4 + 9 + 2 = 19
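The arithmetic on the slide is a sum of per-column substitution scores. Here is a small sketch of that idea, using only the five scores visible on the slide; a real substitution matrix covers all amino acid pairs, and the names `SLIDE_SCORES` and `score_alignment` are illustrative.

```python
# Only the substitution scores that appear on the slide; a full matrix
# would contain an entry for every pair of amino acids.
SLIDE_SCORES = {
    ("A", "A"): 4,
    ("G", "A"): 0,
    ("L", "L"): 4,
    ("C", "C"): 9,
    ("E", "D"): 2,
}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score (KeyError if the pair is unknown)."""
    if (a, b) in SLIDE_SCORES:
        return SLIDE_SCORES[(a, b)]
    return SLIDE_SCORES[(b, a)]


def score_alignment(top: str, bottom: str) -> int:
    """Sum the substitution scores of an ungapped alignment, column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(top, bottom))


print(score_alignment("AGLCE", "AALCD"))  # 19
```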
Frequently used methods in sequence analysis that are based on the principle of sequence alignment using dynamic programming

• BLAST - searches in databases for sequence similarity, p. 93-103
• ClustalW - multiple alignment of sequences, p. 89-93

BLAST

• FASTA, 1988: William Pearson
• BLAST, 1990: David Lipman, Stephen Altschul
Searching databases for sequence similarity - local alignment using Smith-Waterman method is too slow

    M A K L Q G A L G K R Y
M   *
A     *         *
K       *             *
I
Q           *
G             *     *
A     *         *
L         *       *
A     *         *
K       *             *
R                       *
Y                         *
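The dot matrix above marks every cell where the residue in the row equals the residue in the column. A minimal sketch that rebuilds it from the two sequences on the slide (the function names are illustrative):

```python
def dot_matrix(seq_x: str, seq_y: str) -> list[list[bool]]:
    """One row per residue of seq_y, one column per residue of seq_x."""
    return [[a == b for b in seq_x] for a in seq_y]


def render(seq_x: str, seq_y: str) -> str:
    """Draw the matrix as text, with '*' for identical residues."""
    lines = ["  " + " ".join(seq_x)]
    for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        cells = " ".join("*" if hit else " " for hit in row)
        lines.append((residue + " " + cells).rstrip())
    return "\n".join(lines)


print(render("MAKLQGALGKRY", "MAKIQGALAKRY"))
```

The matrix has one cell per residue pair, so filling it in costs time proportional to the product of the two sequence lengths. That cost is the reason an exhaustive method is too slow for searching a whole database.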
BLAST
Searching databases for sequence similarity - heuristics of BLAST

Improvement of speed as compared to local alignment algorithm:

• Initial search is for word hits.
• Word hits are then extended in either direction.

BLAST
First step in BLAST - obtaining a list of words based on the query sequence

Query sequence: MSGTWAMA ....

Words derived from query sequence: MSG, SGT, GTW, TWA .... etc.

Each word extracted from the query sequence is matched against words derived from the database sequence.
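The word-extraction step can be sketched as follows. This is an illustration of the idea only, not BLAST's actual implementation; the subject sequence `AAGTWAA` is made up, and the word size of 3 follows the slide's example.

```python
def words(sequence: str, w: int = 3) -> list[str]:
    """All overlapping words of length w, in order of occurrence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def word_hits(query: str, subject: str, w: int = 3) -> list[tuple[int, int]]:
    """Pairs (query offset, subject offset) where identical words occur."""
    index: dict[str, list[int]] = {}
    for pos, word in enumerate(words(subject, w)):
        index.setdefault(word, []).append(pos)
    return [(q, s)
            for q, word in enumerate(words(query, w))
            for s in index.get(word, [])]


print(words("MSGTWAMA")[:4])             # ['MSG', 'SGT', 'GTW', 'TWA']
print(word_hits("MSGTWAMA", "AAGTWAA"))  # [(2, 2), (3, 3)]
```

Indexing the subject words in a dictionary makes each lookup cheap, which is the kind of shortcut that lets a word-based search avoid filling in a full alignment matrix.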
BLAST
First step in BLAST - obtaining a list of words based on the query sequence - improving sensitivity by considering 'word neighbors'

Consider the word GTW:

compile a list of words scoring at least T with query word:

GTW (6+5+11=22)
GSW (6+1+11=18)
GNW (6+0+11=17)
GAW (6+0+11=17)
ATW (0+5+11=16)
DTW (-1+5+11=15)
GTF (6+5+1=12)
------------------ threshold T
GTM (6+5-1=10)
DAW (-1+0+11=10)

Exact matches to the words scoring at least T will be searched against the database sequence.
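The neighbor list can be generated by scoring every candidate word against the query word and keeping those that reach the threshold. The sketch below uses only the per-residue scores that appear in the slide's arithmetic, so it enumerates a small subset of the words a full substitution matrix would give. The threshold value 11 is an assumption; the slide does not state T.

```python
from itertools import product

T = 11  # assumption: the slide only shows that T lies between 10 and 12

# Score of each candidate residue against the query word G-T-W, position by
# position, copied from the sums on the slide (a partial substitution matrix).
POSITION_SCORES = [
    {"G": 6, "A": 0, "D": -1},         # aligned with G
    {"T": 5, "S": 1, "N": 0, "A": 0},  # aligned with T
    {"W": 11, "F": 1, "M": -1},        # aligned with W
]


def neighbors(threshold: int) -> list[tuple[str, int]]:
    """Candidate words scoring at least `threshold`, best first."""
    found = []
    for combo in product(*(scores.items() for scores in POSITION_SCORES)):
        word = "".join(residue for residue, _ in combo)
        score = sum(value for _, value in combo)
        if score >= threshold:
            found.append((word, score))
    return sorted(found, key=lambda item: (-item[1], item[0]))


for word, score in neighbors(T):
    print(word, score)
```

With this partial table the list also contains words such as ASW (0+1+11=12) that the slide does not show. The slide's list is presumably a selection rather than the complete neighborhood.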
BLAST
BLAST output

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq
         (75 letters)

Database: nr
           457,798 sequences; 140,871,481 total letters

Searching..................................................done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...   126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...   126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...    74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...    29  3.6
BLAST
Expect value (E)

Parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E value, or the closer it is to 0, the more "significant" the match is.

BLAST
High Scoring Pair (HSP)

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
          Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64
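Two of the numbers in this report can be recomputed from what is shown. The sketch below counts identities and gaps in the HSP and estimates an E value from the bit score with the usual relation E = m * n * 2^(-S'), where m is the query length and n the database length. It uses the raw lengths from the report header rather than the length-corrected values BLAST works with, so it gives only an order-of-magnitude estimate; the function names are illustrative.

```python
# The two aligned rows of the HSP, with both alignment blocks joined.
QUERY = "QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRMG"
SBJCT = "QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKMG"


def hsp_stats(query: str, sbjct: str) -> tuple[int, int, int]:
    """Return (identities, gaps, alignment length) for two aligned rows."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have the same length")
    identities = sum(q == s and q != "-" for q, s in zip(query, sbjct))
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, sbjct))
    return identities, gaps, len(query)


def expect_from_bits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E value from a bit score, using raw (uncorrected) lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)


print(hsp_stats(QUERY, SBJCT))                    # (33, 1, 61)
print(expect_from_bits(74.1, 75, 140_871_481))    # same order of magnitude as 1e-13
```

The estimate comes out a few times larger than the reported 1e-13, which is the direction one would expect if the program shortens the query and database lengths before computing E.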
BLAST
The different variants of BLAST

Program    Query     Database
blastp     Protein   Protein
blastn     DNA       DNA
tblastn    Protein   DNA
blastx     DNA       Protein
tblastx    DNA       DNA
BLAST
Using BLAST in a unix environment

blastall -i input_sequence -d database -p blast_version

bl2seq uses the BLAST algorithm but matches two sequences instead of matching one sequence against a database

bl2seq -i 1st_sequence -j 2nd_sequence -p blast_version
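As a small sketch of how the variant table and the command lines fit together, the helper below picks a program name from the sequence types and builds the argument list shown on the slide. It only constructs the command and does not run it. The helper names and the `translate_both` switch are hypothetical; only the `-i`, `-j`, `-d` and `-p` options from the slide are used.

```python
# Query type / database type -> program, as in the variants table.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("dna", "dna"): "blastn",
    ("protein", "dna"): "tblastn",
    ("dna", "protein"): "blastx",
}


def choose_program(query_type: str, database_type: str,
                   translate_both: bool = False) -> str:
    """Pick a BLAST variant; tblastx is the DNA/DNA search on translations."""
    key = (query_type.lower(), database_type.lower())
    if key == ("dna", "dna") and translate_both:
        return "tblastx"
    return PROGRAMS[key]


def blastall_command(input_file: str, database: str, program: str) -> list[str]:
    """Argument list for a database search, following the slide's syntax."""
    return ["blastall", "-i", input_file, "-d", database, "-p", program]


def bl2seq_command(first: str, second: str, program: str) -> list[str]:
    """Argument list for comparing two sequences with bl2seq."""
    return ["bl2seq", "-i", first, "-j", second, "-p", program]


program = choose_program("protein", "protein")
print(" ".join(blastall_command("ramp4.seq", "nr", program)))
```

Building the command as a list keeps file names with spaces intact if it is later handed to `subprocess.run`.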
Large scale alignments and further improvement of computational efficiency - BLAT
(http://genome.ucsc.edu/cgi-bin/hgBlat?command=start)
BLAST searches can reveal evolutionary relationships. DNA or protein sequences are homologous if they are related by divergence from a common ancestor.

Two kinds of homology:

Orthology: Sequences that diverged after a speciation event. Orthologous genes often have the same function in different species.

Paralogy: Sequences that diverged after a duplication event. Paralogous genes perform different but related functions within one organism.
Evolutionary relationships revealed by database searches

[Figure: speciation. Labels: Ancestral organism, Organism A, Organism B; gene X, X1, X2; "Time goes on ..."; Orthologs]

[Figure: gene duplication. Labels: gene X, Xa, Xb; "Time goes on ..."; Paralogs]

Example of orthology / paralogy relationships

Mouse trypsin       -- orthologs --  Human trypsin
      |                                    |
  paralogs                             paralogs
      |                                    |
Mouse chymotrypsin  -- orthologs --  Human chymotrypsin
ClustalW
• Construction of tree based on pairwise alignments
• Progressive alignment guided by tree.

[Figure: tree with leaves A, B, C, D, E]
Viruses - dependent on living cells for propagation
HIV
Introduction to the "Exercises with biological sequences -examining HIV genes and proteins"
EMBOSS programs in this practical

sixpack
plotorf
water - Smith-Waterman alignment
needle - Needleman-Wunsch alignment
dottup - dotplot analysis
Translation of a nucleotide sequence using ‘sixpack’

    M  A  K  R  K  L  K  K  N  L  K  T  F  V  A  F  S  A  I  T     F1
     W  Q  R  E  S  *  K  R  T  *  K  L  L  L  H  L  V  L  L  L    F2
      G  K  E  K  V  K  K  E  L  K  N  F  C  C  I  *  C  Y  Y  C   F3
  1 ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT 60
    ----:----|----:----|----:----|----:----|----:----|----:----|
  1 TACCGTTTCTCTTTCAATTTTTTCTTGAATTTTTGAAAACAACGTAAATCACGATAATGA 60
     X  A  F  L  F  N  F  F  F  K  F  V  K  T  A  N  L  A  I  V    F6
    X  P  L  S  F  T  L  F  S  S  L  F  K  Q  Q  M  *  H  *  *     F5
      H  C  L  S  L  *  F  L  V  *  F  S  K  N  C  K  T  S  N  S   F4

    A  L  L  L  T  N  G  I  P  I  S  A  L  T  Q  S  S  N  T  T     F1
     L  Y  C  *  L  M  V  F  Q  L  V  L  *  L  S  L  P  I  Q  L    F2
      F  I  V  N  *  W  Y  S  N  *  C  F  N  S  V  F  Q  Y  N  *   F3
 61 GCTTTATTGTTAACTAATGGTATTCCAATTAGTGCTTTAACTCAGTCTTCCAATACAACT 120
    ----:----|----:----|----:----|----:----|----:----|----:----|
 61 CGAAATAACAATTGATTACCATAAGGTTAATCACGAAATTGAGTCAGAAGGTTATGTTGA 120
     A  K  N  N  V  L  P  I  G  I  L  A  K  V  *  D  E  L  V  V    F6
    Q  K  I  T  L  *  H  Y  E  L  *  H  K  L  E  T  K  W  Y  L     F5
      S  *  Q  *  S  I  T  N  W  N  T  S  *  S  L  R  G  I  C  S   F4

    E  I  T  S  Q  A  T  T  G  L  R  N  V  M  Y  Y  G  D  W  S     F1
     R  L  L  H  K  L  L  Q  G  Y  V  M  *  C  I  M  V  T  G  L    F2
      D  Y  F  T  S  Y  Y  R  V  T  *  C  N  V  L  W  *  L  V  Y   F3
121 GAGATTACTTCACAAGCTACTACAGGGTTACGTAATGTAATGTATTATGGTGACTGGTCT 180
    ----:----|----:----|----:----|----:----|----:----|----:----|
121 CTCTAATGAAGTGTTCGATGATGTCCCAATGCATTACATTACATAATACCACTGACCAGA 180
     S  I  V  E  C  A  V  V  P  N  R  L  T  I  Y  *  P  S  Q  D    F6
    Q  S  *  K  V  L  *  *  L  T  V  Y  H  L  T  N  H  H  S  T     F5
      L  N  S  *  L  S  S  C  P  * T  I  Y  H  I  I  T  V  P  R    F4
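What sixpack does can be approximated in a few lines: translate the sequence at offsets 0, 1 and 2, then do the same for the reverse complement. This is a sketch under assumptions, not the EMBOSS implementation. The frame labels `+1` to `+3` and `-1` to `-3` are chosen here for illustration, and sixpack's own numbering of the reverse frames (F4 to F6) may be phased differently on a longer sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict[str, str]:
    """Three forward and three reverse-strand translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


first_60 = "ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT"
for label, protein in six_frames(first_60).items():
    print(label, protein)
```

For the first 60 nucleotides of the example, frame `+1` gives the F1 line of the sixpack output, and frame `-1` gives the F4 line read from right to left, because the reverse strand is translated in the opposite direction.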
Deviations from the standard genetic code

# Yeast mitochondria
UGA = Trp:W
CUU = Thr:T
CUC = Thr:T
CUA = Thr:T
CUG = Thr:T
AUA = Met:M

# Mammalian mitochondria
UGA = Trp:W
AUU = Ile:I
AUC = Ile:I
AUA = Met:M
AGA = * :*
AGG = * :*

# Drosophila mitochondria
UGA = Trp:W
AUU = Ile:I
AUA = Met:M
AGA = Ser:S
AGG = Ser:S

# Mycoplasma
UGA = Trp

# Ciliate protozoa
UAA = Gln:Q
UAG = Gln:Q
Plotorf to show open reading frames
(in this case ORF is defined as starting with AUG codon)

[Figure: plotorf output; annotated ORFs:]
Unnamed protein           416-1522
Ribosomal protein S16     1771-2019
tRNA methyltransferase    2617-3384
Ribosomal protein L19     3426-3773
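An ORF search with the definition used here (start at an ATG/AUG codon, end at the next in-frame stop codon) can be sketched as below. This is not plotorf's implementation: it scans only the three forward frames, reports coordinates that include the stop codon, and uses a made-up test sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 1) -> list[tuple[int, int]]:
    """Forward-strand ORFs as 1-based (start, end) pairs, stop codon included.

    An ORF starts at the first ATG after the previous stop in its frame and
    must contain at least `min_codons` codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)


print(find_orfs("CCATGAAATAGCC"))  # [(3, 11)]
```

Raising `min_codons` filters out the many short ORFs that occur by chance, which is usually necessary before the result resembles an annotated gene map.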
Gag
Gag-Pol fusion (5%)
Global alignment of mRNA sequence to genomic DNA sequence
Effect of gap parameters

[Figure: alignments of the mature, spliced mRNA to the genomic DNA, shown for different gap parameters]
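How gap parameters change a global alignment can be seen with a score-only Needleman-Wunsch recursion. The sketch below is a simplification: it uses a single linear gap penalty and toy match/mismatch scores, whereas needle takes separate gap opening and gap extension penalties. All parameter values and the short test sequences are illustrative.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for a[:i-1] aligned with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,            # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "ACGT"))          # 4
print(needleman_wunsch_score("ACGT", "AGT"))           # 1
print(needleman_wunsch_score("ACGT", "AGT", gap=-5))   # -2
```

When a spliced mRNA is aligned to genomic DNA, the introns have to appear as long gaps, so the alignment only looks sensible if extending a gap is cheap relative to mismatches. A linear penalty cannot express that distinction, which is one reason affine gap costs are used in practice.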
Dot plot analysis (dottup) reveals repeats
Introduction to the "Exercises with biological sequences - examining HIV genes and proteins" - important biological questions addressed.

BLAST

* Identifying orthologues and paralogues.
* Non-viral homologues to any HIV proteins?
* Are we able to identify a relationship between human HIV and the monkey SIV?

ClustalW

* How does HIV drug resistance develop?
* What is the origin of HIV - relationship to monkey SIV?
* Using a multiple alignment to compute a phylogenetic tree