![Page 1: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/1.jpg)
Sequence annotation. Cluster analysis.
Meelis [email protected]
Lecture 5
Bioinformatics MTAT.03.239
University of Tartu
2011 Fall
![Page 2: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/2.jpg)
Sequence Annotation: Highlighting information in DNA, RNA, Proteins
![Page 3: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/3.jpg)
What enables information in biological sequences?
Aperiodicity
• NaCl (table salt, sodium chloride): periodic crystal, no information
• DNA: aperiodic crystal, information
![Page 4: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/4.jpg)
E. Schrödinger 1944: Aperiodic crystals encode genetic information
• Erwin Schrödinger, “What is Life? The Physical Aspect of the Living Cell”, 1944
• In the book, Schrödinger introduced the idea of an "aperiodic crystal" that contained genetic information in its configuration of covalent chemical bonds
• Note that this is before discovery of the helical structure of DNA in 1953 by Watson & Crick
![Page 5: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/5.jpg)
Reading the information
• How is the sequence information read?
• How is it converted into working machinery?
• Why is it important to have A instead of T in some locus?
• Different nucleotides in DNA/RNA and different amino acids in proteins have different physical and chemical properties
![Page 6: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/6.jpg)
Chemical binding
Example: protein-DNA
Protein GCN4 binds on DNA at TGASTCA, (S is G or C)
![Page 7: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/7.jpg)
Transcription regulation: Information processing
DNA
![Page 8: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/8.jpg)
Sequence annotation
• How do the nucleotides in the genome contribute to the life of the cell?
• One of the main objectives in bioinformatics (and biology)
• Sequence annotation is the next major challenge for the Human Genome Project
• ENCyclopedia Of DNA Elements (ENCODE)
![Page 9: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/9.jpg)
Genome (DNA) annotation: UCSC database
Proteome (protein) annotation: UniProt database
Sequence annotation
![Page 10: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/10.jpg)
How to get annotations for the genome?
• Evidence from case studies (Biology)
• Large scale experiments (Biology with Bioinformatics support)
• Predictions based on existing annotations (Bioinformatics)
• Experimental verification of predictions (Biology)
![Page 11: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/11.jpg)
Finding TFBS - Transcription Factor Binding Sites
DNA
![Page 12: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/12.jpg)
Transcription Factor binding on DNA
Protein GCN4 binds on DNA at TGASTCA, (S is G or C)
![Page 13: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/13.jpg)
Substrings, k-mers, k-lets
• GCN4 prefers to bind on 7-mers TGACTCA and TGAGTCA
Terminology:
• 1-mer, monomer, nucleotide, base
• 2-mer, dimer, duplet, dinucleotide
• 3-mer, trimer, triplet, trinucleotide
• ...
• k-mer, k-let, substring/subsequence of length k
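The terminology above can be made concrete with a minimal sketch that lists all overlapping k-mers of a sequence. The function name `kmers` and the toy sequence `dna` are illustrative assumptions, not part of the lecture.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Hypothetical toy sequence, chosen so that both GCN4 7-mers occur in it.
dna = "ATGACTCATGAGTCAT"

# Count how often each 2-mer (dinucleotide) occurs.
dimer_counts = Counter(kmers(dna, 2))

# Positions (0-based) of the two 7-mers that GCN4 prefers.
gcn4_sites = [(i, kmer) for i, kmer in enumerate(kmers(dna, 7))
              if kmer in {"TGACTCA", "TGAGTCA"}]
```

A sequence of length n has n - k + 1 overlapping k-mers, which is why the list comprehension stops at `len(seq) - k`.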
![Page 14: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/14.jpg)
IUPAC ambiguity codes
![Page 15: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/15.jpg)
Transcription factor binding site examples
![Page 16: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/16.jpg)
Models for a DNA sequence
• Consensus sequence
• k-mer
• IUPAC k-mer
• RegExp - Regular expression
• e.g. CG[GT]N{5,10}CCG
• PWM - Position Weight Matrix
• HMM - Hidden Markov Model
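As an illustration of the IUPAC k-mer and regular expression models, the sketch below translates an IUPAC pattern into a Python regular expression and searches a sequence with it. The helper names (`iupac_to_regex`, `find_sites`) and the toy sequence are assumptions made for this example; only the forward strand is searched.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC k-mer such as TGASTCA into a regex string."""
    return "".join(IUPAC[ch] for ch in pattern)


def find_sites(pattern, dna):
    """Return (position, match) for every occurrence, overlaps included."""
    # The lookahead (?=...) lets matches overlap each other.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(dna)]


dna = "ATGACTCATGAGTCAT"  # hypothetical toy sequence
sites = find_sites("TGASTCA", dna)

# The slide's example CG[GT]N{5,10}CCG, with N spelled out for Python's re.
gapped = re.compile("CG[GT][ACGT]{5,10}CCG")
```

A plain k-mer is the special case with no ambiguity codes, and a regular expression adds what an IUPAC k-mer cannot express, such as the variable-length gap `{5,10}`.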
![Page 17: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/17.jpg)
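PWM - Position Weight Matrix
• TBP - TATA-binding protein
Binding TATAWAW where W is T or A

Count matrix, not PWM:

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 61 | 16 | 352 | 3 | 354 | 268 | 360 | 222 | 155 | 56 | 83 | 82 | 82 | 68 | 77 |
| C | 145 | 46 | 0 | 10 | 0 | 0 | 3 | 2 | 44 | 135 | 147 | 127 | 118 | 107 | 101 |
| G | 152 | 18 | 2 | 2 | 5 | 0 | 20 | 44 | 157 | 150 | 128 | 128 | 128 | 139 | 140 |
| T | 31 | 309 | 35 | 374 | 30 | 121 | 6 | 121 | 33 | 48 | 31 | 52 | 61 | 75 | 71 |
| SUM | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 |

[Figure: PWM logo]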
![Page 18: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/18.jpg)
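Count matrix to PWM (1)
• Count matrix

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 61 | 16 | 352 | 3 | 354 | 268 | 360 | 222 | 155 | 56 | 83 | 82 | 82 | 68 | 77 |
| C | 145 | 46 | 0 | 10 | 0 | 0 | 3 | 2 | 44 | 135 | 147 | 127 | 118 | 107 | 101 |
| G | 152 | 18 | 2 | 2 | 5 | 0 | 20 | 44 | 157 | 150 | 128 | 128 | 128 | 139 | 140 |
| T | 31 | 309 | 35 | 374 | 30 | 121 | 6 | 121 | 33 | 48 | 31 | 52 | 61 | 75 | 71 |
| SUM | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 | 389 |

In general, the count matrix is

$$
\begin{pmatrix}
c_{A1} & c_{A2} & \cdots & c_{An} \\
c_{C1} & c_{C2} & \cdots & c_{Cn} \\
c_{G1} & c_{G2} & \cdots & c_{Gn} \\
c_{T1} & c_{T2} & \cdots & c_{Tn}
\end{pmatrix}
$$

and for any $i = 1, 2, \ldots, n$

$$S = c_{Ai} + c_{Ci} + c_{Gi} + c_{Ti}$$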
![Page 19: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/19.jpg)
Count matrix to PWM (2)
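• Probability matrix

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | .16 | .05 | .87 | .02 | .88 | .67 | .89 | .56 | .39 | .15 | .22 | .21 | .21 | .18 | .20 |
| C | .37 | .12 | .01 | .04 | .01 | .01 | .02 | .02 | .12 | .34 | .37 | .32 | .30 | .27 | .26 |
| G | .38 | .06 | .02 | .02 | .02 | .01 | .06 | .12 | .40 | .38 | .33 | .33 | .33 | .35 | .35 |
| T | .09 | .77 | .10 | .93 | .09 | .31 | .03 | .31 | .09 | .13 | .09 | .14 | .16 | .20 | .19 |
| SUM | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

In general, the probability matrix is

$$
\begin{pmatrix}
p_{A1} & p_{A2} & \cdots & p_{An} \\
p_{C1} & p_{C2} & \cdots & p_{Cn} \\
p_{G1} & p_{G2} & \cdots & p_{Gn} \\
p_{T1} & p_{T2} & \cdots & p_{Tn}
\end{pmatrix}
$$

where, for any $i = 1, 2, \ldots, n$ and $X = A, C, G, T$,

$$p_{Xi} = \frac{c_{Xi} + b_X \sqrt{S}}{S + \sqrt{S}}$$

and $b_A, b_C, b_G, b_T$ are the background probabilities of the four nucleotides.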
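The count-to-probability step can be sketched as follows, using the first six columns of the TBP count matrix. The uniform background of 0.25 per nucleotide is an assumption (the slides do not state the values of b), although it reproduces the probabilities shown on the slide; the function name is illustrative.

```python
import math

# First six columns of the TBP count matrix from the slides.
COUNTS = {
    "A": [61, 16, 352, 3, 354, 268],
    "C": [145, 46, 0, 10, 0, 0],
    "G": [152, 18, 2, 2, 5, 0],
    "T": [31, 309, 35, 374, 30, 121],
}

# Assumption: uniform background probabilities b_X = 0.25.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def probability_matrix(counts, background):
    """Apply p_Xi = (c_Xi + b_X * sqrt(S)) / (S + sqrt(S)) column by column."""
    n = len(counts["A"])
    probs = {x: [] for x in counts}
    for i in range(n):
        s = sum(counts[x][i] for x in counts)  # column sum S
        root = math.sqrt(s)
        for x in counts:
            probs[x].append((counts[x][i] + background[x] * root) / (s + root))
    return probs


PROBS = probability_matrix(COUNTS, BACKGROUND)
```

The pseudocount term keeps every probability strictly positive, so a nucleotide that was never observed at a position (count 0) does not get probability 0.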
![Page 20: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/20.jpg)
Count matrix to PWM (3)
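• Position weight matrix

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | -0.6 | -2.3 | +1.8 | -3.7 | +1.8 | +1.4 | +1.8 | +1.2 | +0.6 | -0.7 | -0.2 | -0.2 | -0.2 | -0.5 | -0.3 |
| C | +0.6 | -1.0 | -4.4 | -2.8 | -4.4 | -4.4 | -3.7 | -3.9 | -1.1 | +0.5 | +0.6 | +0.4 | +0.3 | +0.1 | +0.1 |
| G | +0.6 | -2.2 | -3.9 | -3.9 | -3.4 | -4.4 | -2.0 | -1.1 | +0.7 | +0.6 | +0.4 | +0.4 | +0.4 | +0.5 | +0.5 |
| T | -1.5 | +1.6 | -1.4 | +1.9 | -1.5 | +0.3 | -3.2 | +0.3 | -1.4 | -0.9 | -1.5 | -0.8 | -0.6 | -0.4 | -0.4 |

In general, the position weight matrix is

$$
\begin{pmatrix}
w_{A1} & w_{A2} & \cdots & w_{An} \\
w_{C1} & w_{C2} & \cdots & w_{Cn} \\
w_{G1} & w_{G2} & \cdots & w_{Gn} \\
w_{T1} & w_{T2} & \cdots & w_{Tn}
\end{pmatrix}
$$

where, for any $i = 1, 2, \ldots, n$ and $X = A, C, G, T$,

$$w_{Xi} = \log_2 \frac{p_{Xi}}{b_X}$$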
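A minimal sketch of the full conversion from counts to log-odds weights, again on the first six columns of the TBP matrix. The uniform background of 0.25 is an assumption that happens to reproduce the slide's weights; `log_odds_pwm` is an illustrative name.

```python
import math

# First six columns of the TBP count matrix from the slides.
COUNTS = {
    "A": [61, 16, 352, 3, 354, 268],
    "C": [145, 46, 0, 10, 0, 0],
    "G": [152, 18, 2, 2, 5, 0],
    "T": [31, 309, 35, 374, 30, 121],
}

BACKGROUND = 0.25  # assumption: the same background for all four nucleotides


def log_odds_pwm(counts, b=BACKGROUND):
    """Compute w_Xi = log2(p_Xi / b) with p_Xi = (c_Xi + b*sqrt(S)) / (S + sqrt(S))."""
    bases = "ACGT"
    n = len(counts["A"])
    pwm = {x: [] for x in bases}
    for i in range(n):
        s = sum(counts[x][i] for x in bases)  # column sum S
        root = math.sqrt(s)
        for x in bases:
            p = (counts[x][i] + b * root) / (s + root)
            pwm[x].append(math.log2(p / b))
    return pwm


PWM = log_odds_pwm(COUNTS)
```

A positive weight means the nucleotide is more frequent at that position than expected from the background, a negative weight means it is rarer.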
![Page 21: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/21.jpg)
Matching a PWM on DNA
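• Position weight matrix

| Position | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| A | -0.6 | -2.3 | +1.8 | -3.7 | +1.8 | +1.4 |
| C | +0.6 | -1.0 | -4.4 | -2.8 | -4.4 | -4.4 |
| G | +0.6 | -2.2 | -3.9 | -3.9 | -3.4 | -4.4 |
| T | -1.5 | +1.6 | -1.4 | +1.9 | -1.5 | +0.3 |

| Sequence: | C | C | T | T | A | T | Total |
|---|---|---|---|---|---|---|---|
| Score: | +0.6 | -1.0 | -1.4 | +1.9 | +1.8 | +0.3 | = +2.2 |

A PWM of length k gives a score to each k-mer in the DNA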
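The scoring above can be sketched as a sliding-window scan. The weights are the six columns shown on the slide; the sequence `dna` and the function names are hypothetical, and only the forward strand is scanned.

```python
# Six-column position weight matrix from the slide.
PWM = {
    "A": [-0.6, -2.3, 1.8, -3.7, 1.8, 1.4],
    "C": [0.6, -1.0, -4.4, -2.8, -4.4, -4.4],
    "G": [0.6, -2.2, -3.9, -3.9, -3.4, -4.4],
    "T": [-1.5, 1.6, -1.4, 1.9, -1.5, 0.3],
}


def score(kmer, pwm=PWM):
    """Sum the weight of each nucleotide at its position in the k-mer."""
    return sum(pwm[base][i] for i, base in enumerate(kmer))


def scan(dna, pwm=PWM):
    """Score every k-mer of the sequence; returns (position, k-mer, score)."""
    k = len(pwm["A"])
    return [(i, dna[i:i + k], score(dna[i:i + k], pwm))
            for i in range(len(dna) - k + 1)]


dna = "GGCCTTATAG"  # hypothetical toy sequence containing CCTTAT
hits = scan(dna)
best = max(hits, key=lambda hit: hit[2])  # highest-scoring window
```

In practice a score threshold decides which windows count as matches; how to choose it is not covered by this sketch.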
![Page 22: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/22.jpg)
Information content in PWM
• Information content (IC): how different a position in the PWM is from the uniform distribution

$$\mathrm{IC}_i = \log_2 4 - \sum_{X = A, C, G, T} \left( -p_{Xi} \log_2 p_{Xi} \right)$$

[Plot: $\mathrm{IC}_i$ against position $i$]
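A short sketch of the information content of a single PWM column, given the four nucleotide probabilities at that position. The convention that a zero probability contributes nothing to the sum (0 · log 0 = 0) is the usual one but is an assumption here, as is the function name.

```python
import math


def information_content(column):
    """IC = log2(4) - entropy of the column, in bits (0 = uniform, 2 = one base only)."""
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return math.log2(4) - entropy


uniform = information_content([0.25, 0.25, 0.25, 0.25])   # no preference
two_bases = information_content([0.5, 0.5, 0.0, 0.0])     # e.g. W = A or T
one_base = information_content([1.0, 0.0, 0.0, 0.0])      # fully conserved
```

The three example columns cover the range of values: a uniform column carries no information, and a fully conserved position carries the maximum of 2 bits.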
![Page 23: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/23.jpg)
HMMs - Hidden Markov Models
Observations: Sunny, Rainy, Sunny, Sunny, Rainy, Snowy, Snowy, ...
Hidden states: Summer, Fall, Fall, Fall, Fall, Winter, Winter, ...
http://webdocs.cs.ualberta.ca/~colinc/cmput606/606FinalPres.ppt
![Page 24: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/24.jpg)
Tasks with HMMs
Scoring task:
• Given an existing HMM and observed sequence, what is the probability that the HMM generates the sequence
Alignment task:
• Given a sequence, what is the optimal state sequence that the HMM would use to generate it
Training task:
• Given a large amount of data how can we estimate the structure and the parameters of the HMM that best accounts for the data
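The scoring and alignment tasks are commonly solved with the forward algorithm and the Viterbi algorithm, respectively. The sketch below applies both to a toy seasons/weather HMM; all probabilities are made up for illustration and are not from the lecture. The training task (for example with the Baum-Welch algorithm) is not shown.

```python
STATES = ["Summer", "Fall", "Winter"]

# All numbers below are hypothetical, chosen only to make the example run.
START = {"Summer": 0.4, "Fall": 0.3, "Winter": 0.3}
TRANS = {
    "Summer": {"Summer": 0.8, "Fall": 0.2, "Winter": 0.0},
    "Fall": {"Summer": 0.0, "Fall": 0.8, "Winter": 0.2},
    "Winter": {"Summer": 0.2, "Fall": 0.0, "Winter": 0.8},
}
EMIT = {
    "Summer": {"Sunny": 0.8, "Rainy": 0.2, "Snowy": 0.0},
    "Fall": {"Sunny": 0.3, "Rainy": 0.6, "Snowy": 0.1},
    "Winter": {"Sunny": 0.2, "Rainy": 0.1, "Snowy": 0.7},
}


def forward(obs):
    """Scoring task: probability that the HMM generates the observations."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s: EMIT[s][o] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())


def viterbi(obs):
    """Alignment task: most probable hidden state path and its probability."""
    best = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for o in obs[1:]:
        new = {}
        for s in STATES:
            # Best predecessor state for s at this step.
            prob, path = max((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES)
            new[s] = (prob * EMIT[s][o], path + [s])
        best = new
    return max(best.values())


observations = ["Sunny", "Rainy", "Sunny", "Sunny", "Rainy", "Snowy", "Snowy"]
likelihood = forward(observations)
path_prob, path = viterbi(observations)
```

The forward algorithm sums over all hidden paths, whereas Viterbi keeps only the best one, so the Viterbi path probability can never exceed the forward probability. For long sequences, real implementations work with logarithms to avoid numerical underflow.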
![Page 25: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/25.jpg)
Hidden Markov Models in Sequence Annotation
Task: represent the following sequences in a model

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

• IUPAC consensus: ACANNNATC
• Regular expression: ACA[ACGT]{0,3}ATC
• PWM
• HMM

http://www.evl.uic.edu/shalini/coursework/hmm.ppt
![Page 26: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/26.jpg)
HMMs in gene finding
• S - gene start
• D - donor splice site
• E - exon
• I - intron
• A - acceptor splice site
• T - gene termination
![Page 27: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/27.jpg)
HMMs in proteome annotation
HMM logo of hEGF domain from .
![Page 28: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/28.jpg)
Common tasks in Genome annotation
• building a model for some DNA feature (or often just learning the parameters)
• searching for sites which match a model (or match it well enough)
• statistical analysis about what features are over-represented in some region of DNA
![Page 29: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/29.jpg)
Common annotated features
• Genes, Transcripts, Exons, Introns
• TFBSs
• CpG islands
• Repeats
• and many more
![Page 30: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/30.jpg)
Clustering
![Page 31: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/31.jpg)
What is Cluster Analysis?
Finding groups of objects such that the objects are:
• similar (or related) to the objects in the same group and
• different from (or unrelated to) the objects in other groups
Short distance
Long distance
![Page 32: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/32.jpg)
Why cluster biological data?
• Intuition building
• Hypothesis generation
• Summarizing / compressing large data
![Page 33: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/33.jpg)
Partitional vs Hierarchical
Partitional clustering finds a fixed number of clusters
Hierarchical clustering creates a series of nested clusterings, each contained in the next
![Page 34: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/34.jpg)
Fuzzy vs Non-Fuzzy
Fuzzy: each object belongs to each cluster with some weight (the weight can be zero)
Non-fuzzy: each object belongs to exactly one cluster
![Page 35: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/35.jpg)
Hierarchical clustering
Hierarchical clustering is usually depicted as a dendrogram (tree)
![Page 36: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/36.jpg)
Hierarchical clustering
• Each subtree corresponds to a cluster
• Height of branching shows distance
![Page 37: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/37.jpg)
Hierarchical clustering (0)
Algorithm for Agglomerative Hierarchical Clustering:
Join the two closest objects
![Page 38: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/38.jpg)
Hierarchical clustering (1)
Join the two closest objects
![Page 39: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/39.jpg)
Hierarchical clustering (2)
Keep joining the closest pairs
![Page 40: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/40.jpg)
Hierarchical clustering (3)
Keep joining the closest pairs
![Page 41: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/41.jpg)
Hierarchical clustering (4)
Keep joining the closest pairs
![Page 42: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/42.jpg)
Hierarchical clustering (5)
Keep joining the closest pairs
![Page 43: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/43.jpg)
Hierarchical clustering (10)
After 10 steps we have 4 clusters left
![Page 44: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/44.jpg)
Hierarchical clustering (10)
Several ways to measure distance between clusters:
• Single linkage (MIN)
![Page 45: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/45.jpg)
Hierarchical clustering (10)
Several ways to measure distance between clusters:
• Single linkage (MIN)
• Complete linkage (MAX)
![Page 46: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/46.jpg)
Hierarchical clustering (10)
Several ways to measure distance between clusters:
• Single linkage (MIN)
• Complete linkage (MAX)
• Average linkage
• Weighted
• Unweighted
• ...
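The agglomerative procedure and the linkage choices can be sketched in a few lines. This is a naive illustration with hypothetical 2-D points and function names, not an efficient implementation; the "average" option here averages over all pairs of points, which corresponds to the unweighted variant.

```python
import math
from itertools import combinations


def linkage_distance(a, b, method):
    """Distance between clusters a and b (lists of points) under a linkage."""
    dists = [math.dist(p, q) for p in a for q in b]
    if method == "single":      # MIN: closest pair of points
        return min(dists)
    if method == "complete":    # MAX: farthest pair of points
        return max(dists)
    if method == "average":     # mean over all pairs (unweighted average)
        return sum(dists) / len(dists)
    raise ValueError("unknown linkage: " + method)


def agglomerate(points, n_clusters, method="single"):
    """Keep joining the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with every object on its own
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]  # hypothetical toy data
three_clusters = agglomerate(points, 3, method="single")
```

Stopping at `n_clusters` corresponds to cutting the dendrogram at one level; running down to a single cluster and recording each join would give the full tree.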
![Page 47: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/47.jpg)
Hierarchical clustering (11)
In this example and at this stage we have the same result as in partitional clustering
![Page 48: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/48.jpg)
Hierarchical clustering (12)
In the final step the two remaining clusters are joined into a single cluster
![Page 49: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/49.jpg)
Hierarchical clustering (13)
In the final step the two remaining clusters are joined into a single cluster
![Page 50: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/50.jpg)
Examples of Hierarchical Clustering in Bioinformatics
Phylogeny
Gene expression clustering
![Page 51: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/51.jpg)
K-means clustering
• Partitional, non-fuzzy
• Partitions the data into K clusters
• K is given by the user
Algorithm:
• Choose K initial centers for the clusters
• Assign each object to its closest center
• Recalculate cluster centers
• Repeat the assignment and recalculation steps until the clusters no longer change (convergence)
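The K-means steps can be sketched as below. The initial centers are passed in explicitly rather than chosen by the function, the data points are hypothetical, and an empty cluster simply keeps its old center, which is one possible convention among several.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center recalculation until nothing changes."""
    clusters = []
    for _ in range(max_iter):
        # Assign each object to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recalculate each center as the coordinate-wise mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 2), (10, 0), (10, 2)]  # hypothetical toy data
centers, clusters = kmeans(points, centers=[(0, 0), (10, 0)])
```

Because the result depends on the initial centers, it is common to run the algorithm several times from different starting points and keep the best clustering.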
![Page 52: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/52.jpg)
K-means (1)
![Page 53: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/53.jpg)
K-means (2)
![Page 54: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/54.jpg)
K-means (3)
![Page 55: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/55.jpg)
K-means (4)
![Page 56: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/56.jpg)
K-means (5)
![Page 57: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/57.jpg)
K-means (6)
![Page 58: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/58.jpg)
K-means clustering
• One of the fastest clustering algorithms
• Therefore very widely used
• Sensitive to the choice of initial centres
• many algorithms to choose initial centres cleverly
• Assumes that the mean can be calculated
• can be used on vector data
• cannot be used on sequences (what is the mean of A and T?)
![Page 59: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/59.jpg)
Distance measures

Distance of vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$
• Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Correlation distance: $d(x, y) = 1 - r(x, y)$, where $r(x, y)$ is the Pearson correlation coefficient

Distance of sequences ACCTTG and TACCTG
• Hamming distance: ACCTTG vs TACCTG => 3
• Levenshtein distance: .ACCTTG vs TACC.TG => 2
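The distance measures listed above can be written out directly. This is a plain-Python sketch with illustrative function names; the Levenshtein distance uses the standard dynamic programming recurrence with unit costs for insertion, deletion and substitution.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation_distance(x, y):
    """1 - r(x, y), where r is the Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)


def hamming(s, t):
    """Number of mismatching positions; defined for equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]
```

On the slide's example, `hamming("ACCTTG", "TACCTG")` counts three mismatching positions, while `levenshtein` needs only two edits because it may shift the sequence with an insertion and a deletion.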
![Page 60: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/60.jpg)
K-medoids clustering
• The same as K-means, except that the center is required to be at an object
• Medoid - an object which has minimal total distance to all other objects in its cluster
• Can be used on more complex data, with any distance measure
• Slower than K-means
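Because K-medoids only needs pairwise distances, it can cluster sequences directly. The sketch below uses a simple alternating scheme (assign objects, then pick the best medoid per cluster) with Hamming distance; it is one straightforward variant rather than a specific published algorithm such as PAM, and the sequences and initial medoids are hypothetical.

```python
def hamming(s, t):
    """Number of mismatching positions between equal-length sequences."""
    return sum(a != b for a, b in zip(s, t))


def kmedoids(objects, medoids, dist, max_iter=100):
    """Like K-means, but each cluster center must be one of the objects."""
    clusters = []
    for _ in range(max_iter):
        # Assign each object to its closest medoid.
        clusters = [[] for _ in medoids]
        for obj in objects:
            nearest = min(range(len(medoids)), key=lambda m: dist(obj, medoids[m]))
            clusters[nearest].append(obj)
        # New medoid: the object with minimal total distance within its cluster.
        new_medoids = [
            min(cluster, key=lambda cand: sum(dist(cand, o) for o in cluster))
            if cluster else medoids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids, clusters


seqs = ["AAAA", "AAAT", "AATT", "CCCC", "CCCG", "CCGG"]  # hypothetical toy data
medoids, clusters = kmedoids(seqs, ["AAAA", "CCCC"], hamming)
```

Choosing a medoid compares every candidate against every other member of its cluster, which is quadratic in the cluster size and is the main reason the method is slower than K-means.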
![Page 61: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/61.jpg)
K-medoids (1)
![Page 62: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/62.jpg)
K-medoids (2)
![Page 63: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/63.jpg)
K-medoids (3)
![Page 64: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/64.jpg)
K-medoids (4)
![Page 65: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/65.jpg)
K-medoids (5)
![Page 66: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/66.jpg)
K-medoids (6)
![Page 67: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/67.jpg)
K-medoids (7)
![Page 68: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/68.jpg)
K-medoids (8)
![Page 69: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/69.jpg)
K-medoids (9)
![Page 70: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/70.jpg)
Examples of K-means and K-medoids in Bioinformatics
Gene expression clustering
Sequence clustering
![Page 71: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/71.jpg)
Applications of sequence clustering
• Clustering genes and proteins into families
• UniRef
• Pfam
• Clustering transcription factor binding motifs into families
• Jaspar
![Page 72: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/72.jpg)
MCL - Markov CLuster
• A method for clustering a graph by flow simulation
• Based on random walks on graphs: a random walk infrequently goes from one cluster to another
• For example, used for clustering proteins by structural similarity
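The flow simulation behind MCL alternates two matrix operations on the random-walk transition matrix: expansion (taking a matrix power, which spreads flow along longer walks) and inflation (raising entries to a power and renormalizing, which strengthens strong flows and weakens weak ones). The sketch below is a minimal pure-Python illustration on a hypothetical toy graph; the parameter values, the fixed iteration count and the way clusters are read off the final matrix are simplifying assumptions, not the reference implementation.

```python
def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=30):
    n = len(adjacency)
    # Add self-loops, then turn the graph into random-walk probabilities.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: two steps of the random walk (matrix squared).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: entry-wise power, then renormalize the columns.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Nodes whose columns send their flow to the same rows share a cluster.
    clusters = {}
    for j in range(n):
        support = frozenset(i for i in range(n) if m[i][j] > 1e-6)
        clusters.setdefault(support, []).append(j)
    return sorted(clusters.values())


# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
graph = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    graph[a][b] = graph[b][a] = 1.0

clusters = mcl(graph)
```

A larger inflation value tends to produce more, smaller clusters. For real data the matrices are large and sparse, so a dense list-of-lists implementation like this one is only suitable for illustration.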
![Page 73: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,](https://reader034.vdocuments.mx/reader034/viewer/2022050201/5f54b151f5871c673137d2c3/html5/thumbnails/73.jpg)
Summary of Clustering
• Aims: intuition, hypothesis generation, summarization
• Types:
• Hierarchical/Partitional
• Fuzzy/Non-Fuzzy
• Vector-based/Distance-based
• etc.
• Distance measures
• Euclidean, Manhattan, Correlation
• Hamming, Levenshtein
• etc.
• Applications:
• Clustering genes, sequences, organisms, etc.