Sequence analysis. Cluster analysis. Meelis Kull, 2010-09-30


  • Sequence analysis. Cluster analysis.

    Meelis Kull meelis.kull@ut.ee

    Lecture 5 Bioinformatics MTAT.03.239

    University of Tartu 2010 Fall


  • Sequence Analysis: DNA, RNA, Protein

  • Aperiodic crystals store information

    • Erwin Schrödinger “What is Life? The Physical Aspect of the Living Cell” 1944

    • In the book, Schrödinger introduced the idea of an "aperiodic crystal" that contained genetic information in its configuration of covalent chemical bonds

    • Note that this was before the discovery of the helical structure of DNA by Watson & Crick in 1953

    http://en.wikipedia.org/wiki/Chemical_bond

  • Sequence Information

    • How is the sequence information used and read? How is it converted into working machinery? Why is it important to have A instead of T at some locus?

    • Different nucleotides in DNA/RNA and amino acids in proteins have different physical (and chemical) properties

  • Chemical binding example: protein-DNA

    Protein GCN4 binds to DNA at TGASTCA (S is G or C)

  • Transcription regulation: Information processing

    DNA

  • Sequence annotation

    • How do the nucleotides in the genome contribute to the life of the cell?

    • One of the main objectives in bioinformatics (and biology)

    • Sequence annotation is the next major challenge for the Human Genome Project

    • ENCyclopedia Of DNA Elements (ENCODE)

  • Sequence annotation

  • How to get annotations for the genome?

    • Evidence from case studies (Biology)

    • Large scale experiments (Biology with Bioinformatics support)

    • Predictions based on existing annotations (Bioinformatics)

    • Experimental verification of predictions (Biology)

  • Finding TFBS - Transcription Factor Binding Sites

    DNA

  • Transcription Factor binding on DNA

    Protein GCN4 binds to DNA at TGASTCA (S is G or C)

  • Substrings, k-mers, k-lets

    • GCN4 prefers to bind on 7-mers TGACTCA and TGAGTCA

    Terminology:

    • 1-mer, monomer, nucleotide, base

    • 2-mer, dimer, duplet, dinucleotide

    • 3-mer, trimer, triplet, trinucleotide

    • ...

    • k-mer, k-let, substring/subsequence of length k
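As a small illustration of the terminology, the sketch below lists the overlapping k-mers of a sequence and counts the two 7-mers that GCN4 prefers. The function name `kmers` and the example sequence are made up for this illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Hypothetical example sequence containing both preferred GCN4 7-mers.
dna = "ATGACTCATTGAGTCAG"
counts = Counter(kmers(dna, 7))

for site in ("TGACTCA", "TGAGTCA"):
    print(site, counts[site])
```

A sequence of length L contains L - k + 1 overlapping k-mers, so the 17-base example yields eleven 7-mers.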

  • IUPAC ambiguity codes

  • Transcription factor binding site examples

  • Models for a DNA sequence

    • Consensus sequence

    • k-mer

    • IUPAC k-mer

    • RegExp - Regular expression, e.g. CG[GT]N{5,10}CCG

    • PWM - Position Weight Matrix

    • HMM - Hidden Markov Model
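The IUPAC k-mer and regular-expression models can be tried out with Python's standard `re` module. The sketch below assumes the standard IUPAC nucleotide codes; the helper `iupac_to_regex` and the test sequences are hypothetical and only illustrate the idea.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Replace each IUPAC letter by its character class.

    Other characters (brackets, {m,n} quantifiers) pass through unchanged.
    Limitation: ambiguity codes inside [...] are not supported.
    """
    return "".join(IUPAC.get(ch, ch) for ch in pattern)


# IUPAC k-mer model: the GCN4 site TGASTCA.
gcn4 = re.compile(iupac_to_regex("TGASTCA"))
print([m.group() for m in gcn4.finditer("ATGACTCATTGAGTCAG")])

# Regular-expression model from the slide: N{5,10} is a variable-length spacer.
spaced = re.compile(iupac_to_regex("CG[GT]N{5,10}CCG"))
print(bool(spaced.fullmatch("CGGAAAAACCG")))
```

Translating IUPAC letters into character classes keeps the model readable for biologists while reusing an existing regex engine for the search.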

  • PWM - Position Weight Matrix

    • TBP - TATA-binding protein, binding TATAWAW where W is T or A

    Count matrix (not a PWM):

    A   [  61  16 352   3 354 268 360 222 155  56  83  82  82  68  77 ]
    C   [ 145  46   0  10   0   0   3   2  44 135 147 127 118 107 101 ]
    G   [ 152  18   2   2   5   0  20  44 157 150 128 128 128 139 140 ]
    T   [  31 309  35 374  30 121   6 121  33  48  31  52  61  75  71 ]
    -------------------------------------------------------------------
    SUM   389 389 389 389 389 389 389 389 389 389 389 389 389 389 389

    Figure: PWM logo

  • Count matrix to PWM (1)

    • Count matrix

    A   [  61  16 352   3 354 268 360 222 155  56  83  82  82  68  77 ]
    C   [ 145  46   0  10   0   0   3   2  44 135 147 127 118 107 101 ]
    G   [ 152  18   2   2   5   0  20  44 157 150 128 128 128 139 140 ]
    T   [  31 309  35 374  30 121   6 121  33  48  31  52  61  75  71 ]
    -------------------------------------------------------------------
    SUM   389 389 389 389 389 389 389 389 389 389 389 389 389 389 389

    $$
    \begin{pmatrix}
    c_{A1} & c_{A2} & \cdots & c_{An} \\
    c_{C1} & c_{C2} & \cdots & c_{Cn} \\
    c_{G1} & c_{G2} & \cdots & c_{Gn} \\
    c_{T1} & c_{T2} & \cdots & c_{Tn}
    \end{pmatrix}
    $$

    $$ S = c_{Ai} + c_{Ci} + c_{Gi} + c_{Ti} \quad \text{for any } i = 1, 2, \dots, n $$

  • Count matrix to PWM (2)

    • Probability matrix

    A   [ .16 .05 .87 .02 .88 .67 .89 .56 .39 .15 .22 .21 .21 .18 .20 ]
    C   [ .37 .12 .01 .04 .01 .01 .02 .02 .12 .34 .37 .32 .30 .27 .26 ]
    G   [ .38 .06 .02 .02 .02 .01 .06 .12 .40 .38 .33 .33 .33 .35 .35 ]
    T   [ .09 .77 .10 .93 .09 .31 .03 .31 .09 .13 .09 .14 .16 .20 .19 ]
    -------------------------------------------------------------------
    SUM   1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

    $$
    \begin{pmatrix}
    p_{A1} & p_{A2} & \cdots & p_{An} \\
    p_{C1} & p_{C2} & \cdots & p_{Cn} \\
    p_{G1} & p_{G2} & \cdots & p_{Gn} \\
    p_{T1} & p_{T2} & \cdots & p_{Tn}
    \end{pmatrix}
    $$

    $$ p_{Xi} = \frac{c_{Xi} + b_X \sqrt{S}}{S + \sqrt{S}} \quad \text{for any } i = 1, 2, \dots, n \text{ and } X = A, C, G, T $$

    Here $(b_A, b_C, b_G, b_T)$ is the vector of background nucleotide frequencies, and $\sqrt{S}$ acts as the total pseudocount added to each column.
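The conversion from counts to probabilities can be checked numerically. The sketch below uses the first six columns of the TBP count matrix from the slide and assumes a uniform background (b = 0.25 for every nucleotide); that assumption is not stated on the slide, but it reproduces the printed probabilities. The function name `probability_matrix` is hypothetical.

```python
from math import sqrt

# First six columns of the TBP count matrix from the slide.
counts = {
    "A": [61, 16, 352, 3, 354, 268],
    "C": [145, 46, 0, 10, 0, 0],
    "G": [152, 18, 2, 2, 5, 0],
    "T": [31, 309, 35, 374, 30, 121],
}

# Assumption: uniform background frequencies.
background = {base: 0.25 for base in "ACGT"}


def probability_matrix(counts, background):
    """Apply p_Xi = (c_Xi + b_X * sqrt(S)) / (S + sqrt(S)) column by column."""
    n = len(counts["A"])
    probs = {base: [] for base in counts}
    for i in range(n):
        s = sum(counts[base][i] for base in counts)  # column total S
        for base in counts:
            p = (counts[base][i] + background[base] * sqrt(s)) / (s + sqrt(s))
            probs[base].append(p)
    return probs


probs = probability_matrix(counts, background)
for base in "ACGT":
    print(base, [round(p, 2) for p in probs[base]])
```

Because the pseudocount is spread according to the background, zero counts (such as C in the third column) get a small non-zero probability, which keeps the later logarithm finite.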

  • Count matrix to PWM (3)

    • Position weight matrix

    A   [ -0.6 -2.3 +1.8 -3.7 +1.8 +1.4 +1.8 +1.2 +0.6 -0.7 -0.2 -0.2 -0.2 -0.5 -0.3 ]
    C   [ +0.6 -1.0 -4.4 -2.8 -4.4 -4.4 -3.7 -3.9 -1.1 +0.5 +0.6 +0.4 +0.3 +0.1 +0.1 ]
    G   [ +0.6 -2.2 -3.9 -3.9 -3.4 -4.4 -2.0 -1.1 +0.7 +0.6 +0.4 +0.4 +0.4 +0.5 +0.5 ]
    T   [ -1.5 +1.6 -1.4 +1.9 -1.5 +0.3 -3.2 +0.3 -1.4 -0.9 -1.5 -0.8 -0.6 -0.4 -0.4 ]

    $$
    \begin{pmatrix}
    w_{A1} & w_{A2} & \cdots & w_{An} \\
    w_{C1} & w_{C2} & \cdots & w_{Cn} \\
    w_{G1} & w_{G2} & \cdots & w_{Gn} \\
    w_{T1} & w_{T2} & \cdots & w_{Tn}
    \end{pmatrix}
    $$

    $$ w_{Xi} = \log_2 \frac{p_{Xi}}{b_X} \quad \text{for any } i = 1, 2, \dots, n \text{ and } X = A, C, G, T $$
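To see the log-odds step in numbers, the sketch below computes the weights for the first column of the TBP matrix directly from the counts. It assumes a uniform background of 0.25 per nucleotide (an assumption, although it reproduces the values on the slide); the function name `weight` is hypothetical.

```python
from math import log2, sqrt


def weight(count, column_total, b=0.25):
    """Log-odds weight w = log2(p / b), with p smoothed by a sqrt(S) pseudocount."""
    p = (count + b * sqrt(column_total)) / (column_total + sqrt(column_total))
    return log2(p / b)


# First column of the TBP count matrix from the slide.
first_column = {"A": 61, "C": 145, "G": 152, "T": 31}
S = sum(first_column.values())

weights = {base: round(weight(c, S), 1) for base, c in first_column.items()}
print(weights)
```

A positive weight means the nucleotide is more frequent at that position than in the background, a negative weight means it is rarer.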

  • Matching a PWM on DNA

    • Position weight matrix (first six columns)

    A   [ -0.6 -2.3 +1.8 -3.7 +1.8 +1.4 ]
    C   [ +0.6 -1.0 -4.4 -2.8 -4.4 -4.4 ]
    G   [ +0.6 -2.2 -3.9 -3.9 -3.4 -4.4 ]
    T   [ -1.5 +1.6 -1.4 +1.9 -1.5 +0.3 ]

    Sequence:    C    C    T    T    A    T
    Score:    +0.6 -1.0 -1.4 +1.9 +1.8 +0.3 = +2.2

    A PWM of length k gives a score to each k-mer in the DNA
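The scoring step can be sketched as a sliding window over the DNA. The matrix below is the six-column PWM from the slide; the function names `score` and `scan` and the scanned sequence are made up for illustration.

```python
# Six-column PWM from the slide, one list of weights per nucleotide.
pwm = {
    "A": [-0.6, -2.3, +1.8, -3.7, +1.8, +1.4],
    "C": [+0.6, -1.0, -4.4, -2.8, -4.4, -4.4],
    "G": [+0.6, -2.2, -3.9, -3.9, -3.4, -4.4],
    "T": [-1.5, +1.6, -1.4, +1.9, -1.5, +0.3],
}


def score(pwm, kmer):
    """Sum the weight of each base at its position in the k-mer."""
    return sum(pwm[base][i] for i, base in enumerate(kmer))


def scan(pwm, dna):
    """Score every k-mer of dna; return (start, k-mer, score) tuples."""
    k = len(pwm["A"])
    return [
        (i, dna[i:i + k], score(pwm, dna[i:i + k]))
        for i in range(len(dna) - k + 1)
    ]


print(round(score(pwm, "CCTTAT"), 1))  # the worked example from the slide

# Hypothetical sequence: report the best-scoring window.
best = max(scan(pwm, "GGCTATAAGG"), key=lambda hit: hit[2])
print(best[0], best[1], round(best[2], 1))
```

In practice a score threshold decides which windows count as matches; choosing that threshold is a separate question not covered by this sketch.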

  • Information content in PWM

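    • Information content - how different a PWM is from the uniform distribution

    $$ IC = \sum_{i=1}^{n} \Bigl( \log_2 4 - \sum_{X = A,C,G,T} -p_{Xi} \log_2 p_{Xi} \Bigr) $$

    Each position contributes between 0 bits (uniform column) and $\log_2 4 = 2$ bits (a single nucleotide with probability 1).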
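A minimal sketch of the information content calculation, assuming the common convention that each position contributes log2 4 minus its Shannon entropy and that the contributions are summed over positions. The function names and the toy columns are hypothetical.

```python
from math import log2


def column_information(probs):
    """log2(4) minus the Shannon entropy of one column, in bits.

    Terms with p == 0 are skipped, following the convention 0 * log2(0) = 0.
    """
    entropy = -sum(p * log2(p) for p in probs if p > 0)
    return log2(4) - entropy


def information_content(columns):
    """Total information content: the sum over all positions."""
    return sum(column_information(col) for col in columns)


uniform = [0.25, 0.25, 0.25, 0.25]   # carries no information
fixed = [1.0, 0.0, 0.0, 0.0]         # always the same nucleotide

print(column_information(uniform))
print(column_information(fixed))
print(information_content([uniform, fixed]))
```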

  • HMMs - Hidden Markov Models

    Observations: Sunny, Rainy, Sunny, Sunny, Rainy, Snowy, Snowy, ...

    Hidden states: Summer, Fall, Fall, Fall, Fall, Winter, Winter, ...

    http://webdocs.cs.ualberta.ca/~colinc/cmput606/606FinalPres.ppt

  • Tasks with HMMs

    Scoring task:

    • Given an existing HMM and an observed sequence, what is the probability that the HMM generates the sequence?

    Alignment task:

    • Given a sequence, what is the optimal state sequence that the HMM would use to generate it?

    Training task:

    • Given a large amount of data, how can we estimate the structure and the parameters of the HMM that best accounts for the data?
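The scoring and alignment tasks are commonly solved with the forward and Viterbi algorithms. The sketch below applies both to the seasons-and-weather example; all probabilities are invented for illustration and are not taken from the lecture. The training task (for example with Baum-Welch) is not shown.

```python
# Hypothetical model parameters; every number here is an assumption.
states = ("Summer", "Fall", "Winter")
start = {"Summer": 0.6, "Fall": 0.3, "Winter": 0.1}
trans = {
    "Summer": {"Summer": 0.7, "Fall": 0.3, "Winter": 0.0},
    "Fall":   {"Summer": 0.0, "Fall": 0.7, "Winter": 0.3},
    "Winter": {"Summer": 0.3, "Fall": 0.0, "Winter": 0.7},
}
emit = {
    "Summer": {"Sunny": 0.8, "Rainy": 0.2, "Snowy": 0.0},
    "Fall":   {"Sunny": 0.3, "Rainy": 0.6, "Snowy": 0.1},
    "Winter": {"Sunny": 0.2, "Rainy": 0.1, "Snowy": 0.7},
}


def forward(obs):
    """Scoring task: P(obs | model), summed over all hidden state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs):
    """Alignment task: (probability, path) of the most probable state path."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda t: t[0],
            )
            new[s] = (prob * emit[s][o], path + [s])
        best = new
    return max(best.values(), key=lambda t: t[0])


obs = ["Sunny", "Rainy", "Sunny", "Sunny", "Rainy", "Snowy", "Snowy"]
print(forward(obs))
print(viterbi(obs))
```

For long sequences the products underflow, so real implementations usually work with log probabilities or scaling; that refinement is left out here to keep the sketch short.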

  • Hidden Markov Models in Sequence Analysis

    Task: represent the following sequences in a model

    ACA---ATG
    TCAACTATC
    ACAC--AGC
    AGA---ATC
    ACCG--ATC

    • IUPAC consensus: ACANNNATC

    • Regular expression: ACA[ACGT]{0,3}ATC

    • PWM

    • HMM

    http://www.evl.uic.edu/shalini/coursework/hmm.ppt
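One way to arrive at the consensus ACANNNATC is a per-column majority vote over the alignment. The sketch below assumes a simple rule (emit N when gaps dominate a column or no base has a strict majority); the rule and the function name `majority_consensus` are illustrative assumptions, not necessarily how the slide was produced. It also checks how many of the ungapped sequences the regular expression matches exactly.

```python
import re
from collections import Counter

alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]


def majority_consensus(rows):
    """Majority base per column; N when gaps dominate or there is no strict majority."""
    consensus = []
    for column in zip(*rows):
        symbol, count = Counter(column).most_common(1)[0]
        has_majority = symbol != "-" and 2 * count > len(column)
        consensus.append(symbol if has_majority else "N")
    return "".join(consensus)


print(majority_consensus(alignment))

# Exact matching against the regular expression, after removing gaps.
pattern = re.compile(r"ACA[ACGT]{0,3}ATC")
exact = [row for row in alignment if pattern.fullmatch(row.replace("-", ""))]
print(len(exact))
```

Each of the five sequences differs from the consensus in one conserved position, so none of them matches the regular expression exactly. Probabilistic models such as PWMs and HMMs tolerate this kind of variation by scoring a sequence instead of accepting or rejecting it.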

  • HMMs in gene finding

    • S - gene start

    • D - donor splice site

    • E - exon

    • I - intron

    • A - acceptor splice site

    • T - gene termination

  • DNA sequence analysis

    ... is mostly about:

    • building a model for some DNA feature (or often just learning the parameters)

    • searching for sites which match a model (or match it well enough)

    • statistical analysis about what features are over-represented in some region of DNA

  • Common tasks in DNA sequence analysis

    • GC-content

    • CpG islands

    • Masking repeats

    • Melting temperature

    • PCR primer design

    • TFBS motif discovery

    • TFBS search

    • Gene search

    • Codon usage

    • Sequence alignment

    • and many more
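The first task on the list, GC-content, is simple enough to sketch in a few lines. The function names `gc_content` and `windowed_gc` and the example sequences are hypothetical; real analyses would typically also decide how to treat ambiguous bases such as N, which this sketch just ignores.

```python
def gc_content(seq):
    """Fraction of G and C among the A, C, G and T characters of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window, step):
    """GC-content of consecutive windows, e.g. as a first look for GC-rich regions."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("ATGCGC"))
print(windowed_gc("AAAAGGGG", window=4, step=4))
```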

  • Clustering

  • What is Cluster Analysis?

    Finding groups of objects such that the objects are:

    • similar (or related) to the objects in the same group and

    • different from (or unrelated to) the objects in other groups

    Short distance

    Long distance

  • Why cluster biological data?

    • Intuition building

    • Hypothesis generation

    • Summarizing / compressing large data

  • Partitional vs Hierarchical

    Partitional clustering finds a fixed number of clusters

    Hierarchical clustering creates a series of clusterings contained in each other
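As an example of a partitional method, the sketch below implements a bare-bones k-means. The slides do not name a specific partitional algorithm, so k-means is used here only as a familiar representative; the points and the initial centers are invented.

```python
from math import dist


def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            clusters[nearest].append(p)
        centers = [
            # Keep the old center if a cluster ends up empty.
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (9, 9)])
print(centers)
print(clusters)
```

The number of clusters is fixed in advance by the number of initial centers, which is the defining property of partitional clustering mentioned above.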

  • Fuzzy vs Non-Fuzzy

    Fuzzy: each object belongs to each cluster with some weight (the weight can be zero)

    Non-fuzzy: each object belongs to exactly one cluster

  • Hierarchical clustering

    Hierarchical clustering is usually depicted as a dendrogram (tree)

  • Hierarchical clustering

    • Each subtree corresponds to a cluster

    • Height of branching shows distance

  • Hierarchical clustering (0)

    Algorithm for Agglomerative Hierarchical Clustering:

    Join the two closest objects

  • Hierarchical clustering (1)

    Join the two closest objects

  • Hierarchical clustering (2)

    Keep joining the closest pairs

  • Hierarchical clustering (3)

    Keep joining the closest pairs
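The joining procedure above can be sketched as follows. The slides do not say how the distance between two clusters is measured, so this sketch assumes single linkage (the distance between the closest pair of members); the function names and the example points are hypothetical.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of members (an assumption)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points):
    """Repeatedly join the two closest clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples;
    the distances are the branching heights of the dendrogram.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        distance = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical one-dimensional objects.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges = agglomerative(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

This naive version recomputes all pairwise cluster distances at every step, which is fine for a handful of objects but too slow for large data sets; other linkage choices (complete, average) only change `single_linkage`.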
