Computational Genomics and Proteomics

Lecture 8

Motif Discovery

Centre for Integrative Bioinformatics VU

Outline

•Gene Regulation
    •DNA
    •Transcription factors
•Motifs
    •What are they?
    •Binding Sites
•Combinatoric Approaches
    •Exhaustive searches
    •Consensus
•Comparative Genomics
    •Example
•Probabilistic Approaches
    •Statistics
    •EM algorithm
    •Gibbs Sampling

www.accessexcellence.org

Four DNA nucleotide building blocks

G-C is more strongly hydrogen-bonded than A-T

Degenerate code

Four bases: A, C, G, T

Two-fold degenerate IUB codes:

R=[AG] -- Purines
Y=[CT] -- Pyrimidines
K=[GT]
M=[AC]
S=[GC]
W=[AT]

Four-fold degenerate: N=[AGCT]
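The degenerate codes above map directly onto character classes, so a degenerate pattern can be searched with a regular expression. The following is a minimal sketch; the function names iub_to_regex and find_sites are illustrative, and only the codes listed on this slide are included.

```python
import re

# IUB codes listed on the slide (three-fold codes are not listed there and are left out).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]",
    "M": "[AC]", "S": "[GC]", "W": "[AT]",
    "N": "[ACGT]",
}

def iub_to_regex(pattern):
    """Translate a degenerate pattern such as 'TATRNT' into a regular expression."""
    return "".join(IUB[ch] for ch in pattern.upper())

def find_sites(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead is used so that overlapping matches are all reported.
    regex = re.compile("(?=" + iub_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]

print(iub_to_regex("TATRNT"))                       # TAT[AG][ACGT]T
print(find_sites("TATRNT", "GGTATAATCCTATGTTAA"))   # [2, 10]
```

The toy sequence contains TATAAT at position 2 and TATGTT at position 10; both match the degenerate pattern TATRNT.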

Transcription Factors

•Required but not a part of the RNA polymerase complex

•Many different roles in gene regulation: binding, interaction, initiation, enhancing, repressing

•Various structural classes (e.g. zinc finger domains)

•Consist of both a DNA-binding domain and an interactive domain

Motifs

•Short sequences of DNA or RNA (or amino acids)
•Often consist of 5-16 nucleotides
•May contain gaps
•Examples include:
    •Splice sites
    •Start/stop codons
    •Transmembrane domains
    •Centromeres
    •Phosphorylation sites
    •Coiled-coil domains
    •Transcription factor binding sites (TFBS, i.e. regulatory motifs)

TFBSs

•Difficult to identify
•Each transcription factor may have more than one binding site
•Degenerate
•Most occur upstream of the transcription start site (TSS) but are known to also occur in:
    •introns
    •exons
    •3’ UTRs
•Usually occur in clusters, i.e. collections of sites within a region (modules)
•Often repeated
•Sites can be experimentally verified

Why are TFBSs important?

Aid in identification of gene networks/pathways

Determine correct network structure

Drug discovery

Switch production of gene product on/off

Gene A Gene B

Consensus sequences

•Matches all of the example sequences closely but not exactly
•A single site: TACGAT
•A set of sites:
    TACGAT
    TATAAT
    TATAAT
    GATACT
    TATGAT
    TATGTT
•Consensus sequence: TATAAT or TATRNT

Trade-off: number of mismatches allowed, ambiguity in consensus sequence and the sensitivity and precision of the representation.
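The trade-off can be made concrete with the six example sites. The sketch below derives a majority-rule consensus and counts how many sites it matches as the number of allowed mismatches grows. The function names and the alphabetical tie-break are assumptions made for illustration only.

```python
from collections import Counter

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

def majority_consensus(sites):
    """Most frequent base in each column; ties are broken alphabetically (an arbitrary choice)."""
    consensus = []
    for column in zip(*sites):
        counts = Counter(column)
        consensus.append(max(sorted(counts), key=lambda base: counts[base]))
    return "".join(consensus)

def mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def sites_matched(consensus, sites, max_mismatches):
    """How many sites lie within max_mismatches of the consensus."""
    return sum(mismatches(consensus, s) <= max_mismatches for s in sites)

consensus = majority_consensus(SITES)
print(consensus)                                               # TATAAT
print([sites_matched(consensus, SITES, k) for k in range(3)])  # [2, 3, 6]
```

With no mismatches allowed, TATAAT recovers only two of the six sites. Allowing two mismatches recovers all six, but the same relaxed rule would also accept many unrelated words in a genome, which costs precision.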

Information Content and Entropy

Sequence Logos
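In a sequence logo, the height of each column is usually its information content. The sketch below computes that quantity for the example -10 sites used in this lecture, assuming a uniform background (maximum of 2 bits per DNA position) and no small-sample correction. Both are simplifying assumptions, and column_information is an illustrative name.

```python
import math
from collections import Counter

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

def column_information(sites, alphabet_size=4):
    """Information content in bits per column: log2(alphabet size) minus Shannon entropy."""
    info = []
    for column in zip(*sites):
        n = len(column)
        counts = Counter(column)
        # Bases that never occur contribute nothing to the entropy.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(math.log2(alphabet_size) - entropy)
    return info

for position, bits in enumerate(column_information(SITES), start=1):
    print(position, round(bits, 2))
```

Fully conserved columns (positions 2 and 6 here) reach the maximum of 2 bits, while position 4, split evenly between A and G, carries 1 bit.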

Frequency Matrices

Given a collection of motifs,

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

create the matrix (rows in the order T, A, C, G; one column per motif position):
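A minimal sketch of how the count and frequency matrices can be built from the six example sites, with rows in the order T, A, C, G. The function names are illustrative.

```python
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

def count_matrix(sites, alphabet="TACG"):
    """Count how often each base occurs at each motif position."""
    width = len(sites[0])
    counts = {base: [0] * width for base in alphabet}
    for site in sites:
        for position, base in enumerate(site):
            counts[base][position] += 1
    return counts

def frequency_matrix(counts):
    """Divide the counts by the number of sites to obtain relative frequencies."""
    n_sites = sum(row[0] for row in counts.values())
    return {base: [c / n_sites for c in row] for base, row in counts.items()}

counts = count_matrix(SITES)
for base, row in counts.items():
    print(base, row)
# T [5, 0, 5, 0, 1, 6]
# A [0, 6, 0, 3, 4, 0]
# C [0, 0, 1, 0, 1, 0]
# G [1, 0, 0, 3, 0, 0]
```

Every column sums to six, the number of sites, so dividing by six gives the frequency matrix.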

Position weight matrices
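A position weight matrix is commonly built as the log-odds of the observed base frequencies against a background model, and a candidate word is then scored by summing one entry per position. The sketch below assumes a uniform background and a pseudocount of 1 spread over the four bases. Both are illustrative choices, not necessarily those used in the lecture.

```python
import math

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

def build_pwm(sites, background=None, pseudocount=1.0):
    """Log2-odds weight for each base at each position (pseudocount avoids log of zero)."""
    alphabet = "ACGT"
    background = background or {base: 0.25 for base in alphabet}
    n = len(sites)
    pwm = []
    for column in zip(*sites):
        weights = {}
        for base in alphabet:
            freq = (column.count(base) + pseudocount * background[base]) / (n + pseudocount)
            weights[base] = math.log2(freq / background[base])
        pwm.append(weights)
    return pwm

def score(pwm, word):
    """Sum of the position-specific weights for a word of the motif's width."""
    return sum(column[base] for column, base in zip(pwm, word))

def best_hit(pwm, sequence):
    """Start position of the highest-scoring window in a sequence."""
    width = len(pwm)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: score(pwm, sequence[i:i + width]))

pwm = build_pwm(SITES)
print(round(score(pwm, "TATAAT"), 2), round(score(pwm, "GGGCCC"), 2))
print(best_hit(pwm, "GGCCTATAATGGCC"))   # 4
```

Unlike a consensus with a mismatch threshold, the score distinguishes a mismatch at a strongly conserved position from one at a variable position.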

Finding Motifs

Two problems:
•Given a collection of known motifs, develop a representation of the motifs such that additional occurrences can reliably be identified in new promoter regions
•Given a collection of genes, thought to be related somehow, find the location of the motif common to all and a representation for it

Two approaches:
•Combinatorial
•Probabilistic

Combinatorial Approach

Exhaustive Search

Sample-driven here refers to trying only the words that actually occur in the sequences, instead of exhaustively trying all 4^W possible words of width W
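The difference between the two candidate sets can be sketched as follows. Each candidate word is scored by the number of sequences that contain it with at most d mismatches. The toy sequences, the scoring rule and all function names are assumptions made for illustration.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def occurs(word, sequence, d):
    """True if some window of the sequence is within d mismatches of the word."""
    w = len(word)
    return any(hamming(word, sequence[i:i + w]) <= d
               for i in range(len(sequence) - w + 1))

def support(word, sequences, d):
    """Number of sequences containing the word with at most d mismatches."""
    return sum(occurs(word, s, d) for s in sequences)

def exhaustive_candidates(w):
    """All 4^w words over the DNA alphabet."""
    return ("".join(p) for p in product("ACGT", repeat=w))

def sample_driven_candidates(sequences, w):
    """Only the words of width w that actually occur in the input."""
    return {s[i:i + w] for s in sequences for i in range(len(s) - w + 1)}

def best_word(candidates, sequences, d):
    """Candidate with the highest support (first in alphabetical order on ties)."""
    return max(sorted(candidates), key=lambda word: support(word, sequences, d))

seqs = ["CCGTATAATGG", "GGTATGATCCA", "ATATAATCGCG"]   # toy data
sample = sample_driven_candidates(seqs, 6)
print(len(sample), "sample-driven candidates versus", 4 ** 6, "exhaustive")
print(support("TATAAT", seqs, 1))   # 3
```

The sample-driven set grows with the input size rather than with 4^W, at the price of missing a motif whose best representative word never occurs exactly in any input sequence.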

Greedy Motif Clustering

Comparative Genomics

Main idea: conserved non-coding regions are important
•Align the promoters of orthologous co-expressed genes from two (or more) species, e.g. human and mouse
•Search for TFBS only in conserved regions

Problems:
•Not all regulatory regions are conserved
•Which genomes to use?

Phylogenetic Footprinting

Phylogenetic footprinting refers to the task of finding motifs that are conserved across different species. Common ancestry and selection on these motifs have resulted in these “footprints”.

Phylogenetic Footprinting: An Example

Xie et al. 2005, Nature 434, 338-345

•Genome-wide alignments for four species (human, mouse, rat, dog)
•Promoter regions and 3’ UTRs then extracted for 17,700 well-annotated genes
•Promoter region taken to be (-2000, 2000), i.e. in bp relative to the TSS
•This set of sequences then searched exhaustively for motifs

The Search (Xie et al. 2005)

Expected Rate

Probabilistic Approach

Gibbs Sampling (applied to Motif Finding)

Gibbs Sampling Algorithm
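A minimal sketch of a Gibbs site sampler for motif finding. It assumes exactly one site of fixed width w per sequence, a uniform background, and a pseudocount of 1 per base. Published tools such as AlignACE differ in many details, and the function names here are illustrative.

```python
import random

def profile_without(sequences, positions, w, skip, pseudocount=1.0):
    """Per-column base probabilities from the current sites of all sequences except `skip`."""
    profile = []
    for j in range(w):
        counts = {base: pseudocount for base in "ACGT"}
        for k, (seq, pos) in enumerate(zip(sequences, positions)):
            if k != skip:
                counts[seq[pos + j]] += 1
        total = sum(counts.values())
        profile.append({base: c / total for base, c in counts.items()})
    return profile

def site_weight(profile, word, background=0.25):
    """Likelihood ratio of a word under the motif profile versus a uniform background."""
    weight = 1.0
    for column, base in zip(profile, word):
        weight *= column[base] / background
    return weight

def gibbs_sampler(sequences, w, iterations=2000, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Start from random site positions.
    positions = [rng.randrange(len(s) - w + 1) for s in sequences]
    for _ in range(iterations):
        k = rng.randrange(len(sequences))                     # hold one sequence out
        profile = profile_without(sequences, positions, w, k)
        seq = sequences[k]
        weights = [site_weight(profile, seq[i:i + w])
                   for i in range(len(seq) - w + 1)]
        # Sample (rather than maximise) a new position in proportion to its weight.
        positions[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions

seqs = ["CCGTATAATGG", "GGTATGATCCA", "ATATAATCGCG"]   # toy data
print(gibbs_sampler(seqs, 6))
```

Because a new position is sampled instead of always taking the best one, the procedure can leave a poor local optimum. In practice several runs from different random starts are compared.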

Gibbs Sampling – Motif Positions

AlignACE - Gibbs Sampling

Remainder of the lecture: Maximum likelihood and the EM algorithm

The remaining slides are for your information only and will not be part of the exam.

Basic Statistics

Maximum Likelihood Estimates

EM Algorithm

Basic idea (MEME)

http://meme.nbcr.net/meme/meme-intro.html

MEME is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences.

MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.

MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.

Basic MEME Model

MEME Background frequencies

MEME – Hidden Variable

MEME – Conditional Likelihood

EM algorithm

Example

E-step of EM algorithm

Example

M-step of EM Algorithm

Example
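The E-step and M-step can be sketched for the simplest MEME-style model, in which every sequence contains exactly one motif occurrence. This is a minimal illustration rather than MEME itself: the background is fixed and uniform, the motif is initialised from a starting word, and the name em_oops and all parameter values are assumptions.

```python
def em_oops(sequences, w, start_word, iterations=20, pseudocount=0.1):
    """EM for a motif of width w with one occurrence per sequence.

    Returns the motif (list of per-position base probabilities) and z,
    where z[i][j] is the posterior probability that the site in sequence i starts at j.
    """
    alphabet = "ACGT"
    # Initial motif: probability 0.7 for the base of the starting word, 0.1 for the others.
    motif = [{b: (0.7 if b == c else 0.1) for b in alphabet} for c in start_word]
    background = {b: 0.25 for b in alphabet}   # fixed uniform background (assumption)
    z = []
    for _ in range(iterations):
        # E-step: posterior over start positions, given the current motif.
        z = []
        for seq in sequences:
            weights = []
            for j in range(len(seq) - w + 1):
                ratio = 1.0
                for k in range(w):
                    ratio *= motif[k][seq[j + k]] / background[seq[j + k]]
                weights.append(ratio)
            total = sum(weights)
            z.append([x / total for x in weights])
        # M-step: re-estimate the motif from expected base counts.
        counts = [{b: pseudocount for b in alphabet} for _ in range(w)]
        for seq, z_i in zip(sequences, z):
            for j, p in enumerate(z_i):
                for k in range(w):
                    counts[k][seq[j + k]] += p
        motif = [{b: c / sum(col.values()) for b, c in col.items()} for col in counts]
    return motif, z

seqs = ["CCGTATAATGG", "GGTATGATCCA", "ATATAATCGCG"]   # toy data
motif, z = em_oops(seqs, 6, "TATAAT")
print([max(range(len(z_i)), key=z_i.__getitem__) for z_i in z])
```

EM keeps a full probability distribution over the start positions in every sequence and updates the motif deterministically, whereas a Gibbs sampler commits to one sampled position per sequence at a time.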

Characteristics of EM

Gibbs Sampling (versus EM)
