bioinformatics t8-go-hmm v2014

FBW

2-12-2014

Wim Van Criekinge

Gene Prediction, HMM & ncRNA

What to do with an unknown

sequence ?

Gene Ontologies

Gene Prediction

Composite Gene Prediction

Non-coding RNA

HMM

UNKNOWN PROTEIN SEQUENCE

LOOK FOR:

• Similar sequences in databases ((PSI-)BLAST)

• Distinctive patterns/domains associated

with function

• Functionally important residues

• Secondary and tertiary structure

• Physical properties (hydrophobicity, IEP

etc)

BASIC INFORMATION COMES FROM SEQUENCE

• One sequence: gives some information, e.g. amino acid properties

• More than one sequence: gives more information on conserved residues, fold and function

• Multiple alignments of related sequences: can build up consensus sequences of known families, domains, motifs or sites

• Sequence alignments can give information on loops, families and function from conserved regions

Additional analysis of protein sequences

• transmembrane

regions

• signal sequences

• localisation

signals

• targeting

sequences

• GPI anchors

• glycosylation sites

• hydrophobicity

• amino acid

composition

• molecular weight

• solvent accessibility

• antigenicity

FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES

• Pattern - short, simplest, but limited

• Motif - conserved element of a sequence

alignment, usually predictive of structural or

functional region

To get more information across whole

alignment:

• Profile

• HMM

PATTERNS

• Small, highly conserved regions

• Shown as regular expressions

Example:

[AG]-x-V-x(2)-x-{YW}

– [ ] lists the amino acids allowed at a position

– x is any amino acid

– x(2) is any amino acid at each of the next 2 positions

– { } lists the amino acids not allowed at a position

BUT: limited to near-exact matches in a small region
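A pattern written in this notation can be turned into an ordinary regular expression. The sketch below is a minimal illustration, not PROSITE's own tooling: the function name prosite_to_regex and the test peptide are made up for this example, and only the syntax elements listed on this slide (plus ranges such as x(2,4)) are handled.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2)" into its core ("x") and repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            token = "."                       # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"   # any amino acid except these
        else:
            token = core                      # single residue or [..] set
        if low and high:
            token += "{%s,%s}" % (low, high)
        elif low:
            token += "{%s}" % low
        parts.append(token)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
print(regex)
print(bool(re.search(regex, "MMALVKKKDEE")))  # hypothetical peptide
```

Because the result is a plain regex, it inherits the limitation noted on the slide: a single substitution outside the allowed sets makes the match fail.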

PROFILES

• Table or matrix containing comparison

information for aligned sequences

• Used to find sequences similar to

alignment rather than one sequence

• Contains same number of rows as

positions in sequences

• Row contains score for alignment of

position with each residue
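A profile of this kind can be sketched in a few lines. The example below rests on assumptions that are not in the slide: a three-sequence toy alignment, a uniform background of 1/20 per residue, a pseudocount of 1, and log-odds scores in bits. The names build_profile and score_segment are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one row per alignment position; each row maps residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    profile = []
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for residue in AMINO_ACIDS:
            freq = (column.count(residue) + pseudocount) / total
            row[residue] = math.log2(freq / background)
        profile.append(row)
    return profile

def score_segment(profile, segment):
    """Sum the per-position scores of an ungapped segment of the same length."""
    return sum(row[residue] for row, residue in zip(profile, segment))

toy_alignment = ["ACDE", "ACDF", "GCDE"]   # made-up aligned sequences
profile = build_profile(toy_alignment)
print(round(score_segment(profile, "ACDE"), 2))
print(round(score_segment(profile, "WWWW"), 2))
```

The profile has as many rows as the alignment has positions, and a segment resembling the alignment scores higher than an unrelated one. Gaps are not handled here.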

HIDDEN MARKOV MODELS (HMM)

• An HMM is a large-scale profile with gaps,

insertions and deletions allowed in the

alignments, and built around probabilities

• Package used: HMMER (http://hmmer.wustl.edu/)

• Start with one sequence or alignment: build the model with hmmbuild, calibrate it with hmmcalibrate, then search a database with the HMM

• E-value: the number of false matches expected with a certain score

• Assume an extreme value distribution (EVD) for noise; calibrate by searching random sequences with the HMM to build up the noise curve

HMM

Sequence

http://smart.embl-heidelberg.de/


http://www.ebi.ac.uk/interpro


http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=37813344&dopt=GenPept&term=insulin+receptor&qty=1



What is an ontology?

• An ontology is an explicit

specification of a conceptualization.

• A conceptualization is an abstract,

simplified view of the world that we

want to represent.

• If the specification medium is a formal representation, the ontology defines the vocabulary.

Why Create Ontologies?

• to enable data exchange among

programs

• to simplify unification (or translation)

of disparate representations

• to employ knowledge-based services

• to embody the representation of a

theory

• to facilitate communication among

people

Summary

• Ontologies are what they do:

artifacts to help people and their

programs communicate, coordinate,

collaborate.

• Ontologies are essential elements in

the technological infrastructure of

the Knowledge Age

• http://www.geneontology.org/


• Molecular Function: elemental activity or task (e.g. nuclease, DNA binding, transcription factor)

• Biological Process: broad objective or goal (e.g. mitosis, signal transduction, metabolism)

• Cellular Component: location or complex (e.g. nucleus, ribosome, origin recognition complex)

The Three Ontologies

DAG Structure

Directed acyclic graph: each child may have one or more parents

Example - Molecular Function

Example - Biological Process

Example - Cellular Location

AmiGO browser

GO: Applications

• Eg. chip-data analysis: Overrepresented item

can provide functional clues

• Overrepresentation check: contingency table

– Chi-square test (or Fisher's exact test if an expected cell frequency is < 5)
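The overrepresentation check can be sketched as a one-sided Fisher exact test, computed here directly from the hypergeometric distribution with the standard library only. The function name and the example counts are invented for illustration; a real analysis would also correct for testing many GO terms.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a list of n,
    when K of the N genes on the chip carry the GO term."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical numbers: 100 genes on the chip, 20 annotated with the term,
# 10 genes differentially expressed, 8 of which carry the term.
print(enrichment_p_value(8, 10, 20, 100))
```

A small p-value says the term occurs in the gene list more often than expected by chance, which is the "functional clue" mentioned above.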



Problem:

Given a very long DNA sequence, identify coding

regions (including intron splice sites) and their

predicted protein sequences

Computational Gene Finding

Eukaryotic gene structure


• There is no (yet known) perfect method

for finding genes. All approaches rely on

combining various “weak signals”

together

• Find elements of a gene

– coding sequences (exons)

– promoters and start signals

– poly-A tails and downstream signals

• Assemble into a consistent gene model


Genefinder

GENE STRUCTURE INFORMATION - POSITION ON PHYSICAL MAP

This gene structure corresponds to the position on the physical map

GENE STRUCTURE INFORMATION - ACTIVE ZONE

This gene structure shows the Active Zone

The Active Zone limits the extent of

analysis, genefinder & fasta dumps

A blue line within the yellow box

indicates regions outside of the active

zone

The active zone is set by entering

coordinates in the active zone (yellow

box)

GENE STRUCTURE INFORMATION - POSITION

This gene structure relates to the Position:

Change origin of

this scale by

entering a

number in the

green 'origin'

box

GENE STRUCTURE INFORMATION - PREDICTED GENE STRUCTURE

This gene structure relates to the predicted gene structures

Boxes are Exons,

thin lines (or

springs) are Introns

Find the open reading frames

GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT

Any sequence has 3 potential reading frames (+1, +2, +3)

Its complement also has three potential reading frames (-1, -2, -3)

6 possible reading frames

The triplet, non-punctuated nature of the genetic code helps us out

64 potential codons

61 true codons

3 stop codons (TGA, TAA, TAG)

In a random sequence, approximately 1 in 21 codons (3/64) will be a stop

E K A P A Q S E M V S L S F H R

K K L L P N L K W L A Y L S T

K S S C P I * N G * P I F P P
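The three forward translations above can be reproduced with a short script. This is a minimal sketch using the standard genetic code; the function names are made up for this example, and incomplete codons at the end of a frame are simply dropped.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna, offset=0):
    """Translate complete codons starting at the given offset (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frames(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna, offset)
        frames["-%d" % (offset + 1)] = translate(rc, offset)
    return frames

seq = "GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT"
for name, protein in sorted(six_frames(seq).items()):
    print(name, protein)
```

Frame +3 contains two stop codons (*) within 15 codons, while frames +1 and +2 are open over the whole fragment.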

GENE STRUCTURE INFORMATION - OPEN READING FRAMES

This gene structure relates to Open reading Frames

There is one column

for each frame

Small horizontal

lines represent stop

codons

They have one

column for each

frame

The size indicates

relative score for the

particular start site

GENE STRUCTURE INFORMATION - START CODONS

This gene structure represents Start Codons

• Amino acid distributions are biased

e.g. p(A) > p(C)

• Pairwise distributions also biased

e.g. p(AT)/[p(A)*p(T)] > p(AC)/[p(A)*p(C)]

• Nucleotides that code for preferred amino

acids (and AA pairs) occur more frequently in

coding regions than in non-coding regions.

• Codon biases (per amino acid)

• Hexanucleotide distributions that reflect those

biases indicate coding regions.

Computational Gene Finding: Hexanucleotide frequencies

Gene prediction

Generation of datasets (Ensmart@Ensembl):

Dataset 1 (http://biobix.ugent.be/txt/coding.txt) consists of >900 coding regions (DNA):

Dataset 2 (http://biobix.ugent.be/txt/noncoding.txt) consists of >900 non-coding regions

Distance Array: Calculate for every base all the distances (in bp) to the same nucleotide (focus on the first 1000 bp of the coding region and limit the distance array to a window of 1000 bp)

Do you see a difference in this “distance array” between coding and noncoding sequence ?

Could it be used to predict genes ?
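One possible reading of the "distance array" is sketched below: for every base, count how often the same nucleotide occurs again d bp downstream, for every d up to the window size. This interpretation, the function name and the toy input are assumptions made for illustration, not the reference solution to the exercise.

```python
def distance_array(seq, window=1000):
    """counts[d] = number of position pairs (i, i + d) carrying the same nucleotide."""
    seq = seq.upper()
    counts = [0] * (window + 1)
    for i, base in enumerate(seq):
        last = min(i + window, len(seq) - 1)
        for j in range(i + 1, last + 1):
            if seq[j] == base:
                counts[j - i] += 1
    return counts

# Toy input with a perfect period of 3; real input would be the first
# 1000 bp of each sequence in the coding and non-coding datasets.
print(distance_array("ACGACGACG", window=5))
```

Comparing the counts at distances that are multiples of 3 with the other distances, separately for the coding and the non-coding dataset, is one way to look for the difference the question asks about.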

Write a program to predict genes in the following genomic sequence (http://biobix.ugent.be/txt/genomic.txt)

What else could help in finding genes in raw genomic sequences ?

http://www.ensembl.org/Multi/martview?species=Homo_sapiens


GENE STRUCTURE INFORMATION - CODING POTENTIAL

This gene structure corresponds to the Coding Potential

The grey boxes indicate

regions where the codon

frequencies match those of

known C. elegans genes.

the larger the grey box the

more this region resembles a

C. elegans coding element

blastn (EST)

For raw DNA sequence analysis, blastx is extremely useful

It searches your DNA sequence, translated in all six reading frames, against the protein database

A match (homolog) gives you some idea of the function

One problem is the large number of genome sequences: you will get matches to database entries that were annotated strictly by sequence homology, so you often need some experimental evidence

GENE STRUCTURE INFORMATION - SEQUENCE SIMILARITY

This feature shows protein sequence similarity

The blue boxes indicate

regions of sequence which

when translated have

similarity to previously

characterised proteins.

To view the alignment,

select the right mouse

button whilst over the blue

box.

GENE STRUCTURE INFORMATION - EST MATCHES

This gene structure relates to Est Matches

The yellow boxes represent

DNA matches (Blast) to C.

elegans Expressed Sequence

Tags (ESTS)

To view the alignment use the

right mouse button whilst

over the yellow box to invoke

Blixem

Borodovsky et al., 1999, Organization of the Prokaryotic Genome (Charlebois, ed) pp. 11-34

New generation of programs that predict coding sequences based on a non-random repeat pattern (e.g. Glimmer, GeneMark); these perform quite well

• CpG islands are regions of sequence that have a high proportion of CG dinucleotides (the p is the phosphodiester bond linking them)

– CpG islands are present in the promoter and

exonic regions of approximately 40% of

mammalian genes

– Other regions of the mammalian genome contain

few CpG dinucleotides and these are largely

methylated

• Definition: sequences of >500 bp with

– G+C > 55%

– Observed(CpG)/Expected(CpG) > 0.65
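The definition translates directly into a check on a candidate region. The slide does not say how Expected(CpG) is computed; the sketch below assumes the commonly used formula (number of C × number of G) / length. Function names are hypothetical, and the thresholds are the ones listed above.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0      # assumed definition of Expected(CpG)
    ratio = observed / expected if expected else 0.0
    return gc_fraction, ratio

def looks_like_cpg_island(seq, min_len=500, min_gc=0.55, min_ratio=0.65):
    """Apply the three criteria to one candidate region."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) > min_len and gc_fraction > min_gc and ratio > min_ratio

print(looks_like_cpg_island("CG" * 300))   # artificial CpG-rich sequence
print(looks_like_cpg_island("AT" * 300))   # artificial AT-only sequence
```

To scan a genome, the same check would be applied to a sliding window; that loop is left out here.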


GENE STRUCTURE INFORMATION - REPEAT FAMILIES

This gene structure corresponds to Repeat Families

This column shows

matches to members of a

number of repeat families

Currently a hidden markov

model is used to detect

these

GENE STRUCTURE INFORMATION - REPEATS

This gene structure relates to Repeats

This column shows regions

of localised repeats both

tandem and inverted

Clicking on the boxes will

show the complete repeat

information in the blue line

at the top end of the screen

Exon/intron boundaries

• Most Eukaryotic introns have a

consensus splice signal: GU at the

beginning (“donor”), AG at the end

(“acceptor”).

• Variation does occur in the splice sites

• Many AGs and GTs are not splice sites.

• Database of experimentally validated human splice sites: http://www.ebi.ac.uk/~thanaraj/splice.html
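The point that many GTs and AGs are not splice sites is easy to see by listing every GT...AG span in a sequence. The sketch below works on the DNA strand (GT rather than GU); the function name, the minimum length of 4 and the toy sequence are assumptions for illustration only.

```python
def candidate_introns(dna, min_len=4):
    """Return (start, end) of every span that begins with GT and ends with AG.

    Coordinates are 0-based and end-exclusive, so dna[start:end] is the span.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]

toy = "AAGTAAAGCC"   # made-up sequence
for start, end in candidate_introns(toy):
    print(start, end, toy[start:end])
```

On real genomic DNA the number of candidates grows roughly with the square of the sequence length, which is why gene finders score each site instead of trusting the dinucleotides alone.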

Computational Gene Finding: Splice junctions

GENE STRUCTURE INFORMATION - PUTATIVE SPLICE SITES

This gene structure shows putative splice sites

The Splice Sites are shown

'Hooked'

The Hook points in the

direction of splicing, therefore

3' splice sites point up and 5'

Splice sites point down

The colour of the Splice Site

indicates the position at which

it interrupts the Codon

The height of the Splices is

proportional to the Genefinder

score of the Splice Site



• Recall that profiles are matrices that

identify the probability of seeing an

amino acid at a particular location in a

motif.

• What about motifs that allow insertions

or deletions (together, called indels)?

• Patterns and regular expressions can

handle these easily, but profiles are

more flexible.

• Can indels be integrated into profiles?

Towards profiles (PSSM) with indels – insertions and/or deletions

• Need a representation that allows

specification of the probability of

introducing (and/or extending) a gap in

the profile.

A .1

C .05

D .2

E .08

F .01

Gap A .04

C .1

D .01

E .2

F .02

Gap A .2

C .01

D .05

E .1

F .06

delete

continue

Hidden Markov Models: Graphical models of sequences

• A sequence is said to be Markovian if the

probability of the occurrence of an element in

a particular position depends only on the

previous elements in the sequence.

• Order of a Markov chain depends on how

many previous elements influence probability

– 0th order: probability does not depend on any previous position

– 1st order: probability depends only on immediately

previous position.

• 1st order Markov chains are good for proteins.

Hidden Markov Chain

Markov Chain for DNA

Markov chain with begin and end

• Consists of states (boxes) and transitions

(arcs) labeled with probabilities

• States have probability(s) of “emitting” an

element of a sequence (or nothing).

• Arcs have probability of moving from one

state to another.

– Sum of probabilities of all out arcs must be 1

– Self-loops (e.g. gap extend) are OK.

Markov Models: Graphical models of sequences

• Simplest example: Each state emits (or,

equivalently, recognizes) a particular

element with probability 1, and each

transition is equally likely.

Example sequences: 1234 234 14 121214 2123334

[Figure: state diagram with the states Begin, Emit 1, Emit 2, Emit 3, Emit 4 and End]

Markov Models

• Now, add probabilities to each transition (let

emission remain a single element)

• We can calculate the probability of any sequence given this

model by multiplying

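[Figure: the same state diagram with transition probabilities on the arcs: Begin to 1, 0.5; Begin to 2, 0.5; 1 to 2, 0.1; 1 to 4, 0.9; 2 to 1, 0.25; 2 to 3, 0.75; 3 to 3, 0.2; 3 to 4, 0.8; 4 to End, 1.0]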

p(1234) = 0.5 * 0.1 * 0.75 * 0.8 = 0.03

p(14) = 0.5 * 0.9 = 0.45

p(2334)= 0.5 * 0.75 * 0.2 * 0.8 = 0.06
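The three products above can be checked with a small script. The transition table is reconstructed from the worked examples and the example sequences; the value 0.25 for the arc from state 2 back to state 1 and 1.0 for the arc from state 4 to End are inferred from the requirement that outgoing probabilities sum to 1. The slide's products leave out the final factor of 1.0.

```python
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"2": 0.1, "4": 0.9},
    "2": {"1": 0.25, "3": 0.75},   # 0.25 is inferred, see the text above
    "3": {"3": 0.2, "4": 0.8},
    "4": {"End": 1.0},
}

def sequence_probability(seq):
    """Multiply the transition probabilities along Begin -> seq -> End."""
    probability = 1.0
    state = "Begin"
    for symbol in list(seq) + ["End"]:
        probability *= TRANSITIONS.get(state, {}).get(symbol, 0.0)
        if probability == 0.0:
            return 0.0          # the model cannot produce this sequence
        state = symbol
    return probability

for s in ("1234", "14", "2334"):
    print(s, sequence_probability(s))
```

Since each state emits exactly one symbol, the sequence fixes the path and no summation over paths is needed.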

Hidden Markov Models: Probabilistic Markov Models

• If we let the states define a set of emission

probabilities for elements, we can no longer be

sure which state we are in given a particular

element of a sequence

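Was BCCD emitted along states 1, 2, 3, 4 or along states 2, 3, 3, 4 ?

[Figure: the same state diagram with emission probabilities per state. State 1: A (0.8), B (0.2). State 2: B (0.7), C (0.3). State 3: C (0.6), A (0.4). State 4: C (0.1), D (0.9)]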

Hidden Markov Models: Probabilistic Emission

• Emission uncertainty means the sequence doesn't

identify a unique path. The states are “hidden”

• Probability of a sequence is sum of all paths that can

produce it:

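[Figure: state diagram with transition probabilities 0.5, 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8, 1.0 and emission probabilities. State 1: A (0.8), B (0.2). State 2: B (0.7), C (0.3). State 3: C (0.6), A (0.4). State 4: C (0.1), D (0.9)]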

p(BCCD) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * 0.9

+ 0.5 * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * 0.9

= 0.000972 + 0.013608 = 0.01458
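Summing over all paths by hand quickly becomes impractical; the forward algorithm does the same sum one symbol at a time. The sketch below uses the toy model of this slide. The assignment of emission tables to states and the 0.25 arc from state 2 to state 1 are reconstructed from the two products above, so treat them as assumptions.

```python
TRANS = {
    "Begin": {"s1": 0.5, "s2": 0.5},
    "s1": {"s2": 0.1, "s4": 0.9},
    "s2": {"s1": 0.25, "s3": 0.75},
    "s3": {"s3": 0.2, "s4": 0.8},
    "s4": {"End": 1.0},
}
EMIT = {
    "s1": {"A": 0.8, "B": 0.2},
    "s2": {"B": 0.7, "C": 0.3},
    "s3": {"C": 0.6, "A": 0.4},
    "s4": {"C": 0.1, "D": 0.9},
}

def forward_probability(seq):
    """P(seq | model), summed over every state path from Begin to End."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # forward[s] = probability of emitting the prefix so far and being in state s
    forward = {s: TRANS["Begin"].get(s, 0.0) * EMIT[s].get(seq[0], 0.0) for s in EMIT}
    for symbol in seq[1:]:
        forward = {
            s: EMIT[s].get(symbol, 0.0)
            * sum(forward[r] * TRANS[r].get(s, 0.0) for r in EMIT)
            for s in EMIT
        }
    return sum(forward[s] * TRANS[s].get("End", 0.0) for s in EMIT)

print(forward_probability("BCCD"))
```

For BCCD only two paths have non-zero probability, so the result agrees with the hand calculation up to floating-point rounding.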

Hidden Markov Models


Hidden Markov Models: The occasionally dishonest casino

• The HMM must first be “trained” using a training set

– E.g. a database of known genes.

– Consensus sequences for all signal sensors are needed.

– Compositional rules (i.e., emission probabilities) and length distributions are necessary for content sensors.

• Transition probabilities between all connected states must be estimated.

• Estimate the probability of sequence s, given model m, P(s|m)

– Multiply probabilities along the most likely path (or add logs, which gives less numerical error)
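Finding the most likely path and adding logs along it is what the Viterbi algorithm does. The sketch below applies it to the occasionally dishonest casino. The slides give no parameters for that model, so the values here (fair die F, loaded die L that rolls a six half of the time, and the switching probabilities) are assumed textbook-style numbers chosen for illustration.

```python
import math

STATES = ("F", "L")                      # fair die, loaded die
START = {"F": 0.5, "L": 0.5}             # assumed start probabilities
TRANS = {"F": {"F": 0.95, "L": 0.05},    # assumed switching probabilities
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}

def viterbi(rolls):
    """Return (most likely state path, its log probability) for a non-empty roll string."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][roll])
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    state = max(STATES, key=score.get)
    best = score[state]
    path = [state]
    for pointers in reversed(backpointers):   # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path)), best

print(viterbi("3151166666626"))
```

Working with sums of logs avoids the underflow that a long product of small probabilities would cause.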

Use of Hidden Markov Models

• HMMs are effectively profiles with gaps, and

have applications throughout Bioinformatics

• Protein sequence applications:

– MSAs and identifying distant homologs

E.g. Pfam uses HMMs to define its MSAs

– Domain definitions

– Used for fold recognition in protein structure

prediction

• Nucleotide sequence applications:

– Models of exons, genes, etc. for gene

recognition.

Applications of Hidden Markov Models

• UC Santa Cruz (David Haussler group)

– SAM-02 server. Returns alignments, secondary

structure predictions, HMM parameters, etc. etc.

– SAM HMM building program

(requires free academic license)

• Washington U. St. Louis (Sean Eddy group)

– Pfam. Large database of precomputed HMM-based

alignments of proteins

– HMMer, program for building HMMs

• Gene finders and other HMMs (more later)

Hidden Markov Models Resources

Example TMHMM

Beyond Kyte-Doolittle …

HMM in protein analysis

• http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Hidden Markov model for gene structure

• A representation of the linguistic rules for what features might follow what other features when parsing a sequence consisting of a multiple exon gene.

• A candidate gene structure is created by tracing a path from B to F.

• A hidden Markov model (or hidden semi-Markov model) is defined by attaching stochastic models to each of the arcs and nodes.

Signals (blue nodes):

• begin sequence (B)

• start translation (S)

• donor splice site (D)

• acceptor splice site (A)

• stop translation (T)

• end sequence (F)

Contents (red arcs):

• 5’ UTR (J5’)

• initial exon (EI)

• exon (E)

• intron (I)

• final exon (EF)

• single exon (ES)

• 3’ UTR (J3’)

Classic Programs for gene finding

Some of the best programs are HMM based:

• GenScan – http://genes.mit.edu/GENSCAN.html

• GeneMark – http://opal.biology.gatech.edu/GeneMark/

Other programs:

• AAT, EcoParse, Fexeh, Fgeneh, Fgenes, Finex, GeneHacker, GeneID-3, GeneParser 2, GeneScope, Genie, GenLang, Glimmer, GlimmerM, Grail II, HMMgene, Morgan, MZEF, Procrustes, SORFind, Veil, Xpound

GENSCAN (not to be confused with GeneScan, a commercial product)

• A Semi-Markov Model

– Explicit model of how long

to stay in a state (rather

than just self-loops, which

must be exponentially

decaying)

• Tracks “phase” of exon or

intron (0 coincides with codon

boundary, or 1 or 2)

• Tracks strand (and direction)

Hidden Markov Models: Gene Finding Software

Conservation of Gene Features

Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.

[Figure: aligning identity (y axis, 50% to 100%) across the sampled gene regions]

Composite Approaches

• Use EST info to constrain HMMs (Genie)

• Use protein homology info on top of HMMs

(fgenesh++, GenomeScan)

• Use cross species genomic alignments on top

of HMMs (twinscan, fgenesh2, SLAM, SGP)

Gene Prediction: more complex …

1. Species specific

2. Splicing enhancers found in coding regions

3. Trans-splicing

4. …

Length preference

5’ ss intcomp branch 3’ ss


RNA genes

Besides the 6000 protein-coding genes, there are:

140 ribosomal RNA genes

275 transfer RNA genes

40 small nuclear RNA genes

>100 small nucleolar RNA genes

?

pRNA in the φ29 rotary packaging motor (Simpson et al. Nature 408:745-750, 2000)

Cartilage-hair hypoplasia mapped to an RNA (Ridanpää et al. Cell 104:195-203, 2001)

The human Prader-Willi critical region (Cavaille et al. PNAS 97:14035-7, 2000)

RNA genes can be hard to detect

UGAGGUAGUAGGUUGUAUAGU

C. elegans let-7; 21 nt

(Pasquinelli et al. Nature 408:86-89,2000)

Often small

Sometimes multicopy and redundant

Often not polyadenylated

(not represented in ESTs)

Immune to frameshift and nonsense mutations

No open reading frame, no codon bias

Often evolving rapidly in primary sequence

miRNA genes

• Lin-4 was identified in a screen for mutations that affect timing and sequence of postembryonic development in C. elegans. Mutants reiterate L1 instead of later stages of development

• Gene positionally cloned by isolating a 693-bp DNA fragment that

can rescue the phenotype of mutant animals

• No protein found but 61-nucleotide precursor RNA with stem-loop

structure which is processed to 22-mer ncRNA

• Genetically lin-4 acts as negative regulator of lin-14 and lin-28

• The 3’ UTR of the target genes have short stretches of

complementarity to lin-4

• Deletion of these lin-4 target seq causes unregulated gof phenotype

• Lin-4 RNA inhibits accumulation of LIN-14 and LIN-28 proteins

although the target mRNA

Lin-4

Let-7 (lethal-7) was also mapped to a ncRNA gene with a 21-

nucleotide product

The small let-7 RNA is also thought to be a post-transcriptional

negative regulator for lin-41 and lin-42

100% conserved in all bilaterally symmetrical animals (not

jellyfish and sponges)

Sometimes called stRNAs, small temporal RNAs

Let-7(Pasquinelli et al. Nature 408:86-89,2000)

Two computational analysis problems

• Similarity search (eg BLAST), I give you a query, you find sequences in a database that look like the query (note: SW/Blat)

– For RNA, you want to take the secondary structure of the query into account

• Genefinding. Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence

– For RNA, with no open reading frame and no codon bias, what do you look for ?

Basic CFG

“production rules”

S -> aS

S -> Sa

S -> aSu

S -> SS

Context-free grammars

A CFG “derivation”

S -> aS

S -> aaS

S -> aaSS

S -> aagScuS

S -> aagaSucugSc

S -> aagaSaucuggScc


S -> aagacuSgaucuggcgSccc

S -> aagacuuSgaucuggcgaSccc

S -> aagacuucSgaucuggcgacSccc

S -> aagacuucgSgaucuggcgacaSccc

S -> aagacuucggaucuggcgacaccc
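The four rule types of this grammar (unpaired base on the left, unpaired base on the right, base pair around S, and two structures side by side) correspond to the four cases of the Nussinov base-pair maximisation recursion. The sketch below implements that recursion as an illustration; it is a swapped-in classical algorithm rather than the parser behind these slides, it allows Watson-Crick pairs only, and it imposes no minimum loop length.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def max_base_pairs(rna):
    """Maximum number of nested base pairs in an RNA string (Nussinov recursion)."""
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]     # best[i][j] for the subsequence i..j
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j],    # S -> aS  (i unpaired)
                        best[i][j - 1])    # S -> Sa  (j unpaired)
            if (rna[i], rna[j]) in PAIRS:  # S -> aSu (i pairs with j)
                inner = best[i + 1][j - 1] if i + 1 <= j - 1 else 0
                score = max(score, inner + 1)
            for k in range(i + 1, j):      # S -> SS  (bifurcation)
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAACCC"))              # made-up hairpin
print(max_base_pairs("aagacuucggaucuggcgacaccc"))
```

A stochastic CFG attaches probabilities to the same rules and replaces "count the pairs" with "sum or maximise probabilities", in the same way that an HMM extends a plain Markov chain.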

[Figure: RNA secondary structure corresponding to the derivation above]

The power of comparative analysis

• Comparative genome analysis is an indispensable means of inferring whether a locus produces a ncRNA as opposed to encoding a protein.

• For a small gene to be called a protein-coding gene, one excellent line of evidence is that the ORF is significantly conserved in another related species.

• It is more difficult to positively corroborate a ncRNA by comparative analysis but, in at least some cases, a ncRNA might conserve an intramolecular secondary structure and comparative analysis can show compensatory base substitutions.

• With comparative genome sequence data now accumulating in the public domain for most if not all important genetic systems, comparative analysis can (and should) become routine.

Compensatory substitutions

that maintain the structure

U U

C G

U A

A U

G C

A UCGAC 3’

G C

5’

Evolutionary conservation of RNA molecules can be revealed

by identification of compensatory substitutions


• Manual annotation of 60,770 full-length mouse complementary

DNA sequences, clustered into 33,409 ‘transcriptional units’,

contributing 90.1% of a newly established mouse transcriptome

database.

• Of these transcriptional units, 4,258 are new protein-coding and

11,665 are new non-coding messages, indicating that non-coding

RNA is a major component of the transcriptome.

Function of ncRNAs

ncRNAs & RNAi

Therapeutic Applications

• Shooting millions of tiny RNA molecules into a mouse’s bloodstream can protect its liver from the ravages of hepatitis, a new study shows. In this case, they blunt the liver’s self-destructive inflammatory response, which can be triggered by agents such as the hepatitis B or C viruses. (Harvard University immunologists Judy Lieberman and Premlata Shankar)

• In a series of experiments published online this week by Nature Medicine, Lieberman’s team gave mice injections of siRNAs designed to shut down a gene called Fas. When overactivated during an inflammatory response, it induces liver cells to self-destruct. The next day, the animals were given an antibody that sends Fas into hyperdrive. Control mice died of acute liver failure within a few days, but 82% of the siRNA-treated mice remained free of serious disease and survived. Between 80% and 90% of their liver cells had incorporated the siRNAs.
