Post on 21-Dec-2015


Finding a Common Motif of RNA Sequences

Using Genetic Programming

The GeRNAMo System

Shahar Michal, Tor Ivry, Omer S. Cohen, Moshe Sipper, and Danny Barash

GeRNAMo

Genetic programming of RNA Motifs. An evolutionary computation system that searches for a motif which is common to a given data set of RNA sequences.

Geronimo

Goyalkla, or "One Who Yawns," is known to most non-Indians by his Spanish nickname, Geronimo, the greatest Apache warrior of his time.

When he was a young man, Mexican soldiers had murdered his wife and children during a brutal attack on his village in Chihuahua, Mexico. Though Geronimo later remarried and fathered other children, the scars of that early tragedy left him with an abiding hatred for Mexicans.

Operating in the border region around Mexico's Sierra Madre and southern Arizona and New Mexico, Geronimo and his band of 50 Apache warriors succeeded in keeping white settlers off Apache lands for decades. Geronimo never learned to use a gun, yet he armed his men with the best modern rifles he could obtain and used field glasses to aid reconnaissance during his campaigns. He was a brilliant strategist who used the Apache knowledge of the arid desert environment to his advantage, and for years Geronimo and his men successfully evaded two of the US Army's most talented Indian fighters, General George Crook and General Nelson A. Miles.

Layout

• The Biology
• Some definitions
• Motivation
• The system
• Results

The Biology

DNA
– Contains all of our genetic information.
– This information is coded to proteins.
– Double strand.

The Biology

RNA
– Involved in the protein production process.
– Single strand.
– Folds onto itself.
– New studies show that RNA is involved in many regulatory processes.

Protein Synthesis

DNA → RNA → Protein

3 types of RNA molecules involved:

• mRNA – carries the genetic information.

• rRNA – enables protein production.

• tRNA – transforms codons to amino acids.

RNA – New Roles Revealed

Layout

• The Biology
• Some definitions
• Motivation
• The system
• Results

RNA Secondary Structure Representation

The secondary structure describes the general three-dimensional form of local segments of RNA molecules.

It is formally defined by the hydrogen bonds of the biopolymer, as observed in an atomic-resolution structure.

The secondary structure of RNA molecules can be predicted computationally by calculating the minimum free energy (MFE) structure over all different combinations of hydrogen bonding and domains.

RNAsubopt

RNAsubopt predicts all sub-optimal secondary structures of a given sequence that lie within a chosen energy range above the minimum free energy.

RNAsubopt’s output includes a dot-bracket notation and free energy value for each of the structures.

Structural Motif of RNA

A two-dimensional structural element or fold within the chain, which appears also in a variety of other molecules.

Bulge

Loop

Hairpin

Motif

h5 – a complementary segment from 5’ direction (opening a stem).

h3 – a complementary segment from 3’ direction (closing a stem).

ss – a non-complementary segment.

(x:y) – length constraints of an element (x-min, y-max).

Structural Motif of RNA

h5(2:7) ss(2:2) h5(4:8) ss(6:8) h3(4:8) ss(1:3) h3(2:7)
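The notation above is easy to handle programmatically. The following is a minimal sketch, not the GeRNAMo implementation: the names `Element` and `parse_motif` are hypothetical, and the only assumption about the format is that elements are whitespace-separated tokens of the form `kind(min:max)`.

```python
import re
from typing import List, NamedTuple


class Element(NamedTuple):
    kind: str  # "h5", "h3" or "ss"
    lo: int    # minimum length (x in the slide notation)
    hi: int    # maximum length (y in the slide notation)


# One token such as "h5(2:7)"; the pattern is an assumption based on the slides.
_TOKEN = re.compile(r"(h5|h3|ss)\((\d+):(\d+)\)")


def parse_motif(text: str) -> List[Element]:
    """Parse a motif string into a list of typed elements."""
    elements = []
    for token in text.split():
        m = _TOKEN.fullmatch(token)
        if m is None:
            raise ValueError(f"not a motif element: {token!r}")
        lo, hi = int(m.group(2)), int(m.group(3))
        if lo > hi:
            raise ValueError(f"min exceeds max in {token!r}")
        elements.append(Element(m.group(1), lo, hi))
    return elements


motif = parse_motif("h5(2:7) ss(2:2) h5(4:8) ss(6:8) h3(4:8) ss(1:3) h3(2:7)")
print([e.kind for e in motif])
```

Keeping the elements as typed tuples rather than raw strings makes later steps (length bounds, comparison against folded segments) simple list operations.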

Layout

• The Biology
• Some definitions
• Motivation
• The system
• Results

Motivation

For many RNA molecules, the secondary structure is highly important to the correct function of the RNA — often more so than the actual sequence.

This structure is also more conserved through evolution than the sequence.

The importance of the secondary structure presents a need for tools that rely on searching for common structures rather than searching for a common alignment of sequences.

Motivation

Finding common motifs can:

• Help us classify RNA families.
• Find new members of a known family.
• Provide information on RNA functionality.

Layout

• The Biology
• Some definitions
• Motivation
• The system
• Results

System Outline

System Parameters

Input

A data set containing RNA sequences that are suspected to share a common motif

For example:

Creating Subsequences - Segments

Because we are interested in a motif and not in a global alignment, we break down every sequence into segments of variable lengths and start indexes.

AUCGAUGCUACGUAUAGC
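A minimal sketch of this segmentation step, assuming every start index and every length between two bounds is enumerated. The function name `segments` and the parameters `min_len` and `max_len` are hypothetical; the actual system may restrict lengths or step sizes differently.

```python
from typing import Iterator, Tuple


def segments(seq: str, min_len: int, max_len: int) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, length, subsequence) for every window of an allowed length."""
    for length in range(min_len, min(max_len, len(seq)) + 1):
        for start in range(len(seq) - length + 1):
            yield start, length, seq[start:start + length]


seq = "AUCGAUGCUACGUAUAGC"
windows = list(segments(seq, 16, 18))
for start, length, sub in windows:
    print(start, length, sub)
```

For the 18-nt example sequence and lengths 16 to 18 this produces six segments (three of length 16, two of length 17, one of length 18). The number of segments grows quickly with the sequence length, which is why the folding step is done once, as pre-processing.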

Folding the segments

A query is sent to RNAsubopt, which is capable of predicting all sub-optimal secondary structures for every subsequence.

Each sub-optimal solution is then converted to a more condensed string.

((((((((....)))))))) → h5(8) ss(4) h3(8)

Input: a 4D array of structures, indexed by sequence, segment position, segment length, and sub-optimal solution.

System Outline

GP

You probably know GP.

Representation. Function and terminal sets. Genetic operators. Fitness function.

GP Simulation

We used ECJ, which is a freely available evolutionary computation and genetic programming system written in Java.

It is highly flexible and nearly all classes and settings can be changed by the user.

We used it to run the evolutionary simulation, and added classes to support our representation, fitness pre-processing and main algorithm.

Representation

h5 ss h5 ss h3 ss h3 ss h5 ss h3

Representation (2)

h5 ss h5 ss h3 ss h3 ss h5 ss h3

h5(4:5) ss(3:3) h5(3:4) ss(6:7) h3(3:4) ss(0:2) h3(4:5) ss(2:5) h5(2:4) ss(4:4) h3(2:4)

Matching h5 and h3 always have the same length constraints (because they are matching), so we don’t have to represent them specifically.

However, we should remember the hierarchy.

Representation (3)

The ss elements cannot contain sub-elements

ss tree can only have a “next” son.

Enclosed Next

Representation (4)

Function set:
– h5
– ss
– Length, minimum and maximum
– Numerical values

Terminal set:
– Ending elements
– Numerical values

Representation (5) - basic trees

The tree is made up of two types of basic trees:

Representation (6) - Example

h5(3:6) ss(5:9) h3(3:6) ss(1:3)
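The two basic trees and their conversion to a motif string can be sketched as follows. The class names `Helix` and `SingleStrand` and the function `to_motif` are hypothetical, and the way length constraints are stored (two plain integers per node) is a simplification: in the system described here they are themselves sub-trees of numerical values. The sketch relies only on what the slides state: a helix node has an "enclosed" child and a "next" child, an ss node has only a "next" child, and the matching h3 is implied by its h5.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class SingleStrand:
    lo: int
    hi: int
    next: Optional["Node"] = None  # an ss tree can only have a "next" child


@dataclass
class Helix:
    lo: int
    hi: int
    enclosed: Optional["Node"] = None  # sub-motif between h5 and its matching h3
    next: Optional["Node"] = None      # sub-motif that follows the closing h3


Node = Union[SingleStrand, Helix]


def to_motif(node: Optional[Node]) -> List[str]:
    """Flatten a tree into motif elements; each h3 is emitted from its h5 node."""
    if node is None:
        return []
    if isinstance(node, SingleStrand):
        return [f"ss({node.lo}:{node.hi})"] + to_motif(node.next)
    return (
        [f"h5({node.lo}:{node.hi})"]
        + to_motif(node.enclosed)
        + [f"h3({node.lo}:{node.hi})"]
        + to_motif(node.next)
    )


tree = Helix(3, 6, enclosed=SingleStrand(5, 9), next=SingleStrand(1, 3))
print(" ".join(to_motif(tree)))
```

Because the h3 element is generated from the same node as its h5, every motif produced this way is properly nested, which is also why this representation cannot express pseudoknots.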

Representation (7)

Note that numerical values are considered functions as well as terminals, since an integer is represented as the sum of the values in a binary tree of small numbers (0-3).

This was done to increase sensitivity and diversity (for example, a mutation of such a value will change a sub-tree and not the whole number).
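To illustrate the idea, an integer can be modelled as a nested pair structure whose leaves are small numbers. This is only a sketch: the tuple encoding and the function `value` are hypothetical and stand in for the GP sub-trees used by the system.

```python
def value(tree) -> int:
    """Evaluate a sum tree: a leaf is an int in 0..3, an inner node is a (left, right) pair."""
    if isinstance(tree, int):
        if not 0 <= tree <= 3:
            raise ValueError("leaves must be between 0 and 3")
        return tree
    left, right = tree
    return value(left) + value(right)


seven = ((3, 1), (2, 1))
# A mutation that replaces a single leaf (2 -> 0) shifts the total by a small
# amount instead of replacing the whole number.
mutated = ((3, 1), (0, 1))

print(value(seven), value(mutated))
```

Here the original tree evaluates to 7 and the mutated one to 5, so a local change in the tree produces a nearby value.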

GP Parameters

Fitness

The fitness function assigns a score to an individual based on how well it solves the problem.

Our goal is to find the motif that is the most common and preferably has the most elements.

At the end of the evolutionary run the individual with the highest fitness encountered throughout the run is returned as the result.

Fitness

The motif has to:
– Be common to as many sequences as possible.
– Comply with the system parameters (mainly the maximum number of helixes allowed).
– Be energetically favorable.
– Not be very simple (e.g., single strand only).

Finding an occurrence

Finding a motif usually requires aligning it with the sequence in a manner that results in the best score and includes spaces and gaps.

Finding an occurrence

We use a different approach.

Recall that we broke down every sequence into segments of variable lengths.

Given a motif, we take a subset of segments in the range of the motif minimum and maximum lengths.

We then iteratively go through this subset to find segments that fold under the motif constraints.
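The length-based pre-filter can be sketched as below, assuming that the minimum and maximum lengths of a motif are simply the sums of its elements' minimum and maximum lengths. The function names and the tuple layout `(kind, min, max)` are hypothetical.

```python
from typing import List, Tuple

MotifElement = Tuple[str, int, int]  # (kind, min length, max length)


def length_range(motif: List[MotifElement]) -> Tuple[int, int]:
    """Shortest and longest segment that could possibly fold into the motif."""
    return sum(lo for _, lo, _ in motif), sum(hi for _, _, hi in motif)


def candidate_segments(motif: List[MotifElement], segments: List[str]) -> List[str]:
    """Keep only the segments whose length lies within the motif's range."""
    lo, hi = length_range(motif)
    return [seg for seg in segments if lo <= len(seg) <= hi]


# ss(0:5) h5(4:6) ss(2:6) h3(4:6) ss(1:2)
motif = [("ss", 0, 5), ("h5", 4, 6), ("ss", 2, 6), ("h3", 4, 6), ("ss", 1, 2)]
print(length_range(motif))
```

For this motif the range is 11 to 25 nucleotides, so only segments of those lengths need to be examined.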

Finding an occurrence

When a motif is being compared to the segment there is no need for scanning or aligning; we simply compare the corresponding elements of the motif and the compressed string representing the segment.

When elements do not match, we move on to the next sub-optimal structure of the segment or to the next segment of the sequence, until a match is found or until there are no more segments.
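The element-wise comparison can be sketched as follows. This is an illustration, not the authors' code: the function `matches` is hypothetical, and treating a motif element with minimum length 0 as optional (it may be absent from the folded segment) is an assumption about how zero-length elements are handled.

```python
from typing import List, Tuple

MotifElement = Tuple[str, int, int]  # (kind, min length, max length)
StructElement = Tuple[str, int]      # (kind, length) from the condensed string


def matches(motif: List[MotifElement], structure: List[StructElement]) -> bool:
    """True if the condensed structure satisfies every motif constraint in order."""

    def go(m: int, s: int) -> bool:
        if m == len(motif):
            return s == len(structure)
        kind, lo, hi = motif[m]
        if s < len(structure):
            s_kind, length = structure[s]
            if s_kind == kind and lo <= length <= hi and go(m + 1, s + 1):
                return True
        # Assumption: an element whose minimum length is 0 may be skipped.
        return lo == 0 and go(m + 1, s)

    return go(0, 0)


# Motif: ss(0:5) h5(4:6) ss(2:6) h3(4:6) ss(1:2)
motif = [("ss", 0, 5), ("h5", 4, 6), ("ss", 2, 6), ("h3", 4, 6), ("ss", 1, 2)]

# h5(4) ss(2) h5(1) ss(5) h3(1) ss(2) h3(4) ss(1)
first = [("h5", 4), ("ss", 2), ("h5", 1), ("ss", 5), ("h3", 1), ("ss", 2), ("h3", 4), ("ss", 1)]
# ss(5) h5(4) ss(6) h3(4) ss(1)
second = [("ss", 5), ("h5", 4), ("ss", 6), ("h3", 4), ("ss", 1)]

print(matches(motif, first), matches(motif, second))
```

The first structure is rejected because it contains an extra inner helix, while the second satisfies every constraint. In the search loop, a rejection simply means moving on to the next sub-optimal structure or the next segment.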

Finding an occurrence

Motif : ss(0:5) h5(4:6) ss(2:6) h3(4:6) ss(1:2)

Sequence i: AAAGUUCCAGGAAGUGACUUGCUGCGACAGUGCUCGUGUAG

((((..(.....)..)))).

h5(4) ss(2) h5(1) ss(5) h3(1) ss(2) h3(4) ss(1)

Finding an occurrence

Motif : ss(0:5) h5(4:6) ss(2:6) h3(4:6) ss(1:2)

Sequence i: AAAGUUCCAGGAAGUGACUUGCUGCGACAGUGCUCGUGUAG

.....((((......)))).

ss(5) h5(4) ss(6) h3(4) ss(1)

Fitness

Fitness: Main Score

[Equation: main score. Annotated terms:]

– The index of the sequence.
– The total number of sequences in the data set.
– The length of the segment checked (the l’th suboptimal solution of a segment in position k and length j of sequence i).
– The number of elements in a motif.
– The number of complementary segments in a motif.
– A term that reflects the generality of the motif.
– The most meaningful occurrence of the motif in sequence i.

Fitness: Major Score

We define the most meaningful occurrence

as the longest subsequence of sequence i that folds according to the motif constraints, with the most negative free energy.

Special cases, like a motif that is only a single-strand or a very general motif, are given a penalty.

Other occurrences are less important (we must have at least one). This score is counted in “minor”.

Fitness: Generality Score

The user can specify the level of generality, and according to the level chosen, a different penalty function is calculated.

There are four levels:

1) Maximum generality: no penalty is given.

2) Moderate generality: the penalty is calculated based on the differences between the motif’s range and the lengths of the sequences in which it occurs:

2.1: by calculating the minimum and maximum lengths over all of the occurrences.

2.2: by calculating the average length over all of the occurrences.

4) High specificity: the penalty is calculated based on the lengths of the elements in the motif.

Fitness

[Equation: minor and supplements scores. Annotated terms:]

– All non-major occurrences of the motif in sequence i.
– The length of the longest consecutive segment of the motif that is contained in sequence i.
– The total number of segments of all lengths and all indexes of a sequence.

Fitness: Supplements Score

Supplements score is important in the early stages of the evolutionary run, where it is unlikely that randomly generated motifs will be common to all of the sequences in the data set.

It prevents the situation where most of the individuals in the population receive zero fitness, thereby hampering evolution.

Layout

• The Biology
• Some definitions
• Motivation
• The system
• Results

Results

IRE SRP Micro RNA

GPRM

A genetic programming-based method that predicts a common motif while folding the input sequences.

Uses a naïve (dynamic programming) approach to fold the sequences.

Ferritin and IRE

Ferritin is an iron-binding protein that prevents toxic levels of ionized iron (Fe2+) from building up in cells.

Ferritin regulation occurs through the action of an iron response element binding protein (IRBP) that binds to specific iron response elements (IREs) in the 5'-UTR of the ferritin mRNA. These IREs form hairpin loop structures that are recognized by IRBP.

Ferritin and IRE

When iron levels are low, IRBP is free of iron and can therefore interact with the IREs in the 5'-UTR of the ferritin mRNA, bind to them, and prevent ferritin translation.

When iron levels are high, IRBP cannot bind to the IRE in the 5'-UTR of the ferritin mRNA. This allows the ferritin mRNA to be translated, and the resulting ferritin reduces toxic iron levels.

Ferritin and IRE

[Figure: regulation of ferritin expression. Labels: DNA, mRNA, Ferritin, IRBP, Ferritin IRE, Fe2+]

IRE

This data set was used in RNAProfile in the discovery test. It was built by retrieving the full mRNA sequences of human and mouse ferritin.

In addition, the mRNA sequences of three human ferritin pseudogenes were added to the data set.

The sequence lengths ranged from 800 to 2000 nt.

IRE

The best motif predicted by GeRNAMo is: ss(0:1) h5(3:4) ss(0:1) h5(4:5) ss(3:6) h3(4:5) h3(3:4) ss(0:1)

GeRNAMo did not find occurrences of the predicted motif in the pseudogenes because of the specificity of the motif.

This is a reasonable result because a functional IRE motif may not occur in these particular pseudogene sequences.

IRE

IRE- GPRM

Because GPRM only accepts sequences shorter than 1000 nt, we truncated each sequence at a position far beyond the occurrence of the motif (each sequence still contained the motif) and removed the 3 pseudogenes.

Therefore, the input given to GPRM was less complicated (shorter, no noise from the pseudogenes).

IRE - GPRM

SRP – Signal Recognition Particle

A protein-RNA complex that recognizes and transports specific proteins to the endoplasmic reticulum in eukaryotes and the plasma membrane in prokaryotes. The eukaryotic SRP is composed of six distinct polypeptides bound to an RNA molecule.

SRP

This data set was used in experiment 5 in the work of Fogel et al. It contains five full-length sequences for 4.5S/7S RNA, the RNA component of the SRP.

The length of the sequences ranged between 250 and 700 nt.

SRP

The problem addressed by Fogel et al. is different from ours.

Fogel et al. use a pre-designed motif (called a descriptor) to screen a database of sequences for the existence of the given motif.

This method does not perform any prediction, whereas GeRNAMo predicts the common motif based on the sequences in the data set.

SRP

To test our approach we used GeRNAMo to discover the common motif in the data set, and compared it to the descriptor that was used by Fogel et al.

The goal was to use GeRNAMo to predict the descriptor used by Fogel et al.—a pre-designed motif—without any prior knowledge.

SRP

The descriptor that was used in Fogel et al.: h5(3:3) ss(3:5) h5(3:3) ss(4:6) h3(3:3) ss(3:5) h3(3:3)

The best motif predicted by GeRNAMo is: h5(3:4) ss(4:5) h5(3:4) ss(4:5) h3(3:4) ss(4:5) h3(3:4)

SRP – Fogel and GeRNAMo

SRP – GPRM results

Micro RNA

Single-stranded RNA molecules of about 21-23 nucleotides, thought to regulate the expression of other genes.

miRNAs are encoded by RNA genes that are transcribed from DNA but not translated into protein. Instead they are processed from primary transcripts known as pri-miRNA to short stem-loop structures called pre-miRNA and finally to functional miRNA.

Micro RNA (2)

Mature miRNA molecules are complementary to regions in one or more messenger RNA (mRNA) molecules, which they target for degradation.

microRNA

The data set contains 25 microRNA precursor sequences from the human genome.

The Sanger microRNA database was searched for human microRNA precursors. The results were reduced to 25 precursors containing 4 double-stranded elements.

Each sequence in the data set is the microRNA surrounded by its flanking genomic sequence.

Although containing the same number of complementary segments, the range of the elements’ lengths is very large, a fact that leads to a general common motif.

This was done in order to make it difficult for GeRNAMo to find a common motif.

microRNA

microRNA

The motif of the data set is: ss(0:4) h5(2:15) ss(0:4) h5(1:24) ss(0:3) h5(1:23) ss(0:3) h5(2:11) ss(3:14) h3(2:11) ss(0:9) h3(1:23) ss(0:3) h3(1:24) ss(0:5) h3(2:15) ss(0:5).

The best motif predicted by GeRNAMo is: ss(0:16) h5(1:15) ss(0:5) h5(2:17) ss(1:3) h5(3:23) ss(0:16) h5(3:11) ss(3:14) h3(3:11) ss(0:6) h3(3:23) ss(0:6) h3(2:17) ss(0:6) h3(1:15) ss(0:20).

microRNA

Out of the 25 sequences in the data set, this motif occurs in 20 sequences at the correct positions. The reason that the motif does not occur in all of the sequences is that this data set contains sequences with a very wide range of lengths, and GeRNAMo tries to reduce the generality of the motif (general individuals are penalized by the fitness function).

As a consequence, it misses some marginal sequences with extreme length values.

microRNA

When we reduced the penalty for the motif’s generality, GeRNAMo predicted a more general motif that occurred correctly in all 25 sequences.

In the reduced-penalty case, GeRNAMo predicted the correct motif:

ss(0:15) h5(1:21) ss(0:15) h5(2:25) ss(0:13) h5(0:25) ss(0:16) h5(0:13) ss(2:24) h3(0:13) ss(0:22) h3(2:25) ss(0:15) h3(2:25) ss(0:10) h3(1:21) ss(0:20).

This motif retains the same secondary structure as the more specific motif, but it varies in the length ranges of its elements, thus allowing it to occur at the correct position in all of the sequences in the data set.

microRNA - GPRM

We submitted this data set to GPRM, trying several parameter combinations to achieve good performance.

Two of the motifs that were generated by GPRM were:

h5(3:20) ss(2:12) h5(9:20) ss(0:14) h3(3:20) ss(0:15) h5(3:7) ss(0:13) h5(3:6) ss(0:13) h3(9:20) ss(0:7) h3(3:7) ss(0:6) h3(3:6).

h5(6:14) ss(0:19) h3(6:14) ss(0:15) h5(16:35) ss(0:20) h5(6:13) ss(0:16) h3(16:35) ss(0:5) h5(5:14) ss(6:15) h3(6:13) ss(0:2) h3(5:14).

microRNA - GPRM

These motifs represent a pseudoknot secondary structure, which is a false positive.

These results show that GPRM fails to detect a common motif in the microRNA data set.

microRNA - RNAProfile

In addition, we submitted this data set to RNAProfile. We used several sets of parameters which included the default parameters and length parameters that comply with the microRNA motif length (maximum 100 nt.).

RNAProfile was not successful at building a profile of the microRNA motif, and the profiles it returned were of much smaller length.

Future work

• Supporting pseudoknots
• Using the Output as Input
• Co-evolution

Supporting Pseudoknots

Our representation does not support pseudoknots, because in the GP tree, a sub-motif that starts with “h5” must end with a matching “h3”, thereby excluding the combinations of elements that would form pseudoknot motifs.

In order to allow more combinations in the tree, the GP tree representation should be modified.

Supporting Pseudoknots

The new representation is simpler. An individual is a tree, made of these 3 basic trees:

– “h5”
– “h3”
– “ss”

Each basic tree can have 2 children (sub-motifs).

The conversion from the tree to a motif is simply an inorder traversal of the tree.
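A minimal sketch of this conversion is given below. The class `Node` and the particular tree shape are hypothetical; the tree is just one possible arrangement whose inorder traversal yields the pseudoknot motif h5(5:7) ss(1:2) h5(4:5) h3(5:7) ss(1:1) h3(4:5). How length constraints are attached to nodes in the proposed representation is not specified, so each node here simply carries its element as a label.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    label: str                      # e.g. "h5(5:7)", "h3(4:5)" or "ss(1:2)"
    left: Optional["Node"] = None   # sub-motif placed before this element
    right: Optional["Node"] = None  # sub-motif placed after this element


def inorder(node: Optional[Node]) -> List[str]:
    """Left subtree, then the node itself, then the right subtree."""
    if node is None:
        return []
    return inorder(node.left) + [node.label] + inorder(node.right)


tree = Node(
    "h5(4:5)",
    left=Node("ss(1:2)", left=Node("h5(5:7)")),
    right=Node("ss(1:1)", left=Node("h3(5:7)"), right=Node("h3(4:5)")),
)
print(" ".join(inorder(tree)))
```

Since h5 and h3 are now independent nodes, nothing in the tree guarantees that they pair up, so a validity check on the traversal result is needed before a motif is evaluated.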

Supporting Pseudoknots

Supporting Pseudoknots

h5(5:7) ss(1:2) h5(4:5) h3(5:7) ss(1:1) h3(4:5)

Supporting Pseudoknots

The new representation does not have the limitations of the previous representation and allows more freedom and versatility in the GP tree.

However, it requires more legality checks (that were unnecessary in the original representation).

Using the Output as Input

In our current implementation, the initial population in the evolutionary run is random.

We suggest that this population be based on the results of previous simulations.

The idea is to run the simulation several times in parallel with different sets of parameters. The output of these runs will be common motifs. The next stage will then be to use these motifs as a basis for a new population of common motifs. In such a setting, the evolution will try to evolve the best motif (the common motif) based on already common motifs, with the idea that this evolved motif can become even better and more common.