finding a common motif of rna sequences using genetic programming the gernamo system shahar michal,...
Post on 21-Dec-2015
215 views
TRANSCRIPT
Finding a Common Motif of RNA Sequences
Using Genetic Programming
The GeRNAMo System
Shahar Michal, Tor Ivry, Omer S. Cohen, Moshe Sipper, and Danny Barash
GeRNAMo
Genetic programming of RNA Motifs. An evolutionary computation system that
searches for a motif which is common to a given data set of RNA sequences.
Geronimo
Goyalkla, or "One Who Yawns," is known to most non-Indians by his Spanish nickname, Geronimo, the greatest Apache warrior of his time.
When he was a young man, Mexican soldiers had murdered his wife and children during a brutal attack on his village in Chihuahua, Mexico. Though Geronimo later remarried and fathered other children, the scars of that early tragedy left him with an abiding hatred for Mexicans.
Operating in the border region around Mexico's Sierra Madre and southern Arizona and New Mexico, Geronimo and his band of 50 Apache warriors succeeded in keeping white settlers off Apache lands for decades. Geronimo never learned to use a gun, yet he armed his men with the best modern rifles he could obtain and used field glasses to aid reconnaissance during his campaigns. He was a brilliant strategist who used the Apache knowledge of the arid desert environment to his advantage, and for years Geronimo and his men successfully evaded two of the US Army's most talented Indian fighters, General George Crook and General Nelson A. Miles.
The Biology
DNA – Contains all of our genetic
information.– This information is coded to
proteins.– Double strand.
The Biology
RNA– Involved in the protein
production process.– Single strand.– Folds onto itself.– New studies show that RNA
is involved in many regulatory processes.
Protein Synthesis
DNA RNA Protein
3 types of RNA molecules involved:
• mRNA – carries the genetic information.
• rRNA – enables protein production.
• tRNA – transforms codons to amino acids.
RNA Secondary Structure Representation
The secondary structure describes the general three-dimensional form of local segments of RNA molecules.
It is formally defined by the hydrogen bonds of the biopolymer, as observed in an atomic-resolution structure.
The secondary structure of RNA molecules can be predicted computationally by calculating the minimum free energies (MFE) structure for all different combinations of hydrogen bonding and domains.
RNAsubopt
RNAsubopt predicts all sub-optimal secondary structures of a given sequence.
RNAsubopt’s output includes a dot-bracket notation and free energy value for each of the structures.
Structural Motif of RNA
A two-dimensional structural element or fold within the chain, which appears also in a variety of other molecules.
Bulge
Loop
Hairpin
Motif
h5 – a complementary segment from 5’ direction (opening a stem).
h3 – a complementary segment from 3’ direction (closing a stem).
ss – a non-complementary segment (x:y) – length constraints of an element (x-
min, y-max).
Motivation
For many RNA molecules, the secondary structure is highly important to the correct function of the RNA — often more so than the actual sequence.
This structure is also more conserved through evolution than the sequence.
The importance of the secondary structure presents a need for tools that rely on searching for common structures rather than searching for a common alignment of sequences.
Motivation
Finding common motifs can:
• Help us classify RNA families.• Find new members of a known family.• Provide information on RNA functionality.
Creating Subsequences - Segments
Because we are interested in a motif and not in a global alignment, we break down every sequence into segments of variable lengths and start indexes.
AUCGAUGCUACGUAUAGC
Folding the segments
A query is sent to RNAsubopt which is
capable of predicting all sub-optimal secondary structures for every subsequence.
Each sub-optimal solution is then converted to a more condensed string.
((((((((….)))))))) h5(8) ss(4) h3(8)
GP
You probably know GP.
Representation. Function and terminal sets. Genetic operators. Fitness function.
GP Simulation
We used ECJ, which is a freely available evolutionary computation and genetic programming system written in Java.
It is highly flexible and nearly all classes and settings can be changed by the user.
We used it to run the evolutionary simulation, and
added classes to support our representation, fitness pre-processing and main algorithm.
Representation (2)
h5 ss h5 ss h3 ss h3 ss h5 ss h3 h5(4:5) ss(3:3) h5(3:4) ss(6:7) h3(3:4) ss(0:2)
h3(4:5) ss(2:5) h5(2:4) ss(4:4) h3(2:4) Matching h5 and h3 always have the same
length constraints (because they are matching), so we don’t have to represent them specifically.
However, we should remember the hierarchy.
Representation (3)
The ss elements cannot contain sub-elements
ss tree can only have a “next” son.
Enclosed Next
Representation (4)
Function set:– h5– ss– Length, minimum and maximum– Numerical values
Terminal set:– Ending elements– Numerical values
Representation (7)
Note that numerical values are considered functions as well as terminals, since an integer number is represented as the sum of values in the binary tree of small numbers (0-3).
This was done to increase sensitivity and diversity (for example, a mutation of such a value will change a sub-tree and not the whole number).
Fitness
The fitness function assigns a score to an individual based on how good it solves the problem.
Our goal is to find the motif that is the most common and preferably has the most elements.
At the end of the evolutionary run the individual with the highest fitness encountered throughout the run is returned as the result.
Fitness
The motif has to:– Be common to as many sequences as possible.– Comply with the system parameters (mainly the
maximum number of helixes allowed).– Be energetically favorable.– Not be very simple (e.g., single strand only)
Finding an occurrence
Finding a motif usually requires aligning it with the sequence in a manner that results in the best score and includes spaces and gaps.
Finding an occurrence
We use a different approach. Recall that we broke down every sequence into
segments of variable lengths. Given a motif, we take a subset of segments in
the range of the motif minimum and maximum lengths.
We then iteratively go through this subset to find segments that fold under the motif constraints.
Finding an occurrence
When a motif is being compared to the segmentthere is no need for scanning or aligning—we simply compare the corresponding elements of the motifand the compressed string representing the segment.
When elements do not match, we move on to thenext sub-optimal structure of the segment or to the next segment of the sequence, until a match is foundor until there are no more segments.
Finding an occurrence
Motif : ss(0:5) h5(4:6) ss(2:6) h3(4:6) ss(1:2)
Sequence i: AAAGUUCCAGGAAGUGACUUGCUGCGACAGUGCUCGUGUAG
( ( ( ( . . ( . . . . . ) . . ) ) ) )
h5(4) ss(2) h5(1) ss(5) h3(1) ss(2) h3(4) ss(1)
Finding an occurrence
Motif : ss(0:5) h5(4:6) ss(2:6) h3(4:6) ss(1:2)
Sequence i: AAAGUUCCAGGAAGUGACUUGCUGCGACAGUGCUCGUGUAG
. . . . . ( ( ( ( . . . . . . ) ) ) ) .
ss(5) h5(4) ss(6) h3(4) ss(1)
Fitness: Main Score
The index of the
sequence.
Total number of sequences in the data set.
Length of segment checked (the l’th
suboptimal solution of a segment in position
k and length j of sequence i.( Number of
elements in a motif.
Number of complementary segments in a
motifReflects the generality of
the motif.
The most meaningful
occurrence of the motif in sequence i
motif
Fitness: Major Score
We define the most meaningful occurrence
as the longest subsequence of sequence i that folds according to the motif constraints, with the most negative free energy.
Special cases, like a motif that is only a single-strand or a very general motif, are given a penalty.
Other occurrences are less important (we must have at least one). This score is counted in “minor”.
Fitness: Generality Score
The user can specify the level of generality, and according to the level chosen, a different penalty function is calculated.
There are four levels: 1) maximum generality: no penalty is given.2) moderate generality the penalty is calculated based on the differences between the motif’s range and the lengths of the sequences in which it occurs:
2.1: by calculating the minimum and maximum lengths over all of the occurrences.
2.2: by calculating the average length over all of the occurrences.
4) High specificity: the penalty is calculated based on the lengths of the elements in the motif.
Fitness
All non-major occurrences
of the motif in seq. i
Length of the longest consecutive segment of
the motif that is contained in sequence
i.
Total number of segments of all lengths and all
indexes of a sequence.
Fitness: Supplements Score
Supplements score is important in the early stages of the evolutionary run, where it is unlikely that randomly generated motifs will be common to all of the sequences in the data set.
It prevents the situation where most of the individuals in the population receive zero fitness, thereby hampering evolution.
GPRM
A genetic programming-based method that predicts a common motif while folding the input sequences.
Uses a naïve (dynamic programming) approach to fold the sequences.
Ferritin and IRE
Ferritin is an iron-binding protein that prevents toxic levels of ionized iron (Fe2+) from building up in cells.
Ferritin regulation occurs through the action of an iron response element binding protein (IRBP) that binds to specific iron response elements (IREs) in the 5'-UTR of the ferritin mRNA. These IREs form hair-pin loop structures that are recognized by IRBP.
Ferritin and IRE
When iron levels are low, IRBP is free of iron and can therefore, interact with the IREs in the 5'-UTR of the Ferritin receptor mRNA, bind to it and prevent ferritin translation.
When iron levels are high, IRBP cannot bind to the IRE in the 5'-UTR of the ferritin mRNA. This allows the ferritin mRNA to be translated and reduce toxic iron levels.
IRE
This data set was used in RNAProfile in the discovery test. It was built by retrieving the full mRNA sequences of human and mouse ferritin.
In addition, the mRNA sequences of three human ferritin pseudogenes were added to the data set
The sequence lengths ranged between 800-2000 nt.
IRE
The best motif predicted by GeRNAMo is: ss(0:1) h5(3:4) ss(0:1) h5(4:5) ss(3:6) h3(4:5) h3(3:4) ss(0:1)
GeRNAMo did not find occurrences of the predicted motif in the pseudogenes because of the specificity of the motif.
This is a reasonable result because a functional IRE motif may not occur in these particular pseudogene sequences.
IRE- GPRM
Because GPRM only accepts sequences with length smaller than 1000 nt., we shortened the original input at the end of each sequence at a position far beyond the occurrence of the motif (each sequence still contained the motif), and removed the 3 pseudogenes.
Therefore, the input given to GPRM was less complicated (shorter, no noise from the pseudogenes).
SRP – Signal Recognition Particle
A protein-RNA complex that recognizes and transports specific proteins to the endoplasmic reticulum in eukaryotes and the plasma membrane in prokaryotes. The eukaryotic SRP is composed of six distinct polypeptides bound to an RNA molecule.
SRP
This data set was used in experiment 5 in the work of Fogel et al. It contains five full-length sequences for 4.5S/7S rRNA
The length of the sequences ranged between 250 and 700 nt.
SRP
The problem addressed by Fogel et al. is different than ours.
Fogel et al. use a pre-designed motif (called descriptor) to screen a data base of sequences for the existence of the given motif.
This method does not perform any prediction, whereas GeRNAMo predicts the common motif based on the sequences in the data set.
SRP
To test our approach we used GeRNAMo to discover the common motif in the data set, and compared it to the descriptor that was used by Fogel et al..
The goal was to use GeRNAMo to predict the descriptor used by Fogel et al.—a pre-designed motif—without any prior knowledge.
SRP
The descriptor that was used in Fogel et al.: h5(3:3) ss(3:5) h5(3:3) ss(4:6) h3(3:3) ss(3:5) h3(3:3)
The best motif predicted by GeRNAMo is:h5(3:4) ss(4:5) h5(3:4) ss(4:5) h3(3:4) ss(4:5) h3(3:4)
Micro RNA
A single-stranded RNA molecules of about 21-23 nucleotides thought to regulate the expression of other genes.
miRNAs are encoded by RNA genes that are transcribed from DNA but not translated into protein. Instead they are processed from primary transcripts known as pri-miRNA to short stem-loop structures called pre-miRNA and finally to functional miRNA.
Micro RNA (2)
Mature miRNA molecules are complementary to regions in one or more messenger RNA (mRNA) molecules, which they target for degradation.
microRNA
The data set contains 25 microRNA precursor sequences from the human genome.
Sanger microRNA database was searched for humanmicroRNA precursors. The results were reduced to 25 precursors containing 4 double-stranded elements.
Each sequence in the data set is the microRNA surrounded by its flanking genomic sequence.
Although containing the same number of complementary segments, the range of the elements’ lengths is very large, a fact that leads to a general common motif.
This was done in order to make it difficult for GeRNAMo to find a common motif.
microRNA
The motif of the data set is:ss(0:4) h5(2:15) ss(0:4) h5(1:24) ss(0:3) h5(1:23) ss(0:3) h5(2:11) ss(3:14) h3(2:11) ss(0:9) h3(1:23)ss(0:3) h3(1:24) ss(0:5) h3(2:15) ss(0:5).
The best motif predicted by GeRNAMo is:ss(0:16) h5(1:15) ss(0:5) h5(2:17) ss(1:3) h5(3:23) ss(0:16) h5(3:11) ss(3:14) h3(3:11) ss(0:6) h3(3:23) ss(0:6) h3(2:17) ss(0:6) h3(1:15) ss(0:20).
microRNA
Out of the 25 sequences in the data set, this motif occurs in 20 sequences at the correct positions. The
reason that the motif does not occur in all of the sequences is that this data set contains sequences with a very wide range of lengths, and GeRNAMo tries to reduce the generality of the motif (general individuals are penalized by the fitness function).
As a consequence, it misses some marginal sequences with extreme length values.
microRNA
When we reduced the penalty for the motif’s generality, GeRNAMo predicted a more general motif that occurred correctly in all 25 sequences.
In the reduced-penalty case, GeRNAMo predicted the correct motif:
ss(0:15) h5(1:21) ss(0:15) h5(2:25) ss(0:13) h5(0:25) ss(0:16) h5(0:13) ss(2:24) h3(0:13) ss(0:22) h3(2:25) ss(0:15) h3(2:25) ss(0:10) h3(1:21) ss(0:20).
This motif retains the same secondary structure as the more specific motif, but it varies in the length ranges of its elements, thus allowing it to occur at the correct position in all of the sequences in the data set.
microRNA - GPRM
We submitted this data set to GPRM, trying several parameter combinations to achieve good performance.
Two of the motifs that were generated by GPRM were:
h5(3:20) ss(2:12) h5(9:20) ss(0:14) h3(3:20) ss(0:15) h5(3:7) ss(0:13) h5(3:6) ss(0:13) h3(9:20)ss(0:7) h3(3:7) ss(0:6) h3(3:6).
h5(6:14) ss(0:19) h3(6:14) ss(0:15) h5(16:35)ss(0:20) h5(6:13) ss(0:16) h3(16:35) ss(0:5)h5(5:14) ss(6:15) h3(6:13) ss(0:2) h3(5:14).
microRNA - GPRM
These motifs represent a pseudoknot secondary structure, which is a false positive.
These results show that GPRM fails to detect a common motif in the microRNA data set.
microRNA - RNAProfile
In addition, we submitted this data set to RNAProfile. We used several sets of parameters which included
the default parameters and length parameters that comply with the microRNA motif length (maximum 100 nt.).
RNAProfile was not successful at building a profile of the microRNA motif, and the profiles it returned were of much smaller length.
Supporting Pseudoknots
Our representation does not support pseudoknots, because in the GP tree, a sub-motif that starts with “h5” must end with a matching “h3”, thereby restricting the number of possible combinations that allow the creation of pseudoknots motifs.
In order to allow more combinations in the tree, the GP tree representation should be modified.
Supporting Pseudoknots
The new representation is simpler. An individual is a tree, made of these 3 basic trees:
– “h5”– “h3”– “ss”
Each basic tree can have 2 children - sub motifs. The conversion from the tree to a motif is simply
an inorder traversal of the tree.
Supporting Pseudoknots
The new representation does not have the limitations of the previous representation and allows more freedom and versatility in the GP tree.
However, it requires more legality checks (that were unnecessary in the original representation).
Using the Output as Input
In our current implementation, the initial population in the evolutionary run is random.
We suggest that this population will be based on previous simulations.
The idea is to run the simulation for several times with different sets of parameters in parallel. The output of these runs will be common motifs. The next stage will then be to use these motifs as a basis for a new population of common motifs. In such a setting, the evolution will try to evolve the best motif (the common motif) based on already common motifs, with the idea that this evolved motif can become even better and more common.