
The Relative Synonymous Codon Usage Neural Network

Codon Usage Analysis with the Self-Organizing Map

Copyright 2002 Shaun Mahony

RescueNet is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

RescueNet is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Table of Contents

1. Introduction
2. Getting the Software
3. Generating RSCU Values
4. Training the SOM
    4.1 The Training Set
    4.2 SOM lattice size
    4.3 The Number of Epochs
    4.4 Saving the SOM
5. Analysing Data Using a Saved SOM
    5.1 Submenu Option 1: Generating Probability Scores for Genes
        5.1.1 Output file formats
        5.1.2 Random Sequence Dataset
    5.2 Submenu Option 2: Cosine Distribution Graphs
    5.3 Submenu Option 3: Cluster Analysis
    5.4 Significance of Difference Tests
6. RescueNet Command Line Arguments
7. Example A: Using RescueNet in Annotation
8. Example B: Analysis of codon usage variation
9. System Requirements & Performance Issues
10. References


1. Introduction

Most codons (with the exception of those coding for Methionine and Tryptophan) have at least one synonymous alternative. Within the protein-coding regions of most sequenced genomes, the occurrence of synonymous codons does not appear to be random. In other words, genes seem to display a clear preference for one codon over a synonymous alternative. This preference is known as the synonymous codon usage pattern of a gene, and such patterns have been extensively studied (1-8). Studies of codon usage pattern conservation and variation have long been used to generate hypotheses of evolutionary relationships, to predict the expression level of a gene, and in genome annotation. However, codon usage patterns vary not only from organism to organism, but also between genes in the same genome. Predicting how much variation in codon usage exists in a genome can be quite a difficult task.

Current methods of studying codon usage variation take the form of multivariate analysis (9-11). These methods represent codon usage information as points in multidimensional space. Codon usage trends in a genome are examined by reducing the dimensionality of the points and plotting the resulting cloud of 2-d points. However, dimensional reduction involves the loss of data, so multivariate analysis methods are only really effective at identifying broad trends in codon usage.

In order to identify more subtle codon usage patterns, we have implemented a method based on a variant of the Self-Organizing Map (SOM) neural network algorithm (12). This method has the ability to automatically recognise common or repeated patterns in a dataset. The architecture consists of a two-dimensional output “lattice” of weight vectors. During training of the SOM, the weight vectors come to represent the most common patterns in the dataset, and similar patterns are clustered together in neighbouring areas of the output lattice. The result is that we can easily visualise variation within a dataset of patterns.

Because of its unique implementation, RescueNet can be put to the following uses, which are difficult to achieve with traditional codon usage analysis methods:

• Identifying groups of genes that have similar codon usage patterns
• Identifying genes that have atypical codon usage patterns in a genome
• Identifying regions of a contiguous genomic sequence that have similar codon usage patterns to major patterns in that genome.

In this manual, Sections 7 & 8 give practical examples of using RescueNet in the above contexts.

2. Getting the Software

RescueNet is available from http://bioinf.nuigalway.ie/RescueNet. Pre-compiled binaries for a variety of systems are available from this page. Source code and compilation instructions are also available on request from the author. The software is free of charge for academic use; commercial users should contact the author.


RescueNet was released as version 0.9 in February 2003 and is currently in version 0.92 (May 2003). This is a beta version of RescueNet, so there may be bugs in the code. If you notice anything strange, please e-mail bugs / complaints / suggestions to [email protected] Remember to include an example of the input file (and output files) and the options selected that generated the error. Please also state the make of computer and the operating system it was running under. I’m always interested in hearing feedback (positive or negative) from users of the program, so feel free to email [email protected] with any comments.

3. Generating RSCU Values

In order to analyse the pattern of codon usage in a genome, a set of Relative Synonymous Codon Usage (RSCU) values is computed for each gene. The RSCU value for a codon ‘i’ is defined as follows:

$$\mathrm{RSCU}_i = \frac{\mathrm{Obs}_i}{\mathrm{Exp}_i}$$

where $\mathrm{Obs}_i$ is the observed number of occurrences of codon i and $\mathrm{Exp}_i$ is the expected number of occurrences of the same codon (based on the number of times the relevant amino acid is present in the gene and the number of synonymous alternatives to i). In order to make the data more compatible with the mathematical methods used, the $\log_{10}$ of each $\mathrm{RSCU}_i$ value is taken, so that the resulting value is positive if the codon is used more than expected in that gene, and negative if the codon is used less than expected. Taking the RSCU values for each of the codons with synonymous alternatives, each gene in the dataset is represented by a vector of 59 values.
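The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RescueNet's own code: the expected count is taken as the amino acid's total count divided by its number of synonymous codons, stop codons are excluded, and a hypothetical `pseudocount` is added to every count so that unused codons do not produce log10(0). The manual does not say how RescueNet treats zero counts.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group sense codons by amino acid; stop codons are left out (assumption).
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)

# Codons with at least one synonymous alternative: 59 in the standard code.
VARIABLE_CODONS = sorted(c for fam in FAMILIES.values() if len(fam) > 1 for c in fam)


def log_rscu(sequence, pseudocount=0.5):
    """Return log10(RSCU) for the 59 variable codons, in VARIABLE_CODONS order."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    values = {}
    for fam in FAMILIES.values():
        if len(fam) < 2:
            continue  # Met and Trp have no synonymous alternative
        obs = {c: counts[c] + pseudocount for c in fam}  # hypothetical smoothing
        expected = sum(obs.values()) / len(fam)  # equal use of all synonyms
        for c in fam:
            values[c] = math.log10(obs[c] / expected)
    return [values[c] for c in VARIABLE_CODONS]


vector = log_rscu("GCTGCTGCTGCC")  # four alanine codons, GCT preferred
```

With this toy sequence the GCT entry is positive and the unused GCA and GCG entries are negative, matching the sign convention described above.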

>MG0001 DNA polymerase III beta sub (dnaN)
ATGCGATACGACTACGAT
GACTACGATACGATAGAT
>MG0002 heat shock protein.
GCAGATACAGATACGATA
GCAATATATATCACATAT

Figure 3.1: Example of FastA format files
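For readers who want to check their input files before running RescueNet, the layout in Figure 3.1 can be read with a short Python sketch such as the one below. The function name `read_fasta` is hypothetical and the reader is deliberately minimal: it joins wrapped sequence lines and skips blank lines.

```python
def read_fasta(lines):
    """Yield (descriptor, sequence) pairs from an iterable of FastA lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # finish the previous record
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


example = """>MG0001 DNA polymerase III beta sub (dnaN)
ATGCGATACGACTACGAT
GACTACGATACGATAGAT
>MG0002 heat shock protein.
GCAGATACAGATACGATA
GCAATATATATCACATAT
""".splitlines()

records = list(read_fasta(example))
```

To read a file instead, pass an open file handle, for example `list(read_fasta(open("genes.fasta")))`, where the file name is a placeholder.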

Option 1 on the main menu allows the user to convert FastA format files into RSCU values. There is no defined limit to the size of input file that RescueNet can handle. FastA sequence data is the only acceptable form of input, and is defined as follows. On one line is the name of the gene (or organism), preceded by a ‘>’ (without the quotes). The sequence data begins on the next line. The start of the next sequence is identified by the next ‘>’. If your data is in a different format, you can use Readseq by Don Gilbert to reformat the data into FastA format. Readseq is available from the IUBIO archive: http://iubio.bio.indiana.edu/.

The standard (Universal) Genetic Code is the default code used by the RSCU value generator. However, 10 other genetic codes are also supported. To select between these codes, enter the word ‘changecode’ instead of an input filename. The user will then be presented with a list of supported genetic codes (mostly mitochondrial) and asked to choose one. This will then remain the choice of genetic code until the program is terminated or until it is changed again.

The output file from the RSCU value generator will hold (for each gene) the descriptor line taken from the FASTA file and a set of 59 RSCU values; one value for each variable codon. Within these numbers, positive values denote codons that are used more often than expected, and negative values denote codons that are used less often than expected.

4. Training a New SOM

4.1 The Training Set

Before data may be analysed using RescueNet, a Self-Organizing Map must be trained using whatever sequence data the user deems appropriate. Option 2 from the main menu will begin the training process. The user will first be prompted for the training set filename. Only FASTA format sequence data or RSCU value files are acceptable input. Any incorrect filename or incorrectly formatted data will cause the program to exit. There is no limit to the number of sequences that can be part of the training set, but the larger the training set, the longer training will take.

The actual choice of training set is very dependent on the application for which RescueNet is being used. For example, a user may wish to train the SOM using only ORFs which are over a certain length. On the other hand, a user wishing to explore codon usage trends in an annotated genome may wish only to include genes of known function, as opposed to hypothetical genes. In order to go some way towards facilitating this choice, the user is asked to choose between using:

0) All genes in the file
1) Only genes of known function (i.e. those without the words ‘putative’ or ‘hypothetical’ in their annotation)
2) Known function & putative genes (i.e. those without the word ‘hypothetical’ in their annotation)
3) All genes greater than a certain length in triplets

Note that options 1 & 2 depend on having an annotated dataset. Note also that option 3 will not find ORFs in a contiguous genome sequence. In this case, a dataset of ORFs may first be generated using NCBI’s ORF Finder (14) or a similar program. Finally, a user may manually construct a training set using whatever sequences they wish (e.g. mixing sequences from different genomes together, or only training the SOM on genes of specific function), as long as all sequences are in the same file. See the examples in Sections 7 & 8 for more on training sets.
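The four training-set choices listed above amount to a simple filter on each gene's descriptor line and length. The sketch below shows one way to express them; the function name and the case-insensitive keyword matching are assumptions, since the manual does not say whether RescueNet's matching is case-sensitive.

```python
def select_training_genes(records, mode=0, min_codons=0):
    """Filter (descriptor, sequence) pairs following the four menu choices.

    mode 0: all genes; mode 1: known function only; mode 2: known function
    and putative genes; mode 3: genes longer than min_codons triplets.
    """
    selected = []
    for descriptor, sequence in records:
        text = descriptor.lower()  # assumption: keywords match in any case
        if mode == 1 and ("putative" in text or "hypothetical" in text):
            continue
        if mode == 2 and "hypothetical" in text:
            continue
        if mode == 3 and len(sequence) // 3 <= min_codons:
            continue
        selected.append((descriptor, sequence))
    return selected


genes = [
    ("NMB0001 hypothetical protein", "ATG" * 50),
    ("NMB0002 putative kinase", "ATG" * 500),
    ("NMB0003 DNA gyrase", "ATG" * 450),
]
known_only = select_training_genes(genes, mode=1)
long_only = select_training_genes(genes, mode=3, min_codons=400)
```

The gene names and sequences in the example are made up purely to exercise the filter.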

4.2 SOM lattice size

RescueNet uses a square output lattice topology. During training, the user will be prompted for the length of one side of the lattice. Increasing the size will increase training time, so the user should be careful in the choice of lattice size. This choice will be governed by many influences, such as the problem being tackled, time constraints, and the size of the training set. For example, if RescueNet is being used as a visualisation tool for codon usage trends, the user may be tempted to choose a large lattice to give increased resolution. However, the user should be aware that a larger SOM that takes longer to train will not necessarily lead to more accurate results. Too small a lattice may miss some weak trends in codon usage, but too large a lattice may lead to an over-specialised SOM (which loses the ability to generalise about similar patterns in the data analysis phase). As a very rough estimate, I suggest that the total size of the lattice should be approximately the number of genes in the training set divided by 10 (and one side of the lattice should be the rounded-off square root of this). The user is encouraged to experiment as much as possible with lattice size.

4.3 The Number of Epochs

This is essentially the measure of how long the SOM should be trained for. This value is also one with which to experiment. Note, however, that a very short training time will lead to a SOM that is inaccurate and does not recognise subtle patterns, while a very long training time will lead to an over-specialised SOM that may also be inaccurate. A value of between 500 and 3000 is recommended for most codon usage analysis applications, but the user may want to choose a lower number if SOM training time is becoming a bottleneck in an annotation process. Training of the SOM begins after the number of epochs is entered.

4.4 Saving the SOM

Once training is complete, the SOM is automatically saved into a file with a default name beginning with the word “savedNet”. This file may be renamed and used at a later date for data analysis (see Section 5).
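To make the roles of lattice size (Section 4.2) and epochs (Section 4.3) concrete, the sketch below pairs the rule of thumb for the lattice side with a textbook SOM training loop. It is not RescueNet's own variant of the algorithm. The learning-rate and neighbourhood schedules are hypothetical, and choosing the winning node by cosine similarity is an assumption suggested by the cosine scores described in Section 5.

```python
import math
import random


def suggested_lattice_side(n_training_genes):
    """Rule of thumb from Section 4.2: about one node per 10 training genes."""
    return max(1, round(math.sqrt(n_training_genes / 10)))


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def train_som(vectors, side, epochs, seed=0):
    """Generic SOM training on a side x side lattice (illustrative only)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.uniform(-0.1, 0.1) for _ in range(dim)]
               for r in range(side) for c in range(side)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # decays from 1 towards 0
        rate = 0.5 * frac                      # hypothetical learning rate
        radius = max(1.0, (side / 2) * frac)   # hypothetical neighbourhood
        for vec in vectors:
            # Best-matching node, then pull it and its neighbours towards vec.
            winner = max(weights, key=lambda node: cosine(weights[node], vec))
            for node, w in weights.items():
                dist2 = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
                pull = rate * math.exp(-dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += pull * (vec[i] - w[i])
    return weights


side = suggested_lattice_side(1124)
```

For the 1124 known-function genes used as a training set in Section 8, the rule of thumb suggests a side of 11, close to the 10x10 lattice chosen in that example.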


5. Analysing Data Using a Saved SOM

A previously saved SOM may be loaded using option 3 under the main menu. Saved SOM files should be compatible across all supported platforms. SOMs that have been trained using a variety of organisms are available for download from the RescueNet website (http://bioinf.nuigalway.ie/RescueNet). It is possible to train a saved SOM further using the Continue Training option, but this option would result in reconfiguring a trained SOM, and it is difficult to foresee a use for this option.

Option 2 in the submenu brings the user into the data analysis part of the program. The user will be prompted for the name of the RSCU file that holds the data to be tested. The user will then be asked if this dataset is to be split (according to the keywords ‘putative’ or ‘hypothetical’ in the gene’s annotation). The user may choose to split the dataset into genes of known function vs hypothetical genes in order to explicitly see the proportion of either subset that fits in well with the codon usage patterns of the genome (see Section 8). Once this choice is made, the Data Analysis Submenu is presented.

5.1 Submenu Option 1: Generating Probability Scores for Genes

This option will use the SOM to assign a probability score to every gene in the dataset under test. This value will be the probability that the gene uses a similar codon usage pattern to one which the SOM is trained to recognise, as opposed to the likelihood that the gene uses a more random codon usage pattern.

5.1.1 Output file formats

The user is first prompted for an output filename. Even if the dataset has been separated after loading, all results will be stored in the same output file. The user is also prompted as to whether the results should be in short or long format. Short format results are unordered and consist of the first word in the gene’s annotation (usually an accession number) followed by the cosine distance score and the probability score on the same line (Fig 5.1). Fields are tab-delimited, which facilitates the use of spreadsheet programs for analysing the results in some applications (e.g. annotation).

Name        Cosine          Probability
>NMB0001    0.4783384776    0.4013087793
>NMB0002    0.5420454638    0.7178544698
>NMB0003    0.9227227458    0.9999999825
>NMB0004    0.6769230883    0.9899956067
>NMB0005    0.7230620170    0.9982761770
>NMB0006    0.6423620159    0.9697983801
>NMB0007    0.9008533104    0.9999999158
>NMB0008    0.8569382606    0.9999984299
>NMB0009    0.6585980355    0.9816218580
>NMB0010    0.9017725348    0.9999999210
>NMB0011    0.7925211812    0.9999348213
>NMB0012    0.6894497823    0.9935891912
>NMB0013    0.7137530756    0.9974760497
>NMB0014    0.6485240598    0.9748718448
>NMB0015    0.7341625551    0.9989251228
>NMB0016    0.2294858076    0.0002523884
>NMB0017    0.7973053498    0.9999494309
>NMB0018    0.7874214275    0.9999149161
>NMB0019    0.7509637378    0.9994931153
>NMB0020    0.6840849962    0.9922203286

Figure 5.1: Short format results

The difference between cosine scores and probability scores is a subtle one. While cosine scores are an absolute score given by the SOM to a gene, probability scores are very dependent on the choice of random gene dataset (Section 5.1.2). The probability scores are in fact cosine scores that have been re-scaled from the mean score received by the random genes. In some cases, random genes can score very well (e.g. if codon usage in a genome is heavily governed by the mutational bias of that genome). In this case, the probability scores are almost useless and cosine scores should be used. That said, probability scores are the best measure to take in most cases. See Section 7 for a graphic example of the difference between cosine scores and probability scores.

Long format results (Fig. 5.2) are ordered with the highest scoring gene first and give more information about the gene, including the winning node, the cosine score, a z-score (based on the random dataset), as well as the probability score. A summary of results in the dataset is given at the end.

NMB1682 topoisomerase IV subunit B (parE) (Codons: 661 )
Winning Node: 3, 6    Cosine Score: 0.966658
Z-Score: 5.541406, Probability that pattern is not random: 0.9999999850

NMB2061 phosphoenolpyruvate carboxylase (ppc) (Codons: 900 )
Winning Node: 3, 6    Cosine Score: 0.966295
Z-Score: 5.535235, Probability that pattern is not random: 0.9999999845

NMB1849 carbamoyl-phosphate synthase (carA) (Codons: 377 )
Winning Node: 3, 8    Cosine Score: 0.965068
Z-Score: 5.514403, Probability that pattern is not random: 0.9999999825

Figure 5.2: Long format results
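The three numbers reported for each gene in the long format can be related to one another with a short sketch. The manual does not give the formulas, so everything here is an assumption: the winning node is taken to be the node whose weight vector has the highest cosine similarity to the gene's RSCU vector, the z-score is measured against the cosine scores of the random genes, and the probability is the standard normal cumulative distribution at that z-score. The z-scores and probabilities printed in Figure 5.2 appear consistent with that last step.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def best_node(weights, vector):
    """Return (node, cosine score) for the best-matching lattice node."""
    node = max(weights, key=lambda n: cosine(weights[n], vector))
    return node, cosine(weights[node], vector)


def z_score(score, random_scores):
    """Distance of a cosine score from the random genes' mean, in std devs."""
    return (score - mean(random_scores)) / stdev(random_scores)


def probability_not_random(z):
    """Standard normal CDF evaluated at z (assumed re-scaling)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Toy lattice with two nodes; keys are (row, column) positions.
toy_weights = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
node, score = best_node(toy_weights, [0.9, 0.1])
p = probability_not_random(5.541406)  # z-score of NMB1682 in Figure 5.2
```

Because the z-score depends on the random dataset, the same cosine score can map to different probabilities when the random genes change, which is why the cosine score is described as the absolute measure.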

5.1.2 Random Sequence Dataset

Next, the user will be asked if they wish to change the dataset of random sequences that are used in probability score generation. If the user chooses to do so, s/he will be asked to choose between:

• a neutral nucleotide bias (no mutational bias)
• the mutational bias found in the genes in the file-under-test
• a custom defined nucleotide bias
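One way to picture the three choices is as three sources of base frequencies feeding the same random sequence generator. The sketch below is an illustration, not RescueNet's generator: it treats a bias as a set of A/C/G/T frequencies, estimates a mutational bias as the overall base composition of a set of genes, and makes no attempt to avoid in-frame stop codons. The function names and the fixed gene length are hypothetical.

```python
import random
from collections import Counter


def base_frequencies(sequences):
    """Overall A/C/G/T frequencies of the given sequences."""
    counts = Counter(b for seq in sequences for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def random_dataset(n_genes, n_codons, base_freqs=None, seed=0):
    """Generate random sequences; base_freqs=None means a neutral bias."""
    rng = random.Random(seed)
    freqs = base_freqs or {b: 0.25 for b in "ACGT"}
    weights = [freqs[b] for b in "ACGT"]
    return [
        "".join(rng.choices("ACGT", weights=weights, k=3 * n_codons))
        for _ in range(n_genes)
    ]


neutral = random_dataset(10, 300)                                # neutral bias
biased = random_dataset(10, 300, base_frequencies(["ATGAAATTT"]))  # bias from genes
custom = random_dataset(10, 300, {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3})
```

The custom frequencies in the last line are arbitrary example values.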

While the mutational bias option is the recommended one, the user should only choose this option if the current input file is the one which was used to train the SOM. If in doubt, it is safer to stick to the neutral bias option. The third option (defining your own bias) is useful if the nucleotide percentages are available for the genome under test. This information is usually readily available.

It should be noted that the probability scores will vary significantly between those generated using a neutral and a mutationally biased random dataset. Because of this, the probability scores should never be thought of as absolute scores, but more as a relative scale. No matter what random dataset is used, the order of the scores will not change (i.e. the highest scoring gene will always be the highest scoring gene), but the scores themselves will.

As for the number of genes to be generated, 1000 to 2000 is a reasonable number, and gives a good statistical spread. Results will now be generated and the user will be returned to the Data Analysis Submenu.

5.2 Submenu Option 2: Cosine Distribution Graphs

This option prints the distribution of cosine scores for a dataset to a file. If the user has split the dataset into known & hypothetical genes, then both distributions will be printed. If a dataset of random genes exists, then the distribution of their cosine scores is also printed. These graphs may then be viewed in a spreadsheet program. This option allows the user to see what proportion of each displayed dataset scores well in comparison to the scores received by the random gene dataset.

5.3 Submenu Option 3: Cluster Analysis

This option prints to a file the clusters of similar nodes found on the SOM, as well as how many genes in the dataset are recognised by each node. This option is very useful for identifying trends in codon usage in the dataset, because we can see exactly how many genes are grouped in each area of the SOM output layer. By splitting a dataset into smaller subsets (according to function or chromosomal position), a map can be built up of the codon usage patterns used by each subgroup (see Section 8).

After choosing this option, the user will first be asked if s/he wishes to change the threshold level at which node clusters are automatically generated. This option comes into use when the user wishes to explore the variation between the patterns the actual nodes have been trained to recognise. Moving this value towards 1 should show more node clusters on the output layer, because a higher threshold value raises the level at which the clustering algorithm says that two nodes are similar to each other. Conversely, lowering the value towards 0 should show fewer clusters, and the difference between each cluster will be more significant.

The user is then asked which subgroup they wish to analyse. While this may seem like a repeat of the question asked when the data file was loaded, it actually refers to which subset of the data (if the data was earlier split) should be mapped in the output file. The user is also prompted for a name for the output file at this point. The output file will also contain a list of genes (highest scoring first) that are grouped with each node on the output layer.

5.4 Significance of Difference Tests

Once clustering has been completed, the user will be asked if they wish to perform significance of difference (SoD) tests between the node clusters. There are two types of SoD tests to choose between. The first compares every node on the output layer to every other. So much data is produced in this way that this option is only of use when a user is trying to establish whether two individual nodes are significantly different from each other. The second option is more useful. Entire clusters of nodes are compared to other clusters in order to establish if they use specific codons in a significantly different manner to the other clusters. In this case, the user can choose either to let RescueNet automatically identify clusters (according to the previously mentioned threshold level) or to manually input lists of nodes as clusters. Manual input is obviously more arduous, but it can be worthwhile for users looking for differences in codon usage between groups of genes.
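The effect of the clustering threshold from Section 5.3 can be illustrated with a small flood-fill sketch. The manual does not describe the clustering algorithm, so this is a guess at one simple scheme: two nodes that are direct neighbours on the lattice are placed in the same cluster when the cosine similarity of their weight vectors is at least the threshold. All names are hypothetical.

```python
import math


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def cluster_nodes(weights, threshold=0.9):
    """Label connected groups of adjacent nodes with similar weight vectors."""
    labels = {}
    next_label = 0
    for start in weights:
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            r, c = stack.pop()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (nb in weights and nb not in labels
                        and cosine(weights[(r, c)], weights[nb]) >= threshold):
                    labels[nb] = next_label
                    stack.append(nb)
        next_label += 1
    return labels


# Toy 2x2 lattice: the top row and the bottom row hold different patterns.
toy = {(0, 0): [1.0, 0.0], (0, 1): [1.0, 0.1],
       (1, 0): [0.0, 1.0], (1, 1): [0.1, 1.0]}
n_default = len(set(cluster_nodes(toy, 0.9).values()))
n_strict = len(set(cluster_nodes(toy, 0.999).values()))
n_loose = len(set(cluster_nodes(toy, 0.0).values()))
```

On this toy lattice the default threshold of 0.9 gives two clusters, a threshold close to 1 splits every node into its own cluster, and a threshold of 0 merges everything, which mirrors the behaviour described in Section 5.3.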


6. RescueNet Command Line Arguments

Most of RescueNet’s functionality (with the exception of the Significance of Difference tests) can be performed through command line arguments. This facilitates the use of RescueNet as part of a batch job process.

Batch Mode Arguments

RescueNet v0.91 (Beta testing)
Batch Mode Arguments

Syntax: RescueNet [-h] [-t trainSet [-sep X] [-size X] [-epochs Y] [-saveas S]]
        [-rscu inFile outFile]
        [-l SOMFile -in inputFile [-sep X] [-p resFile [-b bias] [-f format]]
            [-g graphFile] [-c clustFile [-thres t][-sub s]] ]
        [-annot contigFile -som SOMFile] [-b bias {mutFile}] [-ws size]
            [-wo offset] [-GFFscore score C/P] [-GFFlength len]

-h                        Displays this help message

-t [trainSet]             Train a new SOM using genes in trainSet (FASTA or RSCU):
  Sub-options (not required):
  -sep X {y}              Separate the dataset under test?
                            0: Do not separate the datasets
                            1: Separate known genes from others
                            2: Separate known and putative genes from others
                            3: Separate genes bigger than 'y' codons from others
                            (default = separate known genes from others)
  -size X                 Size of one side of output layer (default = 10)
  -epochs Y               Number of training cycles (default = 500)
  -saveas S               Filename of saved SOM (default = savedNetXxX-Ye.som)

-rscu [inFile outFile]    RSCU Generator: generates RSCU values from the sequences
                          in inFile and saves them in outFile.

-l [SOMFile]              Load an existing SOM from SOMFile and analyse the data in inFile
  -in [inputFile]         Name of file to be analysed (required)
                          This file can be a FASTA format file or a RSCU file
  Sub-options:
  -sep                    Separate the dataset under test?
                            0: Do not separate the datasets
                            1: Separate known genes from others
                            2: Separate known and putative genes from others
                            (default = separate known genes from others)
  -p [resFile]            Generate probability scores for the genes under test.
                          The output is held in resFile.
    -b                    The bias of the random gene dataset
                            N = no bias
                            M = mutational bias found in known genes under test
                            C = Custom bias; follow with the percentage of A,C,G,T
                            X = use previous random gene file (default)
    -f                    The format of the results
                            S = short format
                            L = Long format (full results - default)
  -g [graphFile]          Generate probability distribution graphs
                          The output is held in graphFile.
  -c [clustFile]          Generate cluster maps of the SOM output layer
                          The output is held in clustFile.
    Sub-options (not required):
    -thres                Sets the threshold for clustering (default=0.9)
    -sub                  Defines subset of genes to test.
                            K = Known gene set
                            H = Hypothetical gene set
                            A = All genes
                            (Default = Real Genes)
  G-tests for clusters not available in Batch Mode

-annot [contigFilename]   Annotation Aid: analyses a contiguous genomic sequence using a SOM
  -som [SOMFilename]      The name of the saved SOM
  -b                      The bias of the random gene dataset
                            N = no bias
                            M = mutational bias from file (must follow with filename)
                            C = Custom bias; follow with the percentage of A,C,G,T
                            X = use previous random gene file (default)
  -ws                     Window size for splitting the sequence (default=200 triplets)
  -wo                     Offset of windows (default=100 triplets)
  -GFFscore [score][C/P]  Minimum score for a gene prediction. Follow the score
                          (between 0 & 1) with either 'C' or 'P'
                          C=Cosine score, P=Probability score.
                          (default = 0.8 Probability score)
  -GFFlength [len]        Minimum length of a gene prediction.
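When RescueNet is used in a batch job, the argument lists above can be assembled from a script. The Python sketch below builds command lines using only flags that appear in the help text. The helper names are hypothetical, and it assumes an executable called `RescueNet` is on the search path, which may not match a particular installation.

```python
import subprocess


def train_command(train_set, sep=None, size=None, epochs=None,
                  save_as=None, exe="RescueNet"):
    """Build the argument list for training a new SOM (-t and sub-options)."""
    cmd = [exe, "-t", train_set]
    if sep is not None:
        cmd += ["-sep"] + [str(x) for x in sep]   # e.g. (3, 400)
    if size is not None:
        cmd += ["-size", str(size)]
    if epochs is not None:
        cmd += ["-epochs", str(epochs)]
    if save_as is not None:
        cmd += ["-saveas", save_as]
    return cmd


def score_command(som_file, input_file, result_file, bias=None,
                  fmt=None, exe="RescueNet"):
    """Build the argument list for scoring genes with a saved SOM (-l, -in, -p)."""
    cmd = [exe, "-l", som_file, "-in", input_file, "-p", result_file]
    if bias is not None:
        cmd += ["-b", bias]
    if fmt is not None:
        cmd += ["-f", fmt]
    return cmd


def run(cmd):
    """Run one RescueNet command and raise if it exits with an error."""
    return subprocess.run(cmd, check=True)


training = train_command("Bsuis.samples", sep=(3, 400), size=5,
                         epochs=500, save_as="som5x5.bin")
```

Options left as `None` are simply omitted, so RescueNet's documented defaults apply. Calling `run(training)` would start the training job described in Section 7.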


7. Example A: Using RescueNet in Annotation

The following example shows how RescueNet can be used as part of an annotation process to give an indication of where coding regions may exist. In this example, we are working with the unannotated, contiguous genome sequence of B. suis (13). Note that we did not rely in any way on BLAST searches or any gene prediction software other than RescueNet.

In order to sample the prevalent codon usage patterns in the genome, we wish to train a SOM using a sample of genes from B. suis. This initial training set can come from the output of other gene prediction programs, or from a list of previously known genes in the organism, or even from a list of all ORFs existing in the genome. Within this set of genes, we are choosing to train the SOM using only those ORFs which are over 400 codons in length. These longer ORFs are more likely to be real genes, and should be representative of the codon usage patterns that exist in the coding regions of the genome. The sequences will become the training set for the SOM using the following command:

RescueNet -t Bsuis.samples -sep 3 400 -size 5 -epochs 500 -saveas som5x5.bin

Analysing this command, we see that “Bsuis.samples” is the FASTA format file that holds our training set of ORFs. The ‘-sep 3’ option tells the program only to use samples of length 400 codons or more. The SOM is set to be a 5x5 node SOM, is trained for 500 cycles, and is saved as “som5x5.bin”.

Once training is complete, we wish to analyse the contiguous genome sequence of B. suis using the SOM we have just trained. RescueNet does this by dividing the contiguous sequence up into smaller samples in all 6 reading frames, so the user must ensure that they have sufficient hard disk space for running this option (see System Requirements). Each sample is tested by the SOM and results are saved in tab-delimited files. The command used is:

RescueNet -annot B_suis_chr1 -som som5x5.bin -b M Bsuis.samples -ws 100 -wo 50

In this command, the file “B_suis_chr1” is the contiguous genome sequence of B. suis chromosome 1. We are loading the SOM (“som5x5.bin”) that we trained in the previous step. The random dataset (see Section 5.1.2) is being generated from the SOM’s training set using the ‘-b M’ option. The contiguous sequence will be divided into samples of length 100 triplets (‘-ws’) and offset by 50 triplets from one another (‘-wo’).

The program proceeds by producing 6 sequence files (one for each reading frame) and converting the samples into RSCU values. Note that these files are not automatically removed after the program exits. Each of the 6 sample files is then tested by the SOM and short format results (see Section 5.1.1) are generated and saved into 6 separate files (each ending in “res.txt”). As usual, these short format results are tab-delimited, and so are best opened in a spreadsheet program.

Section 5.1.1 described the difference between Cosine and Probability scores. If we plot both scores over a short segment of the file, we can visualise the difference between the two scales. Figure 7.1 shows both scores plotted for the first 10,000bp in the first forward reading frame. You can see that the peaks are the same for both types of scores, but because probability scores are re-scaled, it is much easier for a user to see high-scoring genomic areas using probability scores alone. So in the case of our B. suis analysis, the probability scores are the most effective measure to use.

We can combine the probability score columns from the 6 results files within a spreadsheet program, and graph sections of the data. A sample 25,000bp region is shown in Figure 7.2. The published gene predictions from TIGR are shown for comparison. As may be seen from the sample, most high scoring regions correspond well to gene-coding regions. In all, approx. 90% of known-function B. suis genes are predicted correctly using our method. However, some short genes, and many ORFs annotated as ‘hypothetical proteins’, are not. This is understandable, given that there are many reasons for atypical codon usage.
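The splitting step controlled by ‘-ws’ and ‘-wo’ can be pictured with the sketch below, which cuts a contiguous sequence into overlapping samples in all six reading frames. It is an illustration only. Numbering the forward frames 1 to 3 and the reverse-complement frames 4 to 6 is an assumption, start positions are reported on the strand being read, and a trailing partial window is simply dropped, which may differ from what RescueNet does.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def frame_windows(seq, window=200, offset=100):
    """Yield (frame, start, sample) for samples of `window` triplets.

    Consecutive samples in a frame start `offset` triplets apart.
    """
    seq = seq.upper()
    size, step = 3 * window, 3 * offset
    strands = (seq, reverse_complement(seq))
    for strand_index, strand in enumerate(strands):
        for shift in range(3):
            frame = 3 * strand_index + shift + 1  # 1-3 forward, 4-6 reverse
            for start in range(shift, len(strand) - size + 1, step):
                yield frame, start, strand[start:start + size]


# A made-up 120 bp sequence cut into samples of 10 triplets, offset by 5.
samples = list(frame_windows("ACGT" * 30, window=10, offset=5))
```

With the settings used in the example command (`window=100, offset=50`), each sample covers 300 bp and overlaps its neighbour in the same frame by 150 bp.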

Figure 7.1: The difference between Cosine and Probability Scores

Figure 7.2: Gene prediction in the B. suis genome (region 450Kbp-475Kbp) using a SOM trained on ORFs of over 1200bp in length. Probability scores for reading frames 4, 5 & 6 were multiplied by –1 for clarity. The yellow bars above and below the graph are regions that TIGR has annotated as being protein-coding regions.

Once all the samples are processed, some simple post-processing is carried out. Naturally, all same-frame concurrent predictions are merged. Predictions that are totally overlapped by another prediction are deleted if they are less than 75% the length of the other. Similarly, any prediction in which more than half its length is overlapped is deleted if it is less than half as long as the other prediction, or less than 90% as long as the other and receiving a lower score. Finally, any prediction that is overlapped on both ends to a total of at least 70% overlap is also deleted.
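The first three post-processing rules can be expressed compactly in code. The sketch below is one reading of them, not RescueNet's implementation. “Concurrent” is interpreted as touching or overlapping, a merged prediction keeps the higher of the two scores, coordinates are half-open, and the final rule (overlap on both ends totalling at least 70%) is left out for brevity. All of these are assumptions.

```python
from collections import namedtuple

Prediction = namedtuple("Prediction", "frame start end score")


def length(p):
    return p.end - p.start


def overlap(a, b):
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def merge_same_frame(preds):
    """Merge predictions in the same frame that touch or overlap."""
    merged = []
    for p in sorted(preds, key=lambda p: (p.frame, p.start)):
        last = merged[-1] if merged else None
        if last and last.frame == p.frame and p.start <= last.end:
            merged[-1] = Prediction(p.frame, last.start, max(last.end, p.end),
                                    max(last.score, p.score))
        else:
            merged.append(p)
    return merged


def is_deleted(p, other):
    """Apply the pairwise overlap rules to p against one other prediction."""
    shared = overlap(p, other)
    if shared == length(p) and length(p) < 0.75 * length(other):
        return True   # totally overlapped and under 75% of the other's length
    if shared > 0.5 * length(p):
        if length(p) < 0.5 * length(other):
            return True   # mostly overlapped and under half as long
        if length(p) < 0.9 * length(other) and p.score < other.score:
            return True   # mostly overlapped, shorter and lower scoring
    return False


def prune(preds):
    merged = merge_same_frame(preds)
    return [p for p in merged
            if not any(is_deleted(p, q) for q in merged if q is not p)]


kept = prune([
    Prediction(1, 0, 300, 0.90), Prediction(1, 250, 600, 0.95),
    Prediction(2, 100, 200, 0.85), Prediction(4, 500, 1000, 0.99),
])
```

In this made-up example the two frame 1 predictions merge into one spanning 0 to 600, the short frame 2 prediction is removed because it lies entirely inside that merged prediction, and the frame 4 prediction survives because the overlap is small.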


While these rules aim to delete smaller erroneous predictions, it is recognised that the loose rules leave room for many other overlaps. However, it was found that in many overlapping cases it was difficult to decide which prediction to delete. Therefore, the best solution is to leave both predictions rather than mislead an annotator by giving one possibly erroneous prediction.

Note that from version 0.91 onwards, a General Feature Format (GFF) file is also generated in the annotation process. This file holds predictions of where genes lie in the contiguous sequence and is based on areas that hold scores higher than a certain threshold. The GFF file is saved with the extension “.gff” and can be viewed in sequence viewers such as Artemis.

8. Example B: Analysis of codon usage variation

In this example, we wish to study the causes of codon usage variation in the genome of Neisseria meningitidis, and in doing so we will also identify which genes have a codon usage pattern which is atypical of the genome. Unlike the previous example, this example uses RescueNet’s interactive text-based menus, which can be run simply by calling RescueNet from the command line.

The first step is again to train a new SOM. We do this by choosing option 2 from the main menu. We have chosen to analyse the list of predicted genes downloaded from TIGR (15). Within this dataset, we then choose to train the SOM using only genes of known function. This leaves 1124 genes in the training set from 2156 genes in the dataset. As shown in fig. 8.1, we then choose to train a 10x10 node SOM for 3000 epochs. Once training is complete, the SOM will be saved in a file named “savedNet10x10-3000e.som”. This is a default name and may be changed.

We now find ourselves back at the main menu. This time, we choose Option 3 (Load a Saved SOM) and enter the name of our newly trained SOM. Next, we choose the option to “Analyse Data using SOM”, upon which we are asked to give the name of the file which contains the sequences we wish to analyse.

Figure 8.1: RescueNet interactive mode: Training

The first time we do this, we wish to identify the N. meningitidis genes that have an unusual codon usage pattern. Therefore we shall tell RescueNet to analyse the genes in GNM.seq, but we will split the dataset into genes of known function and hypothetical genes (see fig 8.2).


We are now presented with the Data Analysis Submenu. Using option 1 in this menu, we can generate probability scores for each gene in the datasets, and thereby identify the genes that display an atypical codon usage pattern. When prompted to change the random genes, we choose to generate random genes that use the same mutational bias as the genes in the dataset. Long format results are chosen. The end of the results file is shown in fig. 8.3. From the results summary, we see that quite a high proportion of the hypothetical genes (247 of 1032, about 24%) receive scores below 0.1 from the SOM, and these genes (such as NMB1757 and NMB0819 shown in the figure) can be easily identified from the results file.
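The proportion of low-scoring genes can be checked directly from the summary block at the end of the long format results file. The following sketch parses the bin lines reproduced in fig. 8.3 and reports the fraction of hypothetical genes in the lowest bin. The regular expression assumes the line layout shown in that figure, and the function name is hypothetical.

```python
import re

# Summary lines copied from the end of the results file (fig. 8.3).
SUMMARY = """\
0.0 -> 0.1: 247 Genes
0.1 -> 0.2: 42 Genes
0.2 -> 0.3: 27 Genes
0.3 -> 0.4: 27 Genes
0.4 -> 0.5: 43 Genes
0.5 -> 0.6: 25 Genes
0.6 -> 0.7: 26 Genes
0.7 -> 0.8: 23 Genes
0.8 -> 0.9: 56 Genes
0.9 -> 1.0: 516 Genes
"""


def parse_bins(text):
    """Return (lower, upper, gene_count) tuples for each score bin."""
    pattern = re.compile(r"(\d\.\d)\s*->\s*(\d\.\d):\s*(\d+)\s+Genes")
    return [(float(lo), float(hi), int(n)) for lo, hi, n in pattern.findall(text)]


bins = parse_bins(SUMMARY)
total = sum(n for _, _, n in bins)
low = sum(n for _, hi, n in bins if hi <= 0.1)
print(f"{low} of {total} genes score below 0.1 ({low / total:.1%})")
```

The total of 1032 agrees with the 2156 genes in the dataset minus the 1124 genes of known function used for training.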

Figure 8.2: Data Analysis

    NMB1757 hypothetical protein (Codons: 76 )
    Winning Node: 0, 2 Cosine Score: 0.221433
    Z-Score: -7.111150, Probability that pattern is not random: 0.0000000000
    NMB0819 hypothetical protein (Codons: 130 )
    Winning Node: 0, 3 Cosine Score: 0.174694
    Z-Score: -7.904694, Probability that pattern is not random: 0.0000000000

    *********** Overal Stats for Hypothetical Genes***********
    0.0 -> 0.1: 247 Genes
    0.1 -> 0.2: 42 Genes
    0.2 -> 0.3: 27 Genes
    0.3 -> 0.4: 27 Genes
    0.4 -> 0.5: 43 Genes
    0.5 -> 0.6: 25 Genes
    0.6 -> 0.7: 26 Genes
    0.7 -> 0.8: 23 Genes
    0.8 -> 0.9: 56 Genes
    0.9 -> 1.0: 516 Genes

Figure 8.3: Long format results file

The next analysis we wish to carry out is to visualise the amount of codon usage variation within the N. meningitidis genome. We should at this stage find ourselves back at the Data Analysis Submenu. Choosing option 3 here brings the user into the Cluster Analysis function (see Section 5.3). The user is first asked if they want to change the threshold value. At this stage, we do not wish to, and so enter ‘n’. We now choose to analyse only the genes of known function in the dataset. We do this in order to map the various groups of codon usage patterns that exist in the genes of known function. After entering an output file name, the program maps the genes and, since we do not wish to perform Significance of Difference tests between the groups, the Cluster Analysis function is now complete.

Opening the Cluster Analysis output file, we see two separate 10x10 number maps (fig 8.4). The first of these maps shows the number of genes that group with each SOM node. Displaying this map using a spreadsheet program or MATLAB gives a representation like figure 8.5i. This gives the user an initial indication of the number of distinct clusters of codon usage that occur in the dataset. The Significance of Difference between these clusters may of course be tested at any time.

    Number of times each node responds to genes in current dataset:
    28 11 12 22  6  5 13  6 16 10
     9  9  7 10  7 12 20 12  8 11
    10  9 12  9  7  7 11  6  7  3
    10  9  9 13 11 18 19 10 16 14
    10 11  5 17 12 12 14 13 11 13
     9 13 15 13 10  8 16 20 11 13
    11 17 17 15  9  8  7  8 13 12
     9  8 10 18 26  8  9 11  9 12
    10  8 13 11  8  4  9 12 10  7
    10 11  8 14 10  8 13 11  6 15

    Clusters that appear on the output layer:
     0  1  2  3  4  5  6  7  6  6
     8  9 10 11 12 13 14 14  6  6
    15 16 17 18 19 20 20 14 14 21
    22 22 22 23 20 20 20 20 14 20
    24 22 25 20 20 20 20 20 20 26
    27 22 22 20 20 20 20 20 20 20
    22 22 22 20 20 20 20 20 20 20
    28 29 30 20 20 20 20 20 20 20
    31 20 20 20 20 32 20 20 20 33
    34 35 20 20 20 20 20 36 37 38

Figure 8.4: Cluster Analysis results

The second map in the output file represents similarities found between the nodes themselves using the chosen threshold value. Each number represents a pattern that is sufficiently different from the others at that threshold, so nodes sharing a number belong to the same group. Experimenting with the threshold value builds up a picture of which nodes are similar to other nodes on the output layer (figure 8.5ii).

The whole process can be repeated using subsets of the dataset in order to test whether these subsets use similar codon usage patterns. In this way, we can discover the sources of codon usage variation within a genome. The choice of subset is up to the user. In our case, we chose to test separate functional groups in N. meningitidis. Lists of the genes in each subset were downloaded from TIGR (16). Using the cluster analysis function on the various subsets, we showed that, to a certain extent, different functional groups have different codon usage patterns in N. meningitidis. Some examples are shown in figure 8.5.
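The two number maps in the Cluster Analysis output (fig 8.4) can also be summarised programmatically rather than in a spreadsheet. The sketch below is one possible way to do this: it reads both maps as 10x10 grids and counts, for each cluster label in the second map, how many nodes carry that label and how many genes respond to those nodes. The numbers are copied from fig 8.4; the function and variable names are hypothetical and are not part of RescueNet.

```python
from collections import Counter

# First map in fig 8.4: genes responding to each node.
GENE_COUNTS = """\
28 11 12 22 6 5 13 6 16 10
9 9 7 10 7 12 20 12 8 11
10 9 12 9 7 7 11 6 7 3
10 9 9 13 11 18 19 10 16 14
10 11 5 17 12 12 14 13 11 13
9 13 15 13 10 8 16 20 11 13
11 17 17 15 9 8 7 8 13 12
9 8 10 18 26 8 9 11 9 12
10 8 13 11 8 4 9 12 10 7
10 11 8 14 10 8 13 11 6 15
"""

# Second map in fig 8.4: cluster label assigned to each node.
CLUSTER_LABELS = """\
0 1 2 3 4 5 6 7 6 6
8 9 10 11 12 13 14 14 6 6
15 16 17 18 19 20 20 14 14 21
22 22 22 23 20 20 20 20 14 20
24 22 25 20 20 20 20 20 20 26
27 22 22 20 20 20 20 20 20 20
22 22 22 20 20 20 20 20 20 20
28 29 30 20 20 20 20 20 20 20
31 20 20 20 20 32 20 20 20 33
34 35 20 20 20 20 20 36 37 38
"""


def parse_map(text):
    """Turn whitespace-separated rows of integers into a list of lists."""
    return [[int(v) for v in row.split()] for row in text.strip().splitlines()]


counts = parse_map(GENE_COUNTS)
labels = parse_map(CLUSTER_LABELS)

nodes_per_cluster = Counter()
genes_per_cluster = Counter()
for count_row, label_row in zip(counts, labels):
    for n_genes, label in zip(count_row, label_row):
        nodes_per_cluster[label] += 1
        genes_per_cluster[label] += n_genes

# Report the largest groups of similar nodes first.
for label, n_nodes in nodes_per_cluster.most_common(4):
    print(f"cluster {label}: {n_nodes} nodes, {genes_per_cluster[label]} genes")
```

Re-running such a summary after each change of threshold gives a quick view of how the groups merge or split, which complements the visual maps in figure 8.5.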

Figure 8.5: Cluster Analysis in N. meningitidis
(i) Nodes on the output layer responding to all known genes. Legend shows number of times each node responds to the dataset.
(ii) Groups of similar weight vectors on the output layer. The symbol ‘~’ denotes outlying nodes that are dissimilar to their neighbours.
(iii) Distribution of the energy metabolism functional group genes.
(iv) Distribution of the protein synthesis functional group genes.
(v) Distribution of the cellular processes functional group genes.
(vi) Distribution of the high scoring hypothetical genes.


9. System Requirements & Performance Issues

There are very few system requirements for RescueNet, although a faster computer will of course reduce running time. RescueNet normally operates using less than 2 MB of RAM. However, if the user chooses to generate long format (ordered) results in the probability scores step (see Section 5.1), memory requirements will be higher. To order the results, RescueNet requires approximately 650 bytes for every gene in the file under test (i.e. about 650 KB of free RAM for analysing 1000 genes). This only becomes a substantial figure when dealing with extremely large datasets, and even then should not cause any memory issues for the typical computer.

The user should also be aware of free disk space requirements. This is especially an issue when using RescueNet in the annotation process. To find the required amount of disk space for an annotation run, the following formula may be used:

    (window size * 3) * (genome size in base-pairs) / (window offset * 3)

This gives the number of bytes needed for one reading frame file. The figure should be multiplied by 6 to give the figure for all reading frames, and then multiplied by 4 to include the requirements of the RSCU and observed/expected files needed by the program. This will be a fairly substantial figure, but the actual results files only take up a small fraction of this space. I suggest deleting all files with the extensions “.seq”, “.rscu” and “.obsexp” after the program exits. These files are not deleted automatically by RescueNet in case the user wishes to inspect any of the intermediate sliding window sequences.
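As a worked example, the disk space formula and the 650 bytes-per-gene memory figure can be written as small helper functions. This is only a sketch: the function names are hypothetical, the window size and offset are assumed to be given in codons (which the factor of 3 suggests), and the settings used in the example call are illustrative values rather than RescueNet defaults.

```python
def frame_file_bytes(window_codons, offset_codons, genome_bp):
    """Bytes needed for the sliding-window file of one reading frame."""
    # (window size * 3) * (genome size in base-pairs) / (window offset * 3)
    return (window_codons * 3) * genome_bp / (offset_codons * 3)


def annotation_disk_bytes(window_codons, offset_codons, genome_bp):
    """Approximate disk space for a whole annotation run."""
    per_frame = frame_file_bytes(window_codons, offset_codons, genome_bp)
    # x6 for all reading frames, x4 to cover the RSCU and
    # observed/expected files as well.
    return per_frame * 6 * 4


def ordering_memory_bytes(n_genes, bytes_per_gene=650):
    """Approximate RAM needed to order long format results."""
    return n_genes * bytes_per_gene


# Illustrative settings only: 100-codon window, 10-codon offset, 2.5 Mbp genome.
disk = annotation_disk_bytes(window_codons=100, offset_codons=10, genome_bp=2_500_000)
print(f"disk: about {disk / 1e6:.0f} MB")
print(f"memory for 1000 genes: about {ordering_memory_bytes(1000) / 1e3:.0f} KB")
```

Note that the two factors of 3 cancel, so the estimate depends only on the ratio of window size to window offset and on the genome size.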

Running times vary not only from machine to machine, but also with program settings such as training time, SOM size and dataset size. Some applications, such as integrating RescueNet into an annotation workbench, may be time-critical. As a guideline, such users should be aware that an entire annotation run (including SOM training) on a 2.5 Mbp genome using typical program settings takes about 5 minutes on a Pentium 4 PC.

10. References

1. Grantham, R., Gautier, C., Gouy, M., Mercier, R. & Pave, A. Codon catalog usage and the genome hypothesis. Nucleic Acids Res 8, r49-r62 (1980).

2. Ikemura, T. Correlation between the Abundance of Escherichia coli Transfer RNAs and the Occurrence of the Respective Codons in its Protein Genes: A Proposal for a Synonymous Codon Choice that is Optimal for the E. coli Translational System. J. Mol. Biol. 151, 389-409 (1981).

3. Gouy, M. & Gautier, C. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res 10, 7055-74 (1982).

4. Sharp, P. M., Averof, M., Lloyd, A. T., Matassi, G. & Peden, J. F. DNA sequence evolution: the sounds of silence. Philos Trans R Soc Lond B Biol Sci 349, 241-7 (1995).

5. Shields, D. C., Sharp, P. M., Higgins, D. G. & Wright, F. "Silent" sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol Biol Evol 5, 704-16 (1988).

6. Andersson, S. G. & Kurland, C. G. Codon preferences in free-living microorganisms. Microbiol Rev 54, 198-210 (1990).

7. Kanaya, S., Yamada, Y., Kudo, Y. & Ikemura, T. Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238, 143-55 (1999).

8. Duret, L. Evolution of synonymous codon usage in metazoans. Curr Opin Genet Dev 12, 640-9 (2002).

9. Peden, J. F. (1997). http://www.molbiol.ox.ac.uk/cu/

10. McInerney, J. O. GCUA: General Codon Usage Analysis. Bioinformatics 14, 372-373 (1998).

11. Greenacre, M. J. Theory and Applications of Correspondence Analysis (Academic Press, London, 1984).

12. Kohonen, T. Self-Organizing Maps (Springer-Verlag, Berlin, 1995).

13. Paulsen, I. T. et al. The Brucella suis genome reveals fundamental similarities between animal and plant pathogens and symbionts. Proc Natl Acad Sci U S A 99, 13148-53 (2002).

14. ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html

15. TIGR N. meningitidis page: ftp://ftp.tigr.org/pub/data/n_meningitidis/

16. TIGR Gene Attribute page: http://www.tigr.org/tigr-scripts/CMR2/gene_attribute_form.dbi
