
Oligonucleotide Probe Design for Large Genomes

using Multiple Spaced Seeds

BY

HAMID MOHAMADI

a thesis

submitted to the department of computing and software

and the school of graduate studies

of mcmaster university

in partial fulfilment of the requirements

for the degree of

Master of Science

© Copyright by Hamid Mohamadi, March 2012

All Rights Reserved

Master of Science (2012) McMaster University

(Computer Science) Hamilton, Ontario, Canada

TITLE: Oligonucleotide Probe Design for Large Genomes using

Multiple Spaced Seeds

AUTHOR: Hamid Mohamadi

SUPERVISORS: Dr. William F. Smyth

Dr. G. Brian Golding

Dr. Lucian Ilie

NUMBER OF PAGES: xii, 95

To my parents

Abstract

An oligonucleotide is a short fragment of DNA or RNA that is designed to hybridize with a unique region of a target sequence. Oligonucleotides have a wide range of applications in molecular biology and medicine. In medicine, they are used as probes to screen for diseases and viral infections; in molecular biology, they are used in DNA microarray design, polymerase chain reaction (PCR) amplification, and gene identification.

The major computational challenge in designing oligonucleotide probes is finding the optimal probe for each target sequence. Each probe must be specific to its target sequence and sensitive enough to detect it, and the set of oligonucleotides must behave uniformly under the same experimental conditions. Many algorithms and software programs have been developed for this problem; however, none solves it satisfactorily.

We introduce a new method for oligonucleotide design that employs sensitive multiple spaced seeds, and show that our algorithm computes unique and more efficient oligonucleotides while executing orders of magnitude faster than other algorithms proposed for the same task.

Acknowledgements

I would like to thank my supervisors, Dr. Smyth, Dr. Golding, and Dr. Ilie for their

guidance and support throughout my thesis.

I am also very grateful to Shima Khoshraftar and Anahita Mansouri for their kind

help.

Contents

Abstract iv

Acknowledgements v

1 Introduction 1

2 Preliminaries 6

2.1 Molecular biology primer . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.1 Organisms and cells . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.2 DNA, RNA, and protein . . . . . . . . . . . . . . . . . . . . . 8

2.1.3 Genome, chromosome, and gene . . . . . . . . . . . . . . . . . 12

2.1.4 Oligonucleotides . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1.5 Thermodynamics of DNA . . . . . . . . . . . . . . . . . . . . 17

2.2 Related computer science notation . . . . . . . . . . . . . . . . . . . 19

2.2.1 Sequence alignments . . . . . . . . . . . . . . . . . . . . . . . 19

2.2.2 Seeds for homology search . . . . . . . . . . . . . . . . . . . . 25

2.2.3 Suffix tree and suffix array . . . . . . . . . . . . . . . . . . . . 27

3 Related work 29

3.1 ArrayOligoSelector . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2 GoArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3 OligoArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4 OligoPicker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.5 OligoWiz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.6 PICKY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.7 ProbeSel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.8 ProbeSelect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.9 ProDesign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.10 ProMide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.11 ROSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.12 YODA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4 Our proposed algorithm 45

4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 General description of the problem . . . . . . . . . . . . . . . . . . . 48

4.3 The outline of our algorithm . . . . . . . . . . . . . . . . . . . . . . . 49

4.4 Encoding the input sequences . . . . . . . . . . . . . . . . . . . . . . 50

4.5 Cross-hybridization assessment . . . . . . . . . . . . . . . . . . . . . 51

4.5.1 Multiple spaced seeds for homology search . . . . . . . . . . . 53

4.5.2 Overlap complexity . . . . . . . . . . . . . . . . . . . . . . . . 54

4.5.3 Fast homology search . . . . . . . . . . . . . . . . . . . . . . . 57

4.5.4 Intensive homology search . . . . . . . . . . . . . . . . . . . . 59

4.6 GC-content evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.7 Melting temperature management . . . . . . . . . . . . . . . . . . . . 62

4.8 Secondary structure assessment . . . . . . . . . . . . . . . . . . . . . 64

5 Experimental results and evaluation 67

5.1 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.2 Results from other algorithms . . . . . . . . . . . . . . . . . . . . . . 71

5.3 Evaluation and comparison . . . . . . . . . . . . . . . . . . . . . . . 82

6 Summary and conclusion 89

List of Figures

2.1 A eukaryotic cell. (Farabee, 2007) . . . . . . . . . . . . . . . . . . . . 7

2.2 Structure of DNA. (from National Human Genome Research Institute) 8

2.3 Structure of RNA. (from National Human Genome Research Institute) 9

2.4 Structure of Protein (Villarreal, 2008) . . . . . . . . . . . . . . . . . . 11

2.5 Splicing. (Alberts et al., 2003) . . . . . . . . . . . . . . . . . . . . . . 13

2.6 The Standard Genetic Code (Godfrey-Smith and Sterelny, 2008) . . . 14

2.7 Central dogma of biology (Horspool, 2008) . . . . . . . . . . . . . . . 15

2.8 Two oligos hybridized with their target DNA sequences . . . . . . . . 16

2.9 Melting temperature of DNA . . . . . . . . . . . . . . . . . . . . . . 17

2.10 Dot plot of two sequences. . . . . . . . . . . . . . . . . . . . . . . . . 21

2.11 Needleman-Wunsch alignment of two sequences. . . . . . . . . . . . . 23

2.12 Smith-Waterman alignment of two sequences. . . . . . . . . . . . . . 24

2.13 Example of a hit by a spaced seed. . . . . . . . . . . . . . . . . . . . 26

2.14 Suffix tree for x = TCGTAACGACC. . . . . . . . . . . . . . . . . . . 28

2.15 Suffix array for x = TCGTAACGACC. . . . . . . . . . . . . . . . . . 28

4.1 General description of the problem. . . . . . . . . . . . . . . . . . . . 48

4.2 Possibility of overlapping hits for a. consecutive seed. b. spaced seed. 55

4.3 Example of computing overlap complexity for two spaced seeds. . . . 56

4.4 The set of multiple spaced seeds used in the fast homology search phase. 58

4.5 Fast homology search. . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.6 The set of multiple spaced seeds used in intensive homology search phase. 60

4.7 GC-content evaluation process. . . . . . . . . . . . . . . . . . . . . . 61

4.8 Examples of forming secondary structures a. hairpin b. dimer. . . . . 64

4.9 The OligoDesign algorithm. . . . . . . . . . . . . . . . . . . . . . . 66

5.1 Seeds used for fast homology search (s8w10) and for intensive homology

search (s8w9). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


search (s8w8). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


search (s16w9). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.4 Seeds used for double fast homology search (s8w10) and (s8w8). . . . 74

5.5 Seeds used for fast homology search (s8w10). . . . . . . . . . . . . . . 75


search (s8w9). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


search (s16w9). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.8 Eight spaced seeds of weight six used in the evaluation program. . . . 82

List of Tables

4.1 Summary of the ways that related algorithms assess the parameters

involved in oligo design. . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.2 Encoding of input DNA sequences . . . . . . . . . . . . . . . . . . . . 50

4.3 Nearest-Neighbor parameters for DNA/DNA duplexes (SantaLucia,

1998). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.1 Input data sets used for experimental results. . . . . . . . . . . . . . 68

5.2 Results by employing s8w10-s8w9. . . . . . . . . . . . . . . . . . . . . 72


5.4 Results by employing s8w10-s16w9. . . . . . . . . . . . . . . . . . . . 73


5.6 Results by employing s8w10. . . . . . . . . . . . . . . . . . . . . . . . 75


5.8 Results by employing s8w11-s16w9. . . . . . . . . . . . . . . . . . . . 77

5.9 Description of the oligo design software programs used in comparison. 78

5.10 Results for ArrayOligoSelector. . . . . . . . . . . . . . . . . . . . . . 79

5.11 Results for OligoArray. . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.12 Results for OligoPicker. . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.13 Results for OligoWiz. . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.14 Results for PICKY. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.15 Results for YODA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.16 Evaluation for mousenervous. . . . . . . . . . . . . . . . . . . . . . . . 83

5.17 Evaluation for ecoli. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.18 Evaluation for bee. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.19 Evaluation for yeast. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.20 Evaluation for plasmodium. . . . . . . . . . . . . . . . . . . . . . . . 84

5.21 Evaluation for zebrafish. . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.22 Evaluation for drosophila. . . . . . . . . . . . . . . . . . . . . . . . . 85

5.23 Evaluation for chicken. . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.24 Evaluation for celegans. . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.25 Evaluation for arabidopsis. . . . . . . . . . . . . . . . . . . . . . . . . 86

5.26 Evaluation for maize. . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.27 Evaluation for mouse. . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.28 Evaluation for human. . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.29 Evaluation for mouserna. . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.30 Evaluation for rice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.31 Total comparison of the algorithms for all data sets. . . . . . . . . . . 88

Chapter 1

Introduction

An oligonucleotide is a short fragment of a nucleic acid polymer, DNA or RNA, that is designed to hybridize with a unique region of a target sequence to which it is complementary. (The nucleotides A and C are complementary to T and G, respectively.) Here, hybridization is the process by which two complementary DNA sequences bind into a single double-stranded DNA molecule; binding occurs through hydrogen bonds between base pairs. In other words, the oligonucleotide serves as a probe that uniquely identifies the target sequence. A biologist can detect the presence of the probe’s complementary fragment in a larger DNA sequence by hybridizing the probe to an unknown DNA fragment of interest.

Oligonucleotides are generally classified by length as either short, between 20 and 30 nucleotides, or long, between 50 and 70 nucleotides. They have a wide range of applications in medicine and molecular biology: in medicine, they are used as probes to screen for diseases and viral infections; in molecular biology, they are used in DNA microarray design, polymerase chain reaction (PCR) amplification, and gene identification.

M.Sc. Thesis - Hamid Mohamadi McMaster - Computer Science

The major computational challenge in designing oligonucleotide probes is finding the optimum probe for each target sequence in a set of DNA/RNA sequences. An optimum oligonucleotide must discriminate well between its target sequence and all non-target sequences. It must also be unique to its target, so that it hybridizes only with its complementary region in the target sequence and does not cross-hybridize with non-target sequences. More precisely, each probe must be specific to its target sequence and sensitive enough to detect it, and the set of oligonucleotides must behave uniformly under the same experimental conditions.

To design the optimum set of oligonucleotides, experts in this area (Lockhart et al., 1996; Kane et al., 2000) have proposed several criteria and parameters, such as:

• Similarity of an oligonucleotide to non-target sequences

• Maximum consecutive match of an oligonucleotide to non-target sequences

• Melting temperature

• GC-content

• Secondary structure

• Low complexity regions

Of these criteria, the first two are the most important for designing the optimum set of oligonucleotides.

Various algorithms and programs exist for designing oligonucleotide probes, such as ArrayOligoSelector (Bozdech et al., 2003), GoArrays (Rimour et al., 2005), OligoArray (Rouillard et al., 2003), OligoPicker (Wang and Seed, 2003), OligoWiz (Nielsen et al., 2003), PICKY (Chou et al., 2004), ProbeSel (Kaderali and Schliep, 2002), ProbeSelect (Li and Stormo, 2001), ProDesign (Feng and Tillier, 2007), ProMide (Rahmann, 2003), ROSO (Reymond et al., 2004), and YODA (Nordberg, 2004). These programs differ in the parameters and criteria they consider and in how they identify and select the optimal oligonucleotides. To handle low-complexity regions, most of the algorithms apply a filter or mask to nucleotide repeats, while others rely on user-defined prohibited regions, lossless compression calculations, properties of the suffix array structure, or a custom complexity score.

To achieve maximum uniformity for the set of oligonucleotides, the melting temperatures of the oligonucleotides must differ only slightly. Several approaches are available for computing the melting temperature; the most commonly used is the Nearest Neighbour model with the parameters of either SantaLucia (1998) or Rychlik et al. (1990). GC-content, which is closely related to the melting temperature, is evaluated by having the user define a fixed range or threshold and filtering out oligonucleotides that fall outside it. Secondary structure can be assessed in two ways: by checking self-complementarity, that is, aligning the oligonucleotide with its reverse-complement sequence, or by thermodynamic calculations that determine the stability of potential secondary structures. The core procedure of these algorithms for cross-hybridization assessment is based on sequence similarity search tools such as suffix trees, suffix arrays, and seeds. Suffix trees and suffix arrays are well-known data structures for exact pattern matching and text searching (Manber and Myers, 1990; Smyth, 2003; Puglisi et al., 2007), but they do not work well for approximate text searching, so heuristic approaches such as BLAST, which is based on seeds, have been proposed for this task.

Seeds were first defined and used in BLAST (Altschul et al., 1990), the Basic Local Alignment Search Tool, which is the most widely used algorithm in bioinformatics for homology search. Instead of the quadratic-time dynamic programming algorithm of Smith and Waterman (1981), which is infeasible for long sequences, BLAST searches for a seed of 11 contiguous matches between the sequences as an indicator of potential local similarity. PatternHunter (Ma et al., 2002), by contrast, uses a spaced seed; that is, the 11 matches are not consecutive but separated by don’t care positions. The sensitivity of this spaced seed is significantly higher than that of BLAST’s contiguous seed. Some software programs use a combination of several spaced seeds, which further increases the chance of finding similarities. In fact, multiple spaced seeds quickly became the state of the art in similarity search for biological applications because of their efficiency and flexibility.
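As a minimal illustrative sketch (not taken from any of the cited tools), the difference between a contiguous and a spaced seed can be shown in a few lines of Python. The function name `seed_hits`, the two example sequences, and the representation of a seed as a string of 1s (required matches) and 0s (don’t care positions) are assumptions made here for illustration; the spaced seed shown is the weight-11 pattern commonly reported for PatternHunter.

```python
def seed_hits(seed: str, a: str, b: str) -> list:
    """Return every offset at which `seed` hits the ungapped alignment of a and b.

    `seed` is a string over {'1', '0'}: '1' marks a position that must match,
    '0' marks a don't care position.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (ungapped alignment)")
    # Positions inside the seed window where the two sequences must agree.
    required = [i for i, c in enumerate(seed) if c == "1"]
    hits = []
    for offset in range(len(a) - len(seed) + 1):
        if all(a[offset + i] == b[offset + i] for i in required):
            hits.append(offset)
    return hits


CONTIGUOUS = "1" * 11              # 11 consecutive matches, as in BLAST
SPACED = "111010010100110111"      # weight 11, length 18 (illustrative)

# Hypothetical sequences: b differs from a at positions 3, 8 and 14 only.
a = "ACGTACGTACGTACGTAC"
b = "ACGAACGTCCGTACTTAC"

print(seed_hits(CONTIGUOUS, a, b))  # no window of 11 consecutive matches
print(seed_hits(SPACED, a, b))      # the spaced seed hits at offset 0
```

In this example the three mismatches all fall on don’t care positions of the spaced seed, so it still reports a hit although no run of 11 contiguous matches exists.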

In this thesis, we present a new algorithm to design the optimum set of oligonucleotide probes for a given set of target sequences. The proposed algorithm employs multiple spaced seeds as the heart of the homology search procedure for cross-hybridization management. To maximize the uniformity of the resulting set of oligonucleotides, our algorithm first calculates the melting temperature of each candidate using the Nearest Neighbour model with the parameters from SantaLucia (1998), then computes the average melting temperature of all oligonucleotide candidates, and finally retains only the candidates whose melting temperature lies within a predefined fixed range of that average. GC-content is evaluated by setting a fixed range and filtering out candidates that fall outside it. Secondary structure and low-complexity regions are assessed implicitly in the cross-hybridization management step. Finally, we show that our algorithm discovers more unique and useful oligonucleotides and executes orders of magnitude faster than the other algorithms that have been proposed for the same task.
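The two filtering steps described above, a fixed GC-content range and a fixed window around the average melting temperature, can be sketched as follows. This is only an illustration under stated assumptions: the names `gc_content`, `wallace_tm`, and `filter_candidates`, the default range and window values, and the example candidates are hypothetical, and the simple 2(A+T) + 4(G+C) rule is a stand-in used to keep the sketch runnable, not the Nearest Neighbour model used in this thesis.

```python
from statistics import mean


def gc_content(oligo: str) -> float:
    """Fraction of G and C bases in the oligo (between 0.0 and 1.0)."""
    oligo = oligo.upper()
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def wallace_tm(oligo: str) -> float:
    """Crude stand-in for a melting temperature model: 2(A+T) + 4(G+C).

    This is NOT the Nearest Neighbour model; it only keeps the sketch runnable.
    """
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2.0 * at + 4.0 * gc


def filter_candidates(candidates, tm_of=wallace_tm,
                      gc_range=(0.40, 0.60), tm_window=5.0):
    """Keep candidates inside gc_range whose Tm is within tm_window degrees
    of the average Tm over all candidates (hypothetical default values)."""
    tms = {c: tm_of(c) for c in candidates}
    avg_tm = mean(tms.values())
    low, high = gc_range
    return [c for c in candidates
            if low <= gc_content(c) <= high and abs(tms[c] - avg_tm) <= tm_window]


# Hypothetical 10-mers; real probes are 20 to 70 nucleotides long.
candidates = ["ACGTACGTAC", "GGGGCCCCGG", "ATATATATAT", "ACGGTCAGTT"]
kept = filter_candidates(candidates)
print(kept)
```

Passing a different `tm_of` function is where a Nearest Neighbour implementation would be plugged in; the filtering logic itself does not depend on how the melting temperature is computed.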

The thesis is organized in six chapters. Chapter 2 introduces the preliminaries and basic notation from molecular biology, such as genome, gene, DNA, and oligonucleotides, as well as the related concepts from computer science, such as alignment and homology search using seeds. Chapter 3 briefly describes related work and current algorithms for designing oligonucleotides. Chapter 4 explains our proposed algorithm in detail. Chapter 5 presents the experimental results and evaluates and compares our algorithm with the best-known algorithms available for oligonucleotide design; the evaluation and comparison of all algorithms are carried out by a separate program that we wrote for this purpose. The thesis concludes in Chapter 6 with a few remarks on the importance of our contribution and on possible further research.

Chapter 2

Preliminaries

In this chapter, we present the concepts and definitions used in this thesis. The chapter has two sections: the first introduces terminology and concepts from molecular biology, and the second explains concepts and terms from computer science.

2.1 Molecular biology primer

Understanding this thesis requires some basic concepts from biology, so we provide a short introduction to the biological background.

2.1.1 Organisms and cells

All organisms are composed of cells. A cell is the fundamental working unit of every living system; it is capable of independent functioning and comprises several building blocks surrounded by a cell membrane. Organisms can be classified as unicellular (consisting of only one cell), which includes most bacteria, or multicellular, which includes most but not all fungi, plants, and animals.

There are two types of cells: prokaryotic and eukaryotic. While prokaryotic cells are usually independent, eukaryotic cells are often found in multicellular organisms. Most familiar organisms, such as flowers, trees, worms, flies, mice, and humans, are eukaryotes. Prokaryotes are simpler and smaller than eukaryotes; they lack a nucleus and most of the other organelles of eukaryotes. The main difference between the two is that eukaryotic cells contain membrane-bound compartments in which specific metabolic activities occur. Cells make decisions through complex networks of chemical reactions called pathways.

Figure 2.1: A eukaryotic cell. (Farabee, 2007)

2.1.2 DNA, RNA, and protein

The chemical components of a cell are water, which constitutes 70% of the cell’s weight; small molecules (salts, lipids, amino acids, and nucleotides), which constitute 7%; and macromolecules (proteins, DNA, and RNA), which constitute 23% (Alberts et al., 2003).

Figure 2.2: Structure of DNA. (from National Human Genome Research Institute)

DNA, deoxyribonucleic acid, is a nucleic acid that is the major information-carrying molecule in a cell. It encodes the genetic material that determines what an organism will develop into and how it functions. A DNA molecule, also called a polynucleotide, is a chain of small molecules called nucleotides. There are four different nucleotides, grouped into two types: the purines, adenine and guanine, and the pyrimidines, cytosine and thymine. They are usually referred to as bases and denoted by their initial letters, A, C, G, and T. A and T are complementary, as are C and G. As shown in Fig. 2.2, the two DNA strands are held together in the shape of a double helix by hydrogen bonds between complementary bases.
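The complementarity rule (A with T, C with G) is easy to express in code. The following is a minimal sketch; the function names `complement` and `reverse_complement` are chosen here for illustration, and the input is assumed to contain only the letters A, C, G, and T.

```python
# Translation table implementing the pairing rule A<->T, C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-by-base complement of a DNA sequence (same reading direction)."""
    return seq.upper().translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement read in the opposite direction, i.e. the opposite strand."""
    return complement(seq)[::-1]


print(complement("GATACC"))          # CTATGG
print(reverse_complement("GATACC"))  # GGTATC
```

The reverse complement is the form usually needed in practice, because the two strands of a double helix run in opposite directions.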

Figure 2.3: Structure of RNA. (from National Human Genome Research Institute)

RNA, ribonucleic acid, is chemically similar to DNA and is also constructed from nucleotides. However, it is usually single-stranded, and instead of thymine (T) it contains uracil (U), which is generally not found in DNA (Fig. 2.3). Because of this minor difference, RNA does not form a stable double helix, but it can form secondary structures by pairing up with itself. Several types of RNA exist, with various functions in a cell: mRNA, or messenger-RNA, carries a gene’s message out of the nucleus; tRNA, or transfer-RNA, transfers genetic information from mRNA to an amino acid sequence; rRNA, or ribosomal-RNA, is a part of the ribosome, which is involved in translation.

Proteins, which make up almost 20% of a eukaryotic cell, are the second major building blocks and functional molecules of the cell after water. They can be classified into the following groups:

• structural proteins, which can be considered the basic building blocks of an organism, as in bones and connective tissues;

• enzymes, which carry out biochemical reactions such as altering, joining together, or chopping up other molecules; these reactions and the pathways they form are collectively called metabolism;

• transmembrane proteins, which are key to maintaining the cellular environment: they regulate cell volume, extract and concentrate small molecules from the extracellular environment, and generate the ionic gradients essential for muscle and nerve cell function.

There are also four levels of protein structures (Fig. 2.4):

1. primary structure, in which a protein is a chain of amino acids of 20 different types that can be joined together in any linear order; such chains are sometimes called polypeptide chains and can be represented as strings over 20 different symbols;

2. secondary structure, which arises when the sequence of amino acids affects the folding and usually takes the form of three substructures in folded chains: two common substructures, alpha-helices and beta-strands, which are typically joined by a third, less regular substructure called loops;

3. tertiary structure, in which a fixed, relatively stable three-dimensional structure forms because parts of the protein chain come into contact with each other through various repulsive or attractive forces, such as hydrogen bonds, attractions between positive and negative charges, disulfide bridges, and hydrophobic and hydrophilic forces;

4. quaternary structure, which is formed when more than one chain of amino acids forms the protein.

Figure 2.4: Structure of Protein (Villarreal, 2008)

2.1.3 Genome, chromosome, and gene

A cell may contain many long DNA molecules, which are organized as chromosomes. DNA in eukaryotic chromosomes winds around complex structures called histones. Mitochondria, which are membrane-enclosed organelles found in most eukaryotic cells, also contain DNA, but the amount is very small in comparison to chromosomal DNA. Together, mitochondrial and chromosomal DNA make up the genome of the organism. The genome, which every organism has, encodes all the hereditary information of the organism. The chromosomes of eukaryotes reside in the nucleus, where the nuclear membrane encloses them and separates them from the mitochondrial genomes.

All cells in an organism arise through DNA replication at each cell division and therefore have identical genomes. A gene is often a contiguous stretch of DNA that encodes instructions for creating proteins. A complex molecular process reads the information in a gene, which is encoded as a string of A, C, G, and T, and forms a particular protein or a few different proteins. This process, called protein synthesis, has three main phases: transcription, splicing, and translation.

In the transcription phase, one strand of a double-stranded DNA molecule unwinds

in the nucleus and its information is copied into a molecule of mRNA. Then, the

mRNA exits from the cell nucleus.

In the splicing phase, some segments of the mRNA, called introns, are removed, and the remaining segments, called exons, are joined together. Introns must be removed because of the way eukaryotic genomes are organized: the DNA segment that corresponds to the coding region of a gene is not continuous but consists of exons and introns. Exons are the segments of the gene that may or may not code for proteins; they are interspersed with spliceosomal introns, which are generally removed by splicing. Prokaryotic genes do not include introns, so they have no splicing phase. The result of splicing is the final, edited mRNA (Fig. 2.5).

Figure 2.5: Splicing. (Alberts et al., 2003)

In translation, proteins are made by joining amino acids together in the order encoded in the mRNA. The mRNA sequence is read as a sequence of triplets, called codons, that map to amino acids according to the standard genetic code. In this process, ribosomes, the components of cells that synthesize protein chains, build proteins in the cytoplasm from the mature mRNA transcript produced by transcription and splicing. There are 64 codons but only 20 amino acids, so the code is redundant: several codons may encode the same amino acid; for instance, lysine is encoded by both AAA and AAG. Figure 2.6 shows the standard genetic code. The three-letter abbreviations, such as “Ser” and “Asn”, denote amino acids. Each amino acid is carried to the ribosome by a tRNA molecule that specifically recognizes one or more codons on the mRNA. The amino acids are then added to the nascent protein. Translation is the final step of gene expression, and its result is a protein that corresponds to the chain encoded by the mRNA.
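The codon-by-codon reading described above can be sketched in Python. The table below is deliberately partial: it lists only a handful of entries of the standard genetic code, and the name `translate` and the example mRNA string are hypothetical choices made for this illustration.

```python
# Partial excerpt of the standard genetic code (RNA codons); not the full table.
CODON_TABLE = {
    "AUG": "Met",
    "AAA": "Lys", "AAG": "Lys",      # two codons for lysine (redundancy)
    "UUU": "Phe", "UUC": "Phe",
    "AAU": "Asn", "AAC": "Asn",
    "AGU": "Ser", "AGC": "Ser",
    "UGG": "Trp",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def translate(mrna: str) -> list:
    """Read the mRNA three bases at a time and map each codon to an amino acid.

    Translation ends at the first stop codon. A codon missing from the
    partial table above raises KeyError.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


print(translate("AUGAAAAAGUUUUGGUAA"))
```

Both AAA and AAG map to lysine in the example, which shows the redundancy of the code in a concrete case.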

Figure 2.6: The Standard Genetic Code (Godfrey-Smith and Sterelny, 2008)

Figure 2.7 summarizes the material covered so far in this section and gives an overview of the central dogma of molecular biology, with the usual flows of information shown as solid arrows and the unusual flows as dashed arrows.

Figure 2.7: Central dogma of biology (Horspool, 2008)

2.1.4 Oligonucleotides

Oligonucleotides, often abbreviated as oligos, are short pieces of single-stranded DNA

or RNA molecules that are designed to bind with unique positions in target sequences.

The way an oligo binds to its target strand allows scientists to employ oligos as research tools. An oligo can be used to find and bind to its matching target sequence even in a complex pool of millions of unrelated pieces of DNA or RNA. Using this property, researchers are able to decode and study the genetic makeup of any living organism, from bacteria to humans. Figure 2.8 shows two oligos hybridized to their target DNA sequences.

Oligos are usually classified by length as either short, from 20 to 30 nucleotide bases, or long, from 50 to 70 nucleotide bases, and they have a wide range of applications in medicine and molecular biology. They can be used as probes to screen for diseases and viral infections in medicine, and in DNA microarray design, polymerase chain reaction (PCR) amplification, gene identification, northern blots, and southern blots in molecular biology. Moreover, oligos can be used in diagnostic tests for genetic diseases, such as breast cancer or cystic fibrosis, and for infectious diseases, such as hepatitis or AIDS. They can also be utilized in research to discover new drugs or treatments for a variety of diseases, or to produce safer and more plentiful agricultural products.

Oligo1:                 GATACCGAGGTGATGAAATGCATCGTTGAGGTCATCTCCGACACACTTTC
                        ||||||||||||||||||||||||||||||||||||||||||||||||||
Target1: GGACGATCTTTACGGCTATGGCTCCACTACTTTACGTAGCAACTCCAGTAGAGGCTGTGTGAAAGTCAGGTCTCT

Oligo2:            GAGAGCGGATCGGGGAGCATTTGCGGATCGGTCACTTTTTCCTC
                   ||||||  ||||||||||||||||||||||||||||  | ||||
Target2: TAGTGGTGGCCTCTCGTTTAGCCCCTCGTAAACGCCTAGCCAGTGACGACGGAGAACATTGCACGT

Figure 2.8: Two oligos hybridized with their target DNA sequences

The major computational challenge in designing oligos is finding the optimum oligo for each target sequence in a set of DNA/RNA sequences. An optimum oligo must discriminate well between its target sequence and all non-target sequences. It must also be unique to its target, so that it hybridizes only with its complementary region in the target sequence and does not cross-hybridize with non-target sequences. More precisely, three properties are required: each oligo probe must be specific to its target sequence, which is referred to as the specificity of oligos; oligos must be sensitive enough to detect their target sequences, which is known as the sensitivity of oligos; and the set of oligonucleotides must behave uniformly under the same experimental conditions, for example by having similar melting temperatures, which is called the uniformity of the oligo set.

2.1.5 Thermodynamics of DNA

One of the important parameters in designing oligos is the melting temperature, Tm. The melting temperature of an oligo duplex is the temperature at which the oligo is 50% annealed to its complement; that is, 50% of the molecules are single-stranded while 50% are in double-stranded form (Figure 2.9). Inaccurate prediction of Tm increases the probability of a failed assay design. The melting temperature of an oligo usually depends on three major factors (Owczarzy et al., 2008): oligo concentration, since high DNA concentrations favor duplex formation; salt concentration, since higher ionic concentrations of the solvent lead to increases in Tm; and oligo sequence, since sequences with a higher fraction of GC base pairs generally have a higher Tm than AT-rich sequences.

Figure 2.9: Melting temperature of DNA

Several approaches are available for computing the melting temperature. The

most commonly used approach is applying the Nearest Neighbour model with either

the parameters from SantaLucia (1998) or the parameters from Rychlik et al. (1990).

If the concentration of the oligo is much higher than the concentration of the DNA

target, the following thermodynamic relationship can be used to predict Tm:

\[
T_m = \frac{\Delta H}{\Delta S + R \ln(C/4)} - 273.15 \tag{2.1}
\]

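where ∆H (kcal/mol) is the enthalpy change, that is, the total energy exchange between
the system and its surrounding environment, ∆S (cal/(mol·K)) is the entropy change, which
reflects the energy spent by the system to organize itself, R = 1.987 cal/(mol·K) is the
ideal gas constant, and C is the molar concentration of the oligo (Owczarzy et al., 2008).
Because ∆H is given in kcal/mol and ∆S in cal/(mol·K), ∆H has to be converted to cal/mol
(multiplied by 1000) before the quotient is evaluated; the quotient is a temperature in
Kelvin, and subtracting 273.15 expresses Tm in degrees Celsius.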
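Equation 2.1 can be evaluated directly once ∆H and ∆S are known. The following Python sketch is only an illustration: the function name and the example values are hypothetical, and the explicit conversion of ∆H from kcal/mol to cal/mol is an assumption made so that the units of numerator and denominator agree.

```python
import math

R = 1.987  # ideal gas constant in cal/(mol*K), as given in the text


def melting_temperature(delta_h_kcal, delta_s_cal, oligo_conc_molar):
    """Tm in degrees Celsius from Equation 2.1 (oligo in excess over target)."""
    delta_h_cal = 1000.0 * delta_h_kcal  # assumption: kcal/mol converted to cal/mol
    tm_kelvin = delta_h_cal / (delta_s_cal + R * math.log(oligo_conc_molar / 4.0))
    return tm_kelvin - 273.15


# Hypothetical duplex: dH = -100 kcal/mol, dS = -270 cal/(mol*K), C = 1 micromolar
tm_example = melting_temperature(-100.0, -270.0, 1e-6)
tm_more_concentrated = melting_temperature(-100.0, -270.0, 1e-5)
print(f"Tm = {tm_example:.1f} C, at 10x concentration: {tm_more_concentrated:.1f} C")
```

In this sketch a tenfold higher oligo concentration yields a higher Tm, in line with the remark that high DNA concentrations favor duplex formation.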

Another major factor in designing oligos is hybridization free energy, ∆G. Hybridization
of a single-stranded (SS) oligo and another single-stranded sequence is a chemical
reaction where two single-stranded sequences come together to form a duplex. The
transition from one state to the other changes the energy of the system and can be
summarized in equation 2.2:

[Oligo(SS)] + [Sequence(SS)] ⇌ [Duplex] (2.2)

where [Oligo(SS)] and [Sequence(SS)] denote the concentrations of the oligo and the
sequence, respectively, in the system, and [Duplex] denotes the concentration of the
duplex. ∆G is the change in Gibbs free energy (kcal/mol), the net exchange of energy
between the system and its environment. It determines the stability of the binding
between an oligo and a sequence. Oligos with low binding free energy with the target
sequence and high binding free energy with non-target sequences are required because
they bind more stably to their targets than to non-targets, which reduces
cross-hybridization. For the task of oligo selection, a threshold may be set for binding
free energy so that oligos with lower free energy than the threshold will be picked.
∆G can be obtained by equation 2.3:

∆G = ∆H − T ×∆S (2.3)

where ∆H and ∆S are the changes in enthalpy and entropy, respectively, associated

with duplex formation, and T (Kelvin) represents the absolute temperature of the

system.

Hybridization free energy, ∆G, and melting temperature, Tm, are derived differently and
cannot be converted into one another directly, even though they share the basic
components enthalpy, ∆H, and entropy, ∆S. A given ∆G can be related to a given Tm only
when the values of ∆H and ∆S from which they are derived are known explicitly.
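A small numerical illustration of this point, using equations 2.1 and 2.3: two hypothetical duplexes are constructed to have the same ∆G at 37 °C, yet their melting temperatures differ. All values and names below are invented for the illustration, and ∆H is converted to cal/mol in the Tm formula as a unit assumption.

```python
import math

R = 1.987  # cal/(mol*K)


def gibbs_free_energy(delta_h_kcal, delta_s_cal, temp_kelvin):
    """Equation 2.3 in kcal/mol (delta_s converted from cal to kcal)."""
    return delta_h_kcal - temp_kelvin * delta_s_cal / 1000.0


def melting_temperature(delta_h_kcal, delta_s_cal, conc_molar):
    """Equation 2.1 in degrees Celsius (delta_h converted from kcal to cal)."""
    return 1000.0 * delta_h_kcal / (delta_s_cal + R * math.log(conc_molar / 4.0)) - 273.15


T = 310.15  # 37 degrees Celsius in Kelvin
dh_a, ds_a = -80.0, -220.0           # hypothetical duplex A
dg_a = gibbs_free_energy(dh_a, ds_a, T)

dh_b = -120.0                        # hypothetical duplex B
ds_b = (dh_b - dg_a) * 1000.0 / T    # entropy chosen so that dG of B equals dG of A
dg_b = gibbs_free_energy(dh_b, ds_b, T)

tm_a = melting_temperature(dh_a, ds_a, 1e-6)
tm_b = melting_temperature(dh_b, ds_b, 1e-6)
print(f"dG: {dg_a:.2f} vs {dg_b:.2f} kcal/mol; Tm: {tm_a:.1f} vs {tm_b:.1f} C")
```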

2.2 Related computer science notation

In this section, we provide some preliminary and basic definitions and notation from

computer science used in this thesis.

2.2.1 Sequence alignments

The comparison of strings can be done in many different ways. An alignment is a

way of arranging strings to determine regions of close similarity between them.



One of the essential problems in biology is determining whether two or more genomic
or protein sequences are related and consequently whether sequences show
similarity by chance or because of common ancestry. In biology, the term similarity
applies to sequences that are in some sense similar and has no evolutionary
connotations, while the term homology describes sequences which are evolutionarily
related and stem from a common ancestor (Golding et al., 2011). So, generally, alignment
in biology is an arrangement of two evolutionarily related sequences to identify their
regions of similarity that may indicate their functional and structural relationships.

The most straightforward method for performing comparison between sequences

is a dot plot. Using a dot plot, it is easy to identify regions of similarity, such
as repeats and rearrangements, in sequences. A dot plot can be created by arranging

one of the sequences along the vertical axis of a matrix and the other one along the

horizontal axis. Then, a dot is placed in all of the entries of the matrix where the

letter along both axes is the same (Figure 2.10).

A long sequence of dots on a diagonal represents regions of similarity between the

two sequences. To reduce the noise of a dot plot matrix, it is useful to filter it by

defining window sizes, w, and stringencies, s. By applying these parameters to a dot

plot matrix, a dot will be placed when there are at least s matches in the w closest

entries on the same diagonal. While dot plots provide a useful way to visualize the

sequences being compared, they are not so useful in performing an actual alignment

between two sequences. To do this, other methods are required.
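The window and stringency filter can be sketched in a few lines of Python. This is an illustrative reading of the description, not a reference implementation: the function name is hypothetical, and the window is assumed to start at the current entry and extend forward along the diagonal. With w = 1 and s = 1 the unfiltered dot plot is obtained.

```python
def dot_plot(x, y, w=1, s=1):
    """Boolean matrix with rows following y and columns following x.

    A dot is placed at (i, j) when at least s of the w diagonal entries
    starting at (i, j) are matches.
    """
    rows, cols = len(y), len(x)
    plot = [[False] * cols for _ in range(rows)]
    for i in range(rows - w + 1):
        for j in range(cols - w + 1):
            matches = sum(y[i + k] == x[j + k] for k in range(w))
            plot[i][j] = matches >= s
    return plot


x, y = "CAGACTGTAA", "CTGACTGG"   # the two sequences of Figure 2.10
plain = dot_plot(x, y)            # every single-letter match is a dot
filtered = dot_plot(x, y, w=4, s=4)  # only runs of four matches survive

for row_letter, row in zip(y, filtered):
    print(row_letter, "".join("*" if dot else "." for dot in row))
```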

Figure 2.10: Dot plot of two sequences (CAGACTGTAA along the horizontal axis, CTGACTGG along the vertical axis).

Formally, an alignment of two sequences X of length n and Y of length m is a
pair of sequences X′ and Y′ of the same length that are obtained from X and Y
by inserting space characters. There may be several alignments for two sequences,
so an alignment score can be defined as follows to evaluate the alignment:

$$S = \sum_{i=1}^{L} s(X'[i], Y'[i]) \qquad (2.4)$$

where X′[i] and Y′[i] represent the ith character of X′ and Y′, respectively,
s(X′[i], Y′[i]) is the score of aligning the two characters X′[i] and Y′[i], and L is
the common length of X′ and Y′. An optimal alignment of two sequences is an alignment
with the maximum alignment score.
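As a minimal sketch of equation 2.4, the score of a given gapped alignment can be computed column by column. The function name is hypothetical, the space character is written as "-", and the default scores (2 for a match, −3 for a mismatch, −2 for a gap) are assumptions taken from the example used later in this section.

```python
def alignment_score(x_aligned, y_aligned, match=2, mismatch=-3, gap=-2):
    """Sum of the per-column scores of two aligned strings of equal length."""
    if len(x_aligned) != len(y_aligned):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(x_aligned, y_aligned):
        if a == "-" or b == "-":
            total += gap        # a character aligned with a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


example_score = alignment_score("ACTG-ATTCA", "AC-GCAT-CA")
print(example_score)
```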

There are two types of alignment: global alignment and local alignment. A global

alignment is an alignment in which all of the characters in both sequences are involved

in the alignment while a local alignment compares regions of all possible lengths in-

stead of taking into account the total sequence, and identifies similar regions between

two sequences.



The first global alignment algorithm is the Needleman-Wunsch algorithm that was

developed in 1970 (Needleman and Wunsch, 1970). It is an application of the dynamic

programming approach to find the optimal alignment, which we now explain. The

idea behind the algorithm is motivated by the observation that any sub-path ending

at a position within the optimal path must itself be optimal at that position. So, the

optimal path can be identified by extending the optimal sub-path. The algorithm is

an elegant and simple way to obtain an alignment which maximizes a specific score.

The Needleman-Wunsch algorithm works as follows. First, a matrix M is created
like the dot plot matrix, with one additional row and column whose entries are
initialized to multiples of the gap score, using predefined scores for
matches, mismatches and gaps. Then, using the following relation, the score of each
remaining entry M_{i,j} is determined:

$$M_{i,j} = \max\left(M_{i-1,j-1} + s(X_i, Y_j),\; M_{i-1,j} + s_{\text{gap}},\; M_{i,j-1} + s_{\text{gap}}\right) \qquad (2.5)$$

Finally, the optimal path is identified by starting from the bottom-right entry of the
matrix and walking back to the left and to the top, at each step moving to the
neighbouring entry from which the current score was obtained, until the top-left entry
is reached. The time complexity of the Needleman-Wunsch algorithm is O(mn) and the
space complexity is O(mn) as well.

As an example, suppose that we want to perform the Needleman-Wunsch algorithm
on the two given sequences X = ACTGATTCA and Y = ACGCATCA with
s_match = 2, s_mismatch = −3, and s_gap = −2. All steps mentioned above are summarized
in Figure 2.11 which results in the following alignment:

A C T G - A T T C A
| |   |   | |   | |
A C - G C A T - C A



        A   C   T   G   A   T   T   C   A
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
A  -2   2   0  -2  -4  -6  -8 -10 -12 -14
C  -4   0   4   2   0  -2  -4  -6  -8 -10
G  -6  -2   2   1   4   2   0  -2  -4  -6
C  -8  -4   0  -1   2   1  -1  -3   0  -2
A -10  -6  -2  -3   0   4   2   0  -2   2
T -12  -8  -4   0  -2   2   6   4   2   0
C -14 -10  -6  -2  -3   0   4   3   6   4
A -16 -12  -8  -4  -5  -1   2   1   4   8

Figure 2.11: Needleman-Wunsch alignment of two sequences.
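The recurrence (2.5) and the traceback can be sketched in Python as follows. This is a plain illustration under stated assumptions: the function name is hypothetical, X runs along the columns and Y along the rows as in Figure 2.11, and ties in the traceback are broken in the order diagonal, left, up. Since several optimal alignments may exist, the alignment returned can differ from the one shown above while having the same score.

```python
def needleman_wunsch(x, y, match=2, mismatch=-3, gap=-2):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # One row per character of y and one column per character of x, plus borders.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[j - 1] == y[i - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i - 1][j] + gap, M[i][j - 1] + gap)

    # Traceback from the bottom-right entry to the top-left entry.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[j - 1] == y[i - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif j > 0 and M[i][j] == M[i][j - 1] + gap:
            ax.append(x[j - 1]); ay.append("-"); j -= 1
        else:
            ax.append("-"); ay.append(y[i - 1]); i -= 1
    return M[m][n], "".join(reversed(ax)), "".join(reversed(ay))


nw_score, nw_x, nw_y = needleman_wunsch("ACTGATTCA", "ACGCATCA")
print(nw_score)
print(nw_x)
print(nw_y)
```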

In the Needleman-Wunsch algorithm, highly similar and short regions may be

missed because the rest of the sequence can outweigh them. Therefore, it makes sense

to look for a local alignment. The Smith-Waterman algorithm (Smith and Waterman,

1981) discovers an alignment which identifies the optimal subsequence pair that gives

the maximum degree of similarity between the two original sequences. This means
that not all characters of the two sequences need to be aligned. The Smith-Waterman
algorithm is obtained by a minor modification of the Needleman-Wunsch algorithm. In the

modification, an alignment path is not required to reach the boundary entries in the

last rows or columns of the alignment matrix, but can begin and end in internal

entries. To do this, zero must be the minimum score placed in the matrix. So, the

score function in the relation (2.5) is changed to the following relation:

$$M_{i,j} = \max\left(M_{i-1,j-1} + s(X_i, Y_j),\; M_{i-1,j} + s_{\text{gap}},\; M_{i,j-1} + s_{\text{gap}},\; 0\right) \qquad (2.6)$$



Figure 2.12 illustrates an example of a local alignment for the given sequences X =

ATGCATCCCATGAC and Y = TCTATATCCGT using the Smith-Waterman algo-

rithm which results in the following alignment:

A T C C

| | | |

A T C C

       A  T  G  C  A  T  C  C  C  A  T  G  A  C
    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
T   0  0  2  0  0  0  2  0  0  0  0  2  0  0  0
C   0  0  0  0  2  0  0  4  2  2  0  0  0  0  2
T   0  0  2  0  0  0  0  2  1  0  0  2  0  0  0
A   0  2  0  0  0  2  0  0  0  0  2  0  0  2  0
T   0  0  4  2  0  0  2  0  0  0  0  4  2  0  0
A   0  2  0  0  0  2  0  0  0  0  2  0  0  2  0
T   0  0  4  2  0  0  4  2  0  0  0  4  0  0  0
C   0  0  2  0  4  0  0  6  4  2  0  0  0  0  2
C   0  0  0  0  2  0  0  4  8  6  4  2  0  0  2
G   0  0  0  2  0  0  0  2  6  5  3  1  4  2  0
T   0  0  2  0  0  0  2  0  4  3  2  5  3  1  0

Figure 2.12: Smith-Waterman alignment of two sequences.
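Relation (2.6) and the local traceback can be sketched in Python as follows. The function name is hypothetical, and because the text does not state the scores used for this example, the sketch assumes the same values as in the Needleman-Wunsch example (2 for a match, −3 for a mismatch, −2 for a gap). The traceback starts at the highest-scoring entry and stops at the first entry with score zero.

```python
def smith_waterman(x, y, match=2, mismatch=-3, gap=-2):
    """Best local alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[j - 1] == y[i - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i - 1][j] + gap, M[i][j - 1] + gap, 0)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Traceback from the maximum entry until a zero entry is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if x[j - 1] == y[i - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i][j - 1] + gap:
            ax.append(x[j - 1]); ay.append("-"); j -= 1
        else:
            ax.append("-"); ay.append(y[i - 1]); i -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


sw_score, sw_x, sw_y = smith_waterman("ATGCATCCCATGAC", "TCTATATCCGT")
print(sw_score, sw_x, sw_y)
```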

Creating the alignment matrix and backtracking in it to find the optimal alignment

takes O(nm) time and space. Due to the quadratic time complexity of the dynamic

programming algorithm, seeds are used in many alignment applications in practice.

Further information about seeds and how they work will be given in the next section.



2.2.2 Seeds for homology search

The quadratic time complexity of the Smith-Waterman dynamic programming algorithm
makes it impractical for large sequences. Hence, approaches that can quickly identify
similarity between sequences are required. BLAST (Altschul et al., 1990), the Basic
Local Alignment Search Tool, is the most widely used algorithm for similarity search.
It is based on the observation that two sequences sharing a long common substring are
likely to be significantly similar and consequently to contain a local alignment. The
default length for the common substring is 11. In other words, the two sequences
being aligned need to have 11 consecutive positions which are identical, or simply
11 matches, and this is represented as the seed 11111111111, where a 1 stands for a match.
The main idea is that it is easier to search for exact matches than for approximate
ones. However, the authors of PatternHunter (Ma et al., 2002) observed that searching
for 11 matches that are not consecutive has a higher probability of finding actual
alignments. Their seed is 111*1**1*1**11*111, where a 1 stands for a match and * stands
for a don’t care position. This means that, when searching for a potential alignment,
only positions corresponding to 1’s are checked, whereas those corresponding to *’s
are ignored.

Such a seed is called a spaced seed as opposed to the contiguous seed of BLAST. The

new seed, implemented in the PatternHunter algorithm, is much more sensitive, in

the sense that it has a much higher chance of finding actual alignments. The number

of 1’s in a seed is called the weight of the seed whereas the total number of symbols

is the length.
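The difference between a contiguous and a spaced seed can be made concrete with a short Python sketch. The function names and the two 18-base example sequences are hypothetical; the second sequence differs from the first at three positions that fall on don’t care positions of the PatternHunter seed.

```python
PATTERNHUNTER_SEED = "111*1**1*1**11*111"
BLAST_SEED = "1" * 11


def seed_weight(seed):
    """Number of match positions (1's) in the seed."""
    return seed.count("1")


def seed_matches(seed, a, b, pos):
    """True if a and b agree at every 1 of the seed placed at offset pos."""
    if pos < 0 or pos + len(seed) > min(len(a), len(b)):
        return False
    return all(c != "1" or a[pos + k] == b[pos + k] for k, c in enumerate(seed))


a = "ACGTACGTACGTACGTAC"
b = "ACGAACGTCCGTACATAC"  # substitutions at 0-based positions 3, 8 and 14

spaced_hit = seed_matches(PATTERNHUNTER_SEED, a, b, 0)
contiguous_hit = any(seed_matches(BLAST_SEED, a, b, p) for p in range(len(a) - 10))
print(seed_weight(PATTERNHUNTER_SEED), len(PATTERNHUNTER_SEED), spaced_hit, contiguous_hit)
```

Both seeds have weight 11, but only the spaced seed detects this pair, because no run of 11 identical consecutive positions exists.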

The sensitivity of a seed is a measure of the seed’s ability to detect similar regions

between sequences. A formal model is required in order to precisely define the sensi-

tivity. First, given two sequences, we define the similarity level, p, as the probability



of having a match between the two sequences. Then we model the alignment as a

Bernoulli random sequence R of 0’s and 1’s so that the probability of a 1 is p, the

similarity level. We shall denote the length of R by N . Given a seed s, we say that

s hits R at position k if aligning the end of s with position k in R causes all 1’s in s

to align with 1’s in R. An example of a hit is shown in Figure 2.13.

11101001111100101101101110010
       111*1**1*1**11*111

Figure 2.13: Example of a hit by a spaced seed.

The sensitivity of the seed s is formally defined as the probability that s hits R. In

addition to s, this probability depends on the similarity p and length of the random

region, N . There is a dynamic programming approach for computing the sensitivity

of a given seed (Li et al., 2004).
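For short seeds, the sensitivity can be computed exactly with a simple dynamic program over the last bits of the random region. The sketch below is not the algorithm of Li et al. (2004); it is a straightforward formulation whose state space grows as 2^(len(seed) − 1), so it is only meant for small illustrations. The function name is hypothetical.

```python
def seed_sensitivity(seed, p, n):
    """Probability that seed hits a Bernoulli(p) 0/1 region of length n."""
    length = len(seed)
    if n < length:
        return 0.0
    ones = [k for k, c in enumerate(seed) if c == "1"]
    # Map: last (length - 1) bits read so far -> probability of no hit yet.
    states = {(): 1.0}
    for _ in range(n):
        nxt = {}
        for window, prob in states.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                w = window + (bit,)
                if len(w) == length and all(w[k] for k in ones):
                    continue  # the seed hits here: remove this mass from "no hit"
                key = w[-(length - 1):] if length > 1 else ()
                nxt[key] = nxt.get(key, 0.0) + prob * q
        states = nxt
    return 1.0 - sum(states.values())


contiguous = seed_sensitivity("11", 0.5, 3)
spaced = seed_sensitivity("1*1", 0.5, 3)
print(contiguous, spaced)
```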

Another variant is the transition-constrained seed, a biologically motivated version
of spaced seeds. It was first introduced and used in the YASS
program (Noe and Kucherov, 2005). In addition to 1’s for matches and *’s for don’t
cares, the transition-constrained seed contains the new character @, which stands for
either a match or a transition, that is, a substitution A ↔ G or C ↔ T. The biological
motivation for this is that transitions are more common than transversions, that
is, substitutions between a purine and a pyrimidine, A/G ↔ C/T. The seed used in YASS
is 1@1**11**1*11@1.
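A hit test for a transition-constrained seed only needs one extra case compared with an ordinary spaced seed. The following Python sketch uses a hypothetical function name and hypothetical example sequences; it is not taken from the YASS implementation.

```python
YASS_SEED = "1@1**11**1*11@1"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def tc_seed_matches(seed, a, b, pos=0):
    """True if the seed placed at offset pos accepts the pair (a, b)."""
    if pos < 0 or pos + len(seed) > min(len(a), len(b)):
        return False
    for k, c in enumerate(seed):
        u, v = a[pos + k], b[pos + k]
        if c == "1" and u != v:
            return False  # a match is required here
        if c == "@" and u != v and (u, v) not in TRANSITIONS:
            return False  # only a match or a transition is allowed here
    return True


a = "ACGTACGTACGTACG"
with_transition = "ATGTACGTACGTACG"    # C -> T at the first @ position
with_transversion = "AAGTACGTACGTACG"  # C -> A at the first @ position
print(tc_seed_matches(YASS_SEED, a, with_transition),
      tc_seed_matches(YASS_SEED, a, with_transversion))
```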

Employing several seeds together for homology search can greatly increase the

sensitivity. This set of several different seeds is called multiple spaced seeds. The

multiple spaced seeds were first used in PatternHunter II (Li et al., 2004) which uses

16 spaced seeds, each of which has 11 matches. When at least one of the seeds
in a set of multiple spaced seeds hits a random region R, we say that the multiple spaced
seeds hit R. Hence, the definition of sensitivity and also the dynamic programming

algorithm which computes it can be extended to the case of multiple spaced seeds.

2.2.3 Suffix tree and suffix array

Suffix trees and suffix arrays are two important data structures for text searching and

indexing that are widely used in computational biology and bioinformatics applica-

tions (Manber and Myers, 1990; Smyth, 2003; Puglisi et al., 2007).

For a given string x of length n, there are two special kinds of substrings x[i..j]
which are very important. For any integer j ∈ 0..n, we say x[1..j] is a prefix of x.
For any integer i ∈ 1..n+1, we say x[i..n] is a suffix of x. The suffix tree of a given
string x is a tree whose leaves denote the suffixes of x. In other words, each suffix
of x is represented by a path from the root to the corresponding leaf in the suffix
tree. The suffix array of a given string x is an array of integers that gives the
starting positions of the suffixes of string x sorted in lexicographical order. The
longest common prefix, denoted by LCP, of two strings is the longest string which is a
prefix of both. Its length, stored alongside the list of suffix indices, indicates how
many characters a particular suffix has in common with the suffix directly above it in
the sorted order, starting at the beginning of both suffixes. The LCP is useful in
making some string operations more efficient. For

example, it can be used to avoid comparing characters that are already known to be

the same when searching through the list of suffixes. Figures 2.14 and 2.15 represent

examples of suffix tree, suffix array and LCP array for the string:

index: 1 2 3 4 5 6 7 8 9 10 11

x : T C G T A A C G A C C



Figure 2.14: Suffix tree for x = TCGTAACGACC.

 i  SA[i]  Suffix_SA[i]  LCP[i]
 1    5    AACGACC         0
 2    9    ACC             1
 3    6    ACGACC          2
 4   11    C               0
 5   10    CC              1
 6    7    CGACC           1
 7    2    CGTAACGACC      2
 8    8    GACC            0
 9    3    GTAACGACC       1
10    4    TAACGACC        0
11    1    TCGTAACGACC     1

Figure 2.15: Suffix array for x = TCGTAACGACC.
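The suffix array and LCP array of Figure 2.15 can be reproduced with a naive construction that simply sorts the suffixes. This sketch is for illustration only (sorting full suffixes takes far more time and memory than the specialized construction algorithms cited above), and the function names are hypothetical. Positions are 1-based to match the figure.

```python
def suffix_array(x):
    """1-based starting positions of the suffixes of x in lexicographic order."""
    return sorted(range(1, len(x) + 1), key=lambda i: x[i - 1:])


def lcp_array(x, sa):
    """LCP[r] = length of the common prefix of suffix sa[r] and suffix sa[r-1]."""
    def common_prefix_length(u, v):
        k = 0
        while k < len(u) and k < len(v) and u[k] == v[k]:
            k += 1
        return k

    return [0] + [
        common_prefix_length(x[sa[r - 1] - 1:], x[sa[r] - 1:])
        for r in range(1, len(sa))
    ]


x = "TCGTAACGACC"
sa = suffix_array(x)
lcp = lcp_array(x, sa)
for rank, (start, length) in enumerate(zip(sa, lcp), start=1):
    print(rank, start, x[start - 1:], length)
```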


Chapter 3

Related work

Several studies have been performed on oligonucleotide probes, and many algorithms

and programs have been proposed for designing and selecting oligonucleotides in the

literature. In this chapter, we present the best available algorithms and programs for
designing oligonucleotides.

3.1 ArrayOligoSelector

ArrayOligoSelector (Bozdech et al., 2003) designs optimized oligonucleotide probes by

considering some parameters such as uniqueness in the genome, sequence complexity,

lack of self-binding, GC-content and proximity to the 3′ end, or the right end, of

the gene. It consists of two main steps; in the first step, the program computes

scores of uniqueness, sequence complexity, lack of self-binding and GC-content for

each candidate oligo. The details of scoring for the mentioned parameters are as

follows:

• Uniqueness: The binding free energy of a candidate oligo to its most homologous



sequence is considered as the uniqueness score. BLASTN, BLAT, or gfclient

are used to locate the most homologous sequence followed by a calculation of

the theoretical binding energy. The binding free energy is computed using the

nearest-neighbor model with the thermodynamic parameters from SantaLucia

(SantaLucia, 1998).

• Sequence complexity: The program employs the LZW compression algorithm

(Ziv and Lempel, 1977) to compute the sequence complexity score which is the

difference in bytes between the oligo sequence and its compressed version.

• Self-annealing: Using the Smith-Waterman local alignment algorithm (Smith

and Waterman, 1981), ArrayOligoSelector computes the alignment score of the

optimum local alignment between the oligo sequence and its reverse complement

as a measurement of the secondary structure created by the self annealing of

an oligo.

• GC-content: The score is computed as the GC percentage of the oligo sequence.

After the first step, the obtained scores are used in the second step to select the

oligos that are unique for the target sequences, have a low level of internal repeats, have
a low tendency for self-annealing, and are within a narrow range from the target GC

percentage which is specified by the user.
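Two of these scores are easy to illustrate. The Python sketch below computes the GC percentage and a compression-based complexity score as the number of bytes saved by compression. It is not ArrayOligoSelector's code: the function names and example oligos are hypothetical, and `zlib` (DEFLATE) from the Python standard library is used as a stand-in for the LZW compressor mentioned above.

```python
import zlib


def gc_percent(oligo):
    """Percentage of G and C bases in the oligo."""
    return 100.0 * sum(c in "GC" for c in oligo) / len(oligo)


def compression_score(oligo):
    """Bytes saved by compression; a larger value means a more repetitive oligo.

    Assumption: zlib replaces the LZW compressor described in the text.
    """
    raw = oligo.encode("ascii")
    return len(raw) - len(zlib.compress(raw, 9))


low_complexity = "A" * 70
mixed = "ATGCGTACCTGAAGTCTAGGCATCGATTACGCCATGGTAACTGCAGTCCGATAGCTTGACGGATCCATGA"
print(compression_score(low_complexity), compression_score(mixed))
print(gc_percent(mixed))
```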

3.2 GoArray

GoArray (Rimour et al., 2005) has been developed to overcome the problems of

classical approaches for designing oligonucleotides, such as the lack of adaptability to



complex biological systems. The essential factor regarding the adaptability is cross-

hybridization which could result in misinterpretation of biological results. In order to

compute the specificity of an oligonucleotide, GoArray follows the sequence similarity

and maximum consecutive match parameters as follows (Kane et al., 2000):

• The oligo must not have more than 75% similarity with a non-target sequence.

• The oligo must not have a stretch of more than 15 identical bases with a non-

target sequence.

The results of in silico tests revealed that, in a complex biological
system, long oligos, from 50 to 70 bases, do not have good specificity, while short oligos,
20 to 30 bases, have higher specificity and so seem to be better adapted, but perhaps
with low sensitivity. Hence, the algorithm creates the oligos by concatenating two
specific short sequences to achieve both specificity and sensitivity. To do so, GoArray:

• Reads each sequence from the 3′ end to find the first specific sequence

• Checks specificity of the sequence using BLAST. If the sequence shows an
alignment with 75% or more similarity or contains a stretch of 15 identical bases with
a non-target sequence, it is considered non-specific.

• Looks for the second specific sequence.

• Merges two sequences using randomly chosen bases, called linker, and checks

the specificity of the whole oligo.

• Checks for melting temperature, secondary structure, and prohibited sequences.
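The two specificity criteria can be sketched as a simple filter. The Python code below is only an approximation of the check described above: it replaces BLAST with an ungapped sliding comparison, the function names are hypothetical, and the thresholds follow the wording of the BLAST step (75% or more similarity, or a stretch of 15 identical bases).

```python
def longest_common_stretch(a, b):
    """Length of the longest substring shared by a and b (dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def max_ungapped_identity(oligo, seq):
    """Best fraction of identical bases over all ungapped placements of oligo on seq."""
    n = len(oligo)
    best = 0.0
    for start in range(len(seq) - n + 1):
        same = sum(oligo[k] == seq[start + k] for k in range(n))
        best = max(best, same / n)
    return best


def is_specific(oligo, non_targets, max_identity=0.75, max_stretch=15):
    """False as soon as one non-target violates either criterion."""
    for seq in non_targets:
        if max_ungapped_identity(oligo, seq) >= max_identity:
            return False
        if longest_common_stretch(oligo, seq) >= max_stretch:
            return False
    return True


oligo = "ACGT" * 5
close_non_target = "TTTT" + oligo[:15] + "GGGG"
distant_non_target = "G" * 40
print(is_specific(oligo, [distant_non_target]), is_specific(oligo, [close_non_target]))
```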



3.3 OligoArray

OligoArray (Rouillard et al., 2003) takes into account several criteria for designing

and selecting oligonucleotide probes. In addition to the percentage and length of
sequence similarities, the specificity of an oligo is assessed through the thermodynamic
properties of hybridization, which makes it possible to account for short regions with
high GC-content that result in stable cross-hybridization at temperatures
commonly used during hybridization. By allowing the sequence length to vary by one or
a few bases, OligoArray aims for a narrow melting temperature distribution rather than
a uniform oligo length, in order to achieve better uniformity during hybridization.
The outline of this algorithm is as follows:

• Sequences are masked for the existence of prohibited patterns, such as regions

of same bases (GGGGG, CCCCC, TTTTT, AAAAA) or di- and tri-nucleotide

repeats extending over more than 10 bases. All positions corresponding to these

prohibited regions are filtered and automatically masked.

• Melting temperature is calculated for each oligo candidate by applying the Nearest
Neighbour model with parameters from SantaLucia (1998), and the user specifies
the acceptable range.

• The oligo candidates are checked for the absence of strong secondary structures

at the hybridization temperature.

• The specificity of oligos is examined by considering all possible cross-hybridizations

of the oligo with similar sequences using BLAST and also by performing ther-

modynamic calculations. The oligo is considered to be specific for its target



sequence, if there is no possible cross-hybridization with melting temperature

above the specificity threshold set by the user.

• The GC-content range and the position in transcript are checked for the re-

maining oligo candidates.
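The masking step can be illustrated with regular expressions. The Python sketch below is an interpretation of the first item of the outline, not OligoArray's code: the names are hypothetical, masked positions are replaced by "N", and "extending over more than 10 bases" is read as at least six copies of a dinucleotide or at least four copies of a trinucleotide (12 bases in both cases).

```python
import re

HOMOPOLYMER = re.compile(r"A{5,}|C{5,}|G{5,}|T{5,}")
DI_REPEAT = re.compile(r"([ACGT]{2})\1{5,}")   # 6 or more copies, at least 12 bases
TRI_REPEAT = re.compile(r"([ACGT]{3})\1{3,}")  # 4 or more copies, at least 12 bases


def mask_prohibited(seq):
    """Replace every base inside a prohibited pattern by 'N'."""
    masked = list(seq)
    for pattern in (HOMOPOLYMER, DI_REPEAT, TRI_REPEAT):
        for found in pattern.finditer(seq):
            for k in range(found.start(), found.end()):
                masked[k] = "N"
    return "".join(masked)


print(mask_prohibited("ACGTAAAAAGT"))
print(mask_prohibited("GC" + "AT" * 6 + "GC"))
```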

3.4 OligoPicker

OligoPicker (Wang and Seed, 2003) helps in selecting oligo probes for each of the tar-

get DNA sequences given for microarray spotting. The algorithm takes into account

the following criteria for designing and selecting the oligos:

• Location in the sequence: adjacency to the 5′ (left) end or the 3′ (right) end,
according to random or oligo dT priming. In general, the preferred oligos for

optimal sensitivity in a random primed labeling will be placed as close to the

5′ end as possible.

• Melting temperature uniformity: OligoPicker calculates the Tm for all oligo

candidates using the following formula and then determines the median Tm for

oligos:

$$T_m = 64.9 + 41 \times \frac{\mathit{gcCount}}{\mathit{oligoLength}} - \frac{600}{\mathit{oligoLength}} \qquad (3.1)$$

where gcCount is the number of all G’s and C’s in an oligo. An oligo candidate
is ignored if its Tm is not within 5 °C of the median Tm.

• Probe accessibility: The probability of forming secondary structures is very

high in regions of significant self-complementarity. Hence, oligo candidates are



checked for homology to the complementary strand of their cognate sequences

using BLAST.

• Reduced cross-hybridization: Because contiguous base pairing is the single most

important determinant of cross-hybridization for OligoPicker, it considers the

rejection of contiguous sequence identity as the primary filter in the oligo selec-

tion scheme. To do so, OligoPicker uses a hash table to quickly search for the

common stretch between sequences.

• Evasion of non-coding RNA and low complexity regions: When total RNA is
used as the starting material, interference of RNA other than mRNA with array
hybridization may be a practical concern. To address this concern, sequence

regions similar to rRNA or snRNA (small nuclear RNA) are skipped during

the oligo selection procedure by using both contiguous base match screening

and BLAST. Low-complexity regions may also result in cross-hybridization so

they are identified by the DUST program (Hancock and Armstrong, 1994) and

avoided when selecting oligos.
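The melting temperature criterion of formula 3.1 and the median filter can be sketched as follows. The function names and the example oligos are hypothetical; only the formula and the 5 degree tolerance come from the description above.

```python
from statistics import median


def basic_tm(oligo):
    """Formula 3.1: Tm from the GC count and the oligo length."""
    gc_count = sum(c in "GC" for c in oligo)
    n = len(oligo)
    return 64.9 + 41.0 * gc_count / n - 600.0 / n


def filter_by_median_tm(oligos, tolerance=5.0):
    """Keep the oligos whose Tm lies within `tolerance` degrees of the median Tm."""
    tms = [basic_tm(o) for o in oligos]
    mid = median(tms)
    return [o for o, tm in zip(oligos, tms) if abs(tm - mid) <= tolerance]


at_rich = "AT" * 20                    # 40 bases, no G or C
balanced = "GC" * 10 + "AT" * 10       # 40 bases, 20 G or C
gc_richer = "GC" * 11 + "AT" * 9       # 40 bases, 22 G or C
kept = filter_by_median_tm([at_rich, balanced, gc_richer])
print([round(basic_tm(o), 2) for o in (at_rich, balanced, gc_richer)], len(kept))
```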

3.5 OligoWiz

OligoWiz (Nielsen et al., 2003) defines some parameters for designing and selecting

oligos and for each parameter it calculates a score between 0 and 1. The final score of
an oligo candidate is the weighted sum of its parameter scores, which is between
0 and 1 as well. The set of parameters that are considered in OligoWiz is:

• Cross-hybridization: Assessment of the cross-hybridization degree is carried out

by calculating the homology score for each oligo sequence using BLAST.



• Melting temperature difference: It is necessary that all oligos perform well
under similar hybridization conditions. An effective parameter of an oligo related
to the hybridization property is the melting temperature, and a minimal
difference between the Tm of all oligos is preferred. OligoWiz employs the
Nearest Neighbour model with parameters from SantaLucia and uses the
following formula for Tm calculation:

$$T_m = \frac{1000\,\Delta H}{A + \Delta S + R \ln(C_T/4)} + 16.6 \log_{10}[\mathrm{Na}^+] - 273.15 \qquad (3.2)$$

where ∆H (kcal/mol) is the total energy exchange between the system and
its surrounding environment, A is a constant correcting for helix initiation,
∆S (cal/(mol·K)) denotes the energy spent by the system to organize itself, R
(cal/(mol·K)) is the ideal gas constant 1.987, C_T is the molar concentration of
the oligo, and [Na+] is the molar concentration of salt.

• Position within transcript: A score would be assigned to each oligo based on its

position in the target sequence. For example, oligos with target positions closer

to the starting point of reverse transcriptase are more desired.

• Low-complexity filtering: A low-complexity score is assigned to skip oligos com-

posed of very common regions in the oligo design process. To estimate the

low-complexity measure for an oligo, a list of sequence subregions with related

information content is generated for each species. Oligos that contain those par-

ticular regions are considered as low-complexity oligos and would be assigned a

low score.

• GATC-only score: Each oligo sequence that contains bases different from A, C,



G, and T will be given score 0.

3.6 PICKY

PICKY (Chou et al., 2004) has been proposed to design the computationally opti-

mized oligo probes. To do so, PICKY:

• Uses a generalized suffix array for both the sequence and its complement that

allows quick identification of repetitive, low complexity, self-similar and self-

complementary regions in the oligo sequence.

• Utilizes the suffix array to check the maximum consecutive match parameter,
i.e. to make sure that no oligo sequence has a stretch equal to or longer than
the maximum match length of 15 bases in common with other sequences. This

is performed by sweeping across the suffix array and checking two neighbours

of each suffix to detect all regions on all sequences that must be omitted.

• Identifies all other sequences that are similar to each oligo sequence. Again,
using the suffix array, similar non-target sequences can be quickly identified
and their melting temperatures with oligo candidates can then be calculated.
PICKY calculates the melting temperature by the following equation:

$$T_m = \frac{\Delta H}{\Delta S + R \ln(C/4)} + 12.0 \times \log_{10}[\mathrm{Na}^+] - 273.15 \qquad (3.3)$$

First, the melting temperature of an oligo candidate with its target sequence is
computed. Then, to avoid imperfectly matched cross-hybridization, the melting
temperatures between each oligo candidate and all potential non-targets are
calculated. Using the suffix array, PICKY identifies and aligns an oligo candidate
with all non-targets whose similarity reaches the level set by the sequence
similarity parameter, and oligo candidates that violate this parameter are skipped.

• Compares target and non-target melting temperatures of all oligo candidates to

identify a subset that can specify each gene, has the minimum chance for cross-

hybridization, has a uniform range for melting temperature, and maximizes the

distance between the lowest target and the highest non-target melting temper-

atures of the chosen set.
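The sweep over a generalized suffix array that detects stretches shared between sequences can be sketched as follows. This is a naive illustration, not PICKY's implementation: the function name and example sequences are hypothetical, full suffixes are stored and sorted (which is far too costly for real genomes), and only the given strands are indexed, without their complements. Consecutive suffixes that agree on their first k characters form a block, and a block containing suffixes from more than one sequence marks a shared stretch of length at least k.

```python
def shared_stretches(seqs, k):
    """Set of (sequence index, start) for every k-mer found in two or more sequences."""
    suffixes = sorted(
        (s[i:], sid, i) for sid, s in enumerate(seqs) for i in range(len(s) - k + 1)
    )
    marked = set()
    start = 0
    for end in range(1, len(suffixes) + 1):
        # A block ends when the next suffix no longer shares the first k characters.
        if end == len(suffixes) or suffixes[end][0][:k] != suffixes[start][0][:k]:
            block = suffixes[start:end]
            if len({sid for _, sid, _ in block}) > 1:
                marked.update((sid, i) for _, sid, i in block)
            start = end
    return marked


seqs = ["AAACGTACGTTT", "GGGCGTACGCCC"]
shared = shared_stretches(seqs, 6)
print(sorted(shared))
```

With the maximum match length of 15 bases mentioned above, the same function would be called with k = 15.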

3.7 ProbeSel

ProbeSel (Kaderali and Schliep, 2002) considers oligos which are perfectly comple-

mentary to their target sequences and are unique up to k mismatches in order to

design oligos that bind specifically to the target sequences. All steps of the algorithm

are summarized as follows:

• Creates a generalized suffix tree of all target sequences and their reverse com-

plement which helps to identify non-unique oligos.

• Removes oligo candidates that don’t satisfy the pre-specified length.

• Eliminates oligo candidates that are similar to non-target sequences more than

the allowed threshold.



• Skips oligo candidates that hybridize with their target at a melting temperature
lower than the predefined threshold.

• Aligns the remaining oligo candidates with their complementary targets and

calculates the melting temperature.

• Picks the final temperature T and one oligo for each of the target sequences. T is
a temperature threshold: the melting temperature of every oligo with its
intended target must be higher than T, whereas the melting temperatures of all
undesired cross-hybridizations must be lower than T.

The first two undesired oligo elimination steps significantly decrease the number of

alignments that must be performed between oligos and targets.

3.8 ProbeSelect

ProbeSelect (Li and Stormo, 2001) concentrates on the specificity of input sequences
for designing the optimal set of oligonucleotide probes. First, a set of oligo candidates is
created from oligos that maximize the minimum number of mismatches to every other
gene in the genome. Then, it selects the optimal oligonucleotides from this
set. An oligo is picked as optimal when it hybridizes with the target sequence within
an acceptable range of hybridization free energy and maximizes the difference in free
energy with non-target sequences. The outline of ProbeSelect is as
follows:

• Creates a suffix array of the coding DNA sequences of a genome from an organ-

ism.



• Generates a “landscape” for each gene using the suffix array. A landscape of a

sequence includes the occurrence of all words of that sequence. In this case, the

reason for using the landscape is to identify low frequency words in the rest of

the genome that form the set of unique oligos.

• Selects oligo candidates using the results from testing many genes. Oligos with

the lowest frequencies at the level of subword lengths are those that occur

least in the rest of the genome, even with a few mismatches. Based on this

evidence, the algorithm considers all the subword frequencies for each oligo and
selects the ones with the lowest values, which are therefore expected to have the fewest
approximate matches elsewhere.

• Looks for matching sequences in the whole genome. The Myers algorithm (My-

ers, 1999) is applied to identify all match positions for the candidate oligos in

the coding sequences of the genome with four or fewer mismatches for short

oligos, 10 or fewer for oligos of length 50 bases, and 20 or fewer for oligos of

length 70 bases.

• Locates match sequence positions in all genes. A binary search in a sorted array

of all sequences gives the positions of the matches in a sequence.

• Calculates the free energy and melting temperature for each oligo and its target

sequence. The free energy is computed using the alignment of the oligo sequence

and its target because it is usually the lowest energy structure.

• Picks up the oligos that have the most stable hybridization with their target

sequence, thus resulting in good discrimination from potential non-targets.



3.9 ProDesign

ProDesign (Feng and Tillier, 2007) utilizes spaced seeds for designing suitable oligos

which allows the inclusion of more mismatches between an oligo sequence and its tar-

get sequence. The algorithm can design both gene-specific and group-specific oligos.

The algorithm consists of six steps explained below:

• Creates a hash table for each spaced seed. The hash table consists of related
words of each seed where every 1 in the seed is replaced by any possible
nucleotide A, C, G and T encoded by two bits, 00, 01, 10, and 11 respectively.
The size of each hash table is 4^w where w is the weight, the number of matches,
of the seed.

• Checks each word in the hash table to see whether that word is specific to a
group of sequences or not. The specificity of the word is obtained by counting
its occurrences in the group sequences. Let h_{i,j} mark the occurrence of the word
x_i in the sequence j; it equals one if the word occurs in the sequence and zero
otherwise.

• Computes the clusters of words. Let H_{i,k} mark the occurrence of the word x_i
in the group k. If the value of h_{i,j} is 1 for all the sequences j in the group, then
H_{i,k} is 1. For each word in the hash table a list of groups with H_{i,k} = 1 is built.
A word is specific to a group if H_{i,k} = 1 only for that group. Some groups have
one or even more specific words and some groups may not have a word. The
latter groups are considered later to be reclustered into other groups.

• Considers the specific words for finding oligo candidates for the group. The user

can specify the length of the required oligos. If the group-specific word satisfies



the length threshold it will be returned as the group oligo. Otherwise, two or

more words need to be joined together to provide the required length. In this

case, ProDesign starts with selecting a random sequence in each group. Then,

the position of the specific word is found by scanning the sequence. The scan

is extended forward to find the second specific word. To be joined, two words must either overlap or be separated by a gap of fewer than 3 bases. The joined word extends from the first character of the first word to the last character of the second word.

• Selects the modified word as the final oligo for each group only if it is specific

for all sequences of the group.

• Checks that all oligo candidates are free of low-complexity regions and satisfy the melting temperature and %GC content requirements.
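The word-specificity steps of ProDesign described above can be illustrated with a minimal sketch. It is not ProDesign's implementation: the function names, the use of Python sets in place of the 2-bit hash tables, and the toy groups are assumptions made for illustration. A word has H = 1 for a group when it occurs in every sequence of that group, and it is specific when H = 1 for that group only.

```python
def seed_words(seq, seed):
    """Return the set of words read from seq at the 1 positions of the seed."""
    ones = [k for k, c in enumerate(seed) if c == "1"]
    words = set()
    for start in range(len(seq) - len(seed) + 1):
        words.add("".join(seq[start + k] for k in ones))
    return words


def group_specific_words(groups, seed):
    """Map each group name to the words that have H = 1 for that group only.

    `groups` maps a (hypothetical) group name to its list of sequences.
    """
    # H = 1 for a word and a group when the word occurs in every sequence
    # of the group, i.e. h = 1 for all sequences j of the group.
    common = {
        name: set.intersection(*(seed_words(s, seed) for s in seqs))
        for name, seqs in groups.items()
    }
    specific = {}
    for name, words in common.items():
        others = set().union(*(w for n, w in common.items() if n != name))
        specific[name] = words - others
    return specific


# Toy input, invented for this example.
groups = {"g1": ["ACGTAC", "TACGTA"], "g2": ["GGGTTT", "GGTTTG"]}
specific = group_specific_words(groups, "1*1")
```

With the toy input, the words shared by both sequences of g2 are GT and TT, and neither of them has H = 1 for g1, so both are specific to g2.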

3.10 ProMide

ProMide (Rahmann, 2003) is a specific and fast algorithm to design and select short

oligonucleotides of length up to 30 bases. The algorithm can design oligos for large

data sets in a reasonable time. ProMide considers two parameters related to the lack

of specificity to design specific probes:

• The longest common factor lcf which is the longest common region that appears

in both an oligo and a sequence.

• The longest common factor with one mismatch lcf1 which is the longest common

region that appears in both an oligo and a sequence with at most one mismatch.



The idea behind this approach is that lcf can approximate the lack of specificity of an oligo better than the number of mismatches between the oligo and the non-target sequences against which its specificity is checked. A long lcf between the oligo and a non-target indicates a higher chance of undesired cross-hybridization. However, this parameter may be too optimistic at times. For example, it does not exclude oligos whose lcf is short even though they share a long subsequence with non-targets once one mismatch is allowed. Therefore, lcf1 is also taken into account to solve this issue.

The algorithm first selects a set of oligo candidates that fulfill the specified melting

temperature or length range. Then, the oligos in the set will be ranked by compar-

ing their lcf and lcf1 vectors which include the length of the factors between each

oligo and all non-target sequences, and incorporate some additional sequence-specific

restrictions. Using memory-efficient enhanced suffix arrays, the lcf and lcf1 vectors

can be computed very fast.
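To make the two parameters concrete, the following sketch computes lcf and lcf1 for one oligo and one sequence. It deliberately swaps in a simple technique: a sliding window along each alignment diagonal, which takes time proportional to the product of the two lengths, instead of the enhanced suffix arrays that ProMide uses. The function name and the example strings are assumptions, and a mismatch at the border of a window is counted as part of the factor, which may differ from ProMide's exact definition.

```python
def longest_common_factor(oligo, target, max_mismatches=0):
    """Length of the longest region common to both strings, allowing at
    most `max_mismatches` mismatching positions inside the region."""
    best = 0
    n, m = len(oligo), len(target)
    for shift in range(-(n - 1), m):
        # On this diagonal, oligo[i] is aligned with target[i + shift].
        i_lo = max(0, -shift)
        i_hi = min(n, m - shift)
        left = i_lo
        mismatches = 0
        for right in range(i_lo, i_hi):
            if oligo[right] != target[right + shift]:
                mismatches += 1
            # Shrink the window from the left until it is valid again.
            while mismatches > max_mismatches:
                if oligo[left] != target[left + shift]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best


oligo = "ACGTACGT"
target = "TTACGAACGTTT"
lcf = longest_common_factor(oligo, target)       # exact factor
lcf1 = longest_common_factor(oligo, target, 1)   # one mismatch allowed
```

In this example lcf is 4 while lcf1 is 8: the whole oligo aligns with the sequence at the cost of a single mismatch, which is exactly the situation that lcf alone would miss.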

3.11 ROSO

ROSO (Reymond et al., 2004) employs BLAST to compute the specificity. Because specificity analysis in the design phase using BLAST is independent of the oligo selection phase, the user can run the time-consuming BLAST step separately in order to select the oligos efficiently. The algorithm proceeds in the following steps:

• Filters the input sequences by eliminating identical genes, repetitions of bases,

and degenerated bases.

• Checks for potential cross-hybridizations using BLAST.



• Removes oligonucleotide probes that form a stable secondary structure.

• Calculates the melting temperature of each oligo candidate and chooses a set of oligos in a way that minimizes the variability of the melting temperature.

• Picks the optimal set of oligos according to the GC-content rate, the first and the last bases (preferably a G or a C), the number of repetitions, and the hybridization free energy.

3.12 YODA

YODA (Nordberg, 2004) designs and verifies oligos in several steps of increasing

computational intensity, with undesired candidate oligos being eliminated at each

step. The steps are identifying average melting temperature and GC-content, finding

prohibited sequences and contiguous stretches of identities, computing Tm, looking

for potential secondary structure, searching for potential dimerization, and checking

the similarity of oligo candidates to non-targets. The outline of the algorithm is as

follows:

• Identifying average Tm and GC-content: The algorithm calculates the Tm by

employing the NN model with the parameters from SantaLucia (SantaLucia,

1998). Then, it computes the average Tm for all oligos of the specified length.

The user determines the range of acceptable Tm’s. Also, YODA computes the

average GC-content for all oligo candidates of the specified length.

• Prohibited sequences and contiguous identities: Oligo candidates are checked

for the existence of any prohibited region or any long stretch of consecutive



identities with non-target sequences. Also, the user can determine prohibited

sequences, such as Poly-X (e.g. AAAAA or TTTTT) sequences, to be avoided

in oligos.

• Filter for Tm: All oligo candidates passing the previous steps are checked for Tm. A candidate is rejected if its Tm is not in the range specified by the user.

• Filter for secondary structure: Oligo candidates are examined for potential stem-loop structures by searching for short stretches of complementary sequences, called stems, that are separated by a few bases (the loop). Remaining candidates for each sequence are stored in a temporary file for later sorting and validation.

• Sort and validate candidates: Each target sequence can have several oligo candi-

dates. Picking up the final oligo, or set of oligos, for each sequence is performed

by some Probe Sorter procedures. Oligo candidates picked by a Probe Sorter are

considered for final validation, which checks the oligo for potential dimerization

and sequence similarity to non-target sequences.


Chapter 4

Our proposed algorithm

In this chapter, we introduce and explain all details of our proposed algorithm for

designing and selecting the best set of oligonucleotide probes for a given set of se-

quences.

4.1 Motivation

As we have seen in the previous chapter, several algorithms and programs exist for oligo design; they differ in the parameters and criteria they consider and in how they identify and select the optimal oligonucleotides. To design the optimum set of oligonucleotides, experts in this area (Lockhart et al., 1996; Kane et al., 2000) have identified several criteria and parameters related to the specificity, sensitivity and uniformity of oligo probes: similarity of an oligonucleotide to non-target sequences, identity or maximum match of an oligonucleotide to non-target sequences, low-complexity regions, GC-content, secondary structure, and melting temperature.

The first two parameters chiefly concern cross-hybridization evaluation and are considered the most important of the suggested criteria for designing the optimum specific set of oligonucleotides. Of the above parameters and criteria, maximum consecutive match, sequence identity, GC-content, and low-complexity regions are related to the specificity of oligos, secondary structure is related to their sensitivity, and the melting temperature is related to the uniformity of the set of oligos. Table 4.1 summarizes how each algorithm deals with the parameters involved in the oligo design task.

Table 4.1: Summary of the ways that related algorithms assess the parameters involved in oligo design.

| Algorithm | Cross-hybridization | GC-content | Tm | Low-complexity | Secondary structure |
|---|---|---|---|---|---|
| ArrayOligo | Thermodynamics and BLAST | User-defined | - | LZW | Self-complementarity |
| GoArray | BLAST | - | NN | Prohibited sequence | Mfold |
| OligoArray | Thermodynamics and BLAST | User-defined | NN | Masking repeats | Mfold |
| OligoPicker | BLAST | - | G+C | DUST | Self-complementarity |
| OligoWiz | Thermodynamics and BLAST | - | NN | Custom scoring | - |
| PICKY | Suffix array and thermodynamics | User-defined | NN | Suffix array | Suffix array |
| ProbeSel | Suffix array and thermodynamics | - | NN | - | Mfold |
| ProbeSelect | Suffix array and thermodynamics | - | NN | Masking repeats | Self-complementarity |
| ProDesign | Homology search based on seeds | User-defined | NN | - | Mfold |
| ProMide | Suffix array and thermodynamics | - | NN | - | - |
| ROSO | BLAST | User-defined | NN | Masking repeats | Thermodynamics |
| YODA | Custom method | User-defined | NN | Prohibited sequence | Self-complementarity |



We are interested in designing and selecting the set of oligos so that each oligo is

specific to its target sequence to achieve the highest specificity, each oligo is sensitive

in order to detect the target sequence to reach the highest sensitivity, and the set

of oligos is uniform under the same experimental conditions to obtain the maximum

uniformity.

As shown in Table 4.1, most algorithms rely heavily on BLAST for cross-hybridization assessment, and some others rely on suffix arrays and suffix trees. BLAST uses the simple strategy of finding short consecutive seed hits, which are then extended into longer local alignments. This way of similarity search has a key tradeoff: increasing the size of seeds decreases sensitivity, while decreasing the size of seeds increases the running time. Therefore, the BLAST-based algorithms have limited sensitivity for cross-hybridization assessment because they use consecutive seeds. On the other hand, suffix trees and suffix arrays, as mentioned before, are well-known data structures for exact pattern matching and text searching, but they do not work well for approximate text searching. Therefore, algorithms and programs that employ suffix arrays or suffix trees at the heart of their similarity search for cross-hybridization assessment are neither efficient nor sensitive for this task.

Due to the limitations of current algorithms and programs and according to the

above discussion, there is a clear need for designing an efficient algorithm that is very

accurate in terms of specificity, sensitivity, and uniformity as well as running very

fast on large input data sets. This need is the most important motivation for us to

perform this research work.



4.2 General description of the problem

The general description of the problem is illustrated in Figure 4.1.

[Figure 4.1: flow diagram. Input: DNA Sequences, given as FASTA records (a title line such as ">ref|NC_002655.2|:190-273" followed by the sequence) → Oligo Design Algorithm → Output: Best Set of Oligos, one record per oligo (for example "50, 81.42, ref|NC_002655.2|:1081779-1082243" followed by the oligo sequence TCCACCCGCCGTATTGCGGCAAGTCGTATTCCGGCTGATCACATGGTGCT).]

Figure 4.1: General description of the problem.

As input, we have a set of DNA or RNA sequences, generally in FASTA format. This sequence format carries the minimum amount of information. In a FASTA file, a ‘>’ sign at the beginning of a line denotes the start of a new sequence and is followed by a phrase that specifies the sequence title (see Figure 4.1). The sequence itself follows immediately. No other information is stored within a FASTA file. The output of the problem is the best set of oligonucleotide probes, which can be in any format, including the FASTA format of the input sequences. The goal is to design an efficient algorithm for extracting from the input DNA/RNA sequences the best set of oligos, that is, the set with the maximum specificity, sensitivity, and uniformity according to the mentioned parameters and criteria.



4.3 The outline of our algorithm

The proposed algorithm starts by encoding the input sequences. To reach the maximum specificity for the designed oligo probes, it then considers both the sequence identity percentage and the maximum consecutive match parameters for cross-hybridization assessment, as well as the GC-content. To perform the homology search for cross-hybridization, our algorithm looks for similar regions in the input sequences using multiple spaced seeds and hashing. Then, GC-content evaluation is performed to complete the initial specificity checks. Next, uniformity is obtained through melting temperature management. After that, the sensitivity of the oligo probes is evaluated by self-annealing and secondary structure assessment. At the end, an intensive homology search is performed to reach the maximum specificity. The outline of our proposed algorithm for designing and selecting the best set of oligo probes is as follows:

• Encoding the input sequences

• Light cross-hybridization assessment

• GC-content evaluation

• Melting temperature management

• Secondary structure assessment

• Intensive cross-hybridization assessment

• Final oligo selection

Each step of the algorithm will be explained in detail in the following subsections.



4.4 Encoding the input sequences

We denote the set of input sequences by $G = \{g_1, g_2, \ldots, g_k\}$, where $g_i$ denotes the $i$th sequence and is a string over the alphabet $\Sigma = \{A, C, G, T\}$. After reading the set of input sequences, all sequences are merged into a single vector $V = g_1 g_2 \ldots g_k$. To encode the input vector of all sequences, we use the following table to represent each nucleotide base:

Table 4.2: Encoding of input DNA sequences

| Nucleotide base | Coded representation |
|---|---|
| A | 00 |
| C | 01 |
| G | 10 |
| T | 11 |

Using the above coding, the amount of memory is reduced to a quarter of the original size, because the coded representation requires only two bits per base, while the regular representation uses one byte per nucleotide base. This is very helpful when dealing with large genomes such as the human genome, whose size is about 3 GB in the normal representation and 0.75 GB in the coded representation.

Another advantage of this encoding is utilizing the bit-parallel nature of computer

words. In other words, we can consider the vector of input sequences as blocks

of computer words of 32 or 64 bits that correspond to 16 or 32 nucleotide bases

respectively. Therefore, comparing bases one computer word at a time rather than one base at a time is faster by a factor of W/2, where W is the length of the computer word in bits.
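The encoding of Table 4.2 and the word-level comparison can be sketched as follows. This is only an illustration: it packs a sequence into a single arbitrary-precision Python integer rather than into an array of 32-bit or 64-bit words, so it shows the bit manipulation but not the memory saving, and the function names are assumptions.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # Table 4.2
BASES = "ACGT"


def encode(seq):
    """Pack a DNA string into one integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def decode(value, length):
    """Recover the DNA string of the given length from its packed form."""
    return "".join(
        BASES[(value >> (2 * (length - 1 - k))) & 0b11] for k in range(length)
    )


def count_mismatches(x, y, length):
    """Count differing bases between two packed sequences of equal length."""
    diff = x ^ y                           # non-zero bit pair = mismatching base
    mask = int("01" * length, 2)           # selects the low bit of every pair
    folded = (diff | (diff >> 1)) & mask   # one set bit per mismatching base
    return bin(folded).count("1")


a = encode("ACGTACGT")
b = encode("ACCTACGA")
```

A single XOR compares all packed bases at once, which is the source of the W/2 speed-up mentioned above when the operands are machine words.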



4.5 Cross-hybridization assessment

Generally, the cross-hybridization assessment is carried out by performing a homology

search. The goal of a homology search is to find the similar regions between sequences.

As we have seen in the second chapter, there are several ways to do a homology search

such as: applying dynamic programming approaches, e.g. Smith-Waterman (Smith

and Waterman, 1981), using heuristic approaches, e.g. FASTA (Lipman and Pearson,

1985) and BLAST (Altschul et al., 1990), or utilizing text searching and indexing data

structures, e.g. suffix trees and suffix arrays.

Because dynamic programming approaches to homology search have quadratic time complexity, they are slow when very long sequences, such as biological sequences, must be compared. Exact occurrences of a pattern can be found in time linear in the size of the sequence or text, but for long sequences even this is too much, so a search time linear in the length of the pattern is desired. To achieve this, an index such as a suffix tree or suffix array can be built on the long sequence or text. Constructing this index takes time linear in the sequence size, but the index can be reused many times, which provides very fast searching throughout the indexed sequence.

Usually, in biological applications we are interested in looking for approximate

matches between genomic or proteomic sequences rather than finding exact matches.

This is because the biological mechanisms are made in a way to allow for errors so

our search algorithms should tolerate the errors or mismatches in order to provide

meaningful results. However, there are no approximate indexes that are time and

space efficient as in the exact case. Therefore, suffix trees and suffix arrays may not

be appropriate choices for homology search.



Several techniques have been proposed to handle approximate indexes. Using

multiple spaced seeds is the best-known approach to construct indexes of texts and

sequences. Here, the indexes are hash tables constructed for multiple spaced seeds

that are explained in detail below and that will be considered as the heart of the

homology search procedure for oligo probe design in our proposed method.

Now, we explain in detail why multiple spaced seeds are so useful for approximate sequence searching. Consider a spaced seed s of length l and weight w, which is the number of 1’s. Also, consider a random region R of length N and similarity level p, as in the definition of sensitivity. The expected number of hits that s has in R is $(N - l + 1)p^w$, since there are N − l + 1 places where s can hit and each has probability $p^w$. Now, if the weight of the spaced seed is increased by 1, then the expected number of hits is multiplied by p. Assuming that the four bases A, C, G, and T appear with equal probability, so that two random bases match with probability p = 1/4, the number of expected hits is only a quarter of the previous one. Fewer hits mean fewer wasted ones, that is, fewer false positives and therefore increased specificity. However, increasing the weight of a seed also decreases the true positives, and therefore the sensitivity. In order to increase both, we can increase not only the weight but also the number of seeds. It turns out that doubling the number of seeds provides slightly better sensitivity. But doubling the number of seeds only increases the expected hits by a factor of two, whereas increasing the weight by one reduces them to a quarter. Essentially, this is the main reason why multiple spaced seeds are so good. It should be mentioned that a higher number of seeds requires more memory to store more hash tables, which imposes an upper bound on the number of seeds that can be used.
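The arithmetic behind this argument can be checked with a few lines. The sketch below assumes, for simplicity, that all seeds in a set have the same length and weight, so that the expected number of hits of the set is the single-seed value times the number of seeds; the function name and the numeric values of N, l and w are chosen only for illustration.

```python
def expected_hits(region_len, seed_len, weight, p, num_seeds=1):
    """Expected number of hits, (N - l + 1) * p**w per seed, summed over
    num_seeds seeds of identical length and weight."""
    return num_seeds * (region_len - seed_len + 1) * p ** weight


N, l, w, p = 64, 18, 11, 0.25   # illustrative values, p = 1/4 as in the text

one_seed = expected_hits(N, l, w, p)
heavier_seed = expected_hits(N, l, w + 1, p)           # weight increased by one
two_heavier = expected_hits(N, l, w + 1, p, num_seeds=2)
```

Here `heavier_seed` is a quarter of `one_seed`, and `two_heavier` is half of `one_seed`: doubling the number of seeds while adding one to the weight still halves the expected number of random hits.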



4.5.1 Multiple spaced seeds for homology search

To perform an efficient homology search, we require good multiple spaced seeds, but computing them is a hard problem. There are several algorithms that compute good multiple spaced seeds, but the only fast algorithm available is that of Ilie et al. (2011) and Ilie and Ilie (2007). Moreover, this algorithm computes the most sensitive multiple spaced seeds.

Now, we explain in detail the mentioned algorithm. Finding optimal multiple

spaced seeds is NP-hard but even finding good ones is very difficult. Computing

optimal multiple spaced seeds by exhaustive search is infeasible because of two expo-

nential steps in this process:

• There exist exponentially many spaced seeds to be checked and evaluated based

on their sensitivity

• Computing the sensitivity of each spaced seed has exponential time complexity

as well.

In order to compute multiple spaced seeds in an acceptable amount of time, some algorithms have been proposed that handle the exponential nature of the steps either by reducing the number of spaced seeds to be evaluated or by approximating the sensitivity of each spaced seed. However, the only available polynomial-time algorithm is that of Ilie et al. (2011) and Ilie and Ilie (2007), and the difference in running time has remarkable consequences in practice: where other algorithms take several days to compute multiple spaced seeds, the Ilies’ algorithm takes only several seconds. In addition, the quality of the generated multiple spaced seeds is not affected by this reduction in time. On the contrary, the obtained multiple spaced seeds have higher sensitivity than those of all the other algorithms. Hence, not only was it the best choice for our homology search, but it was also the only way to compute the high number of large sets of seeds required for the cross-hybridization evaluations.

4.5.2 Overlap complexity

The algorithm considers a new concept called overlap complexity (OC) measure. This

measure is well correlated with sensitivity but is much easier to compute and takes

polynomial time instead of the exponential time required for sensitivity. Therefore,

instead of sensitivity, it can be employed in computations. Using OC, a polyno-

mial time algorithm can be obtained to compute multiple spaced seeds. The idea

behind the overlap complexity is as follows. Some spaced seeds may have more over-

lapped hits than others even though the number of expected hits of spaced seeds

of the same weight is the same. For instance, consider the following spaced seeds: “11111111111” and “111*1**1*1**11*111”. As we see in Figure 4.2, if 11111111111 has a hit, then it will have another one shifted by one position if the next position is a match, which has probability p, the similarity level. On the other hand, the spaced seed 111*1**1*1**11*111 needs six new matches in order to have the same additional hit. This occurs with probability $p^6$, which is noticeably lower. In general, the difference is even larger because of the higher weight.

What this means is that the hits of the seed that overlaps less will be more uniformly distributed and hence able to hit more alignments, which means higher sensitivity. In other words, high sensitivity is obtained by a low number of overlapping hits.

a.
CGTCAAGACTT?
|||||||||||?
CGTCAAGACTT?
11111111111
 11111111111

b.
GAG?C??T?G??AC?TTC?
|||?|??|?|??||?|||?
GAG?C??T?G??AC?TTC?
111*1**1*1**11*111
 111*1**1*1**11*111

Figure 4.2: Possibility of overlapping hits for a. consecutive seed. b. spaced seed.

The complexity measure introduced to replace sensitivity, overlap complexity (OC), is therefore defined as follows. Suppose we have two spaced seeds $s_1$ and $s_2$. The overlap complexity of the two spaced seeds, denoted by $OC(s_1, s_2)$, is formally defined as:

$$OC(s_1, s_2) = \sum_{i=1-|s_2|}^{|s_1|-1} 2^{\sigma[i]} \tag{4.1}$$

where $\sigma[i]$ denotes the number of pairs of matched 1’s between $s_1$ and $s_2$ shifted by $i$ positions. The values of $i$ range from $1 - |s_2|$ to $|s_1| - 1$, where a negative shift indicates that $s_2$ starts first. To compute the value of $\sigma[i]$, two variables $t_1$ and $t_{2,i}$ are defined as follows:

$$\begin{aligned} t_1 &= *^{|s_2|-1}\, s_1\, *^{|s_2|-1} \\ t_{2,i} &= *^{|s_2|-1+i}\, s_2\, *^{|s_1|-i-1}, \quad \text{for } 1-|s_2| \le i \le |s_1|-1 \end{aligned} \tag{4.2}$$

Then,

$$\sigma[i] = \operatorname{card}\{\, 1 \le j \le |s_1| + 2|s_2| - 2 \;:\; t_1[j] = t_{2,i}[j] = 1 \,\} \tag{4.3}$$

To understand the overlap complexity of two spaced seeds, consider the following example. For the two spaced seeds $s_1$ = 1*11 and $s_2$ = 1*1, we have $t_1$ = **1*11** and $t_{2,i} = *^{2+i}\, 1{*}1\, *^{3-i}$ for $-2 \le i \le 3$. Figure 4.3 illustrates the alignment of $s_1$ against copies of $s_2$ shifted by $i$ positions, where $i$ lies in the mentioned range.

                      σ[i]
t1:      **1*11**
t2,-2:   1*1*****     1
t2,-1:   *1*1****     0
t2,0:    **1*1***     2
t2,1:    ***1*1**     1
t2,2:    ****1*1*     1
t2,3:    *****1*1     1

Figure 4.3: Example of computing overlap complexity for two spaced seeds.

Therefore, $OC(1{*}11, 1{*}1) = \sum_{i=-2}^{3} 2^{\sigma[i]} = 13$. The OC is symmetric, which means $OC(s_1, s_2) = OC(s_2, s_1)$ for any pair of spaced seeds $s_1$ and $s_2$. For a set of multiple spaced seeds $S = \{s_1, s_2, \ldots, s_k\}$ the overlap complexity is defined as:

$$OC(S) = \sum_{1 \le i \le j \le k} OC(s_i, s_j) \tag{4.4}$$

where OC(S) is the sum of the overlap complexities of each two seeds in the set.

Ilie and Ilie (2007) show that overlap complexity is experimentally very well correlated with sensitivity for single spaced seeds. For multiple spaced seeds,

comparing sensitivity and overlap complexity cannot be performed because exhaustive

search is infeasible in this case.
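The definitions (4.1) to (4.4) translate directly into a short program. The sketch below is a straightforward reading of the formulas, not the implementation of Ilie and Ilie; seeds are written as strings over 1 and *, and the function names are assumptions. Instead of building the padded strings of (4.2), it aligns the two seeds by index, which counts the same pairs of 1’s.

```python
def sigma(s1, s2, shift):
    """Number of positions where s1 and s2 shifted by `shift` both have a 1."""
    count = 0
    for j, c in enumerate(s2):
        k = j + shift  # position of s1 aligned with s2[j]
        if c == "1" and 0 <= k < len(s1) and s1[k] == "1":
            count += 1
    return count


def overlap_complexity(s1, s2):
    """OC(s1, s2) as in (4.1): sum of 2**sigma[i] over all shifts i."""
    return sum(2 ** sigma(s1, s2, i) for i in range(1 - len(s2), len(s1)))


def overlap_complexity_set(seeds):
    """OC(S) as in (4.4): sum over all pairs with i <= j."""
    k = len(seeds)
    return sum(
        overlap_complexity(seeds[i], seeds[j])
        for i in range(k)
        for j in range(i, k)
    )


example = overlap_complexity("1*11", "1*1")
```

For the seeds of Figure 4.3 the shifts −2 to 3 give σ values 1, 0, 2, 1, 1, 1, so `example` equals 2 + 1 + 4 + 2 + 2 + 2 = 13, in agreement with the text.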

The algorithm for computing the set of optimal multiple spaced seeds based on the concept of overlap complexity handles the two exponential steps as follows:

• Sensitivity: the exponential-time sensitivity computation is replaced with the polynomial-time overlap complexity computation.

• Number of spaced seeds: there are two issues regarding the number of spaced seeds. First, there are exponentially many possible choices for the lengths of the seeds, which is handled by guessing a set of good lengths. Second, there are exponentially many possible seeds for the fixed lengths, which is handled by repeatedly swapping a 1 with a * as long as the overlap complexity improves, greedily selecting at each step the swap that gives the largest improvement. The number of swaps in each seed is bounded by the weight of the seed.
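The swap heuristic in the second step can be sketched as below. This is a simplified illustration and not the published algorithm: the starting seeds, the order in which seeds are processed, and the decision to keep the first and last positions fixed at 1 (so that seed lengths do not change) are all assumptions. An improvement is taken to mean a lower overlap complexity, since fewer overlapping hits correspond to higher sensitivity.

```python
def oc_pair(s1, s2):
    """Overlap complexity of two seeds, as in (4.1)."""
    total = 0
    for shift in range(1 - len(s2), len(s1)):
        shared = sum(
            1
            for j, c in enumerate(s2)
            if c == "1" and 0 <= j + shift < len(s1) and s1[j + shift] == "1"
        )
        total += 2 ** shared
    return total


def oc_set(seeds):
    """Overlap complexity of a seed set, as in (4.4)."""
    n = len(seeds)
    return sum(oc_pair(seeds[i], seeds[j]) for i in range(n) for j in range(i, n))


def greedy_swaps(seeds):
    """Repeatedly apply the single 1/* swap that lowers OC(S) the most."""
    seeds = list(seeds)
    for idx in range(len(seeds)):
        weight = seeds[idx].count("1")
        for _ in range(weight):  # at most `weight` swaps per seed
            best_score, best_seed = oc_set(seeds), None
            seed = seeds[idx]
            # Interior positions only: the end positions stay 1.
            for a in range(1, len(seed) - 1):
                for b in range(1, len(seed) - 1):
                    if seed[a] == "1" and seed[b] == "*":
                        chars = list(seed)
                        chars[a], chars[b] = "*", "1"
                        trial = seeds[:idx] + ["".join(chars)] + seeds[idx + 1:]
                        score = oc_set(trial)
                        if score < best_score:
                            best_score, best_seed = score, trial[idx]
            if best_seed is None:
                break  # no swap improves the overlap complexity
            seeds[idx] = best_seed
    return seeds


start = ["11111***1", "1111****1"]   # hypothetical starting seeds
improved = greedy_swaps(start)
```

Each swap moves one 1 to a * position, so the weight and the length of every seed are preserved while OC(S) never increases.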

Various sets of variable-length multiple spaced seeds of different weights have been generated for our oligonucleotide probe design algorithm using the Ilies’ algorithm discussed above.

Cross-hybridization assessment in the proposed method for oligo probe design is performed in two homology search phases: fast and intensive. In the fast phase, the proposed algorithm tries to quickly eliminate as many non-candidate oligo positions as it can. This is an essential and very efficient phase that lightens the load on the intensive homology search, especially for large input data sets of target sequences such as the human genome. The intensive phase is performed at the end of the algorithm to make sure that the final oligo probe candidates have been fully checked and verified for specificity against all other positions. Both phases are described in detail below.

4.5.3 Fast homology search

To perform the fast homology search phase for cross-hybridization assessment, we use

the set of eight multiple spaced seeds of weight 10 in Figure 4.4.



111*1**11*1*111
11**1*1**1***1***11*11
1111***1***1****11*1*1
11**11****1*1***1*11*1
111**1***1**1**1****111
11*1**1*******1*1**1***111
11*1***1****1*****1***1**111
11*1*1*****1*****1*****11*11

Figure 4.4: The set of multiple spaced seeds used in the fast homology search phase.

For each of the eight spaced seeds, a hash table is built that handles collisions using the linear probing technique, in the following way. The algorithm starts from the right end of the input vector and scans towards the left end. At each position in the input vector, the algorithm computes the integer value of the position according to the seed model. Here, finding hits is equivalent to finding equal integer values. After computing the integer value for a position, the algorithm searches for it in the hash table and, if it is not present, inserts it together with the related position. If the integer value is already in the hash table, which indicates a hit, the proposed algorithm extends the two positions related to the hit to the left and to the right up to the predefined oligo length and checks the sequence identity parameter. Both positions are eliminated and considered non-candidates if the similarity level of the two extended regions is more than the predefined threshold, and this process continues by sliding the extended regions one position to the left or right until the threshold condition is satisfied. The default value of this threshold is 75%. Figure 4.5 shows how the fast homology search phase works. This process is repeated for each spaced seed.

[Figure 4.5: the input vector (beginning GCNTACACGTCACCATCTGTGCC), shown twice with two positions j and i marked; the spaced seed 111*1**11*1*111 placed on the vector; and a hash table with columns Key and Pos that holds the entry (990925075, i).]

Figure 4.5: Fast homology search.

To check the maximum consecutive match parameter, or stretches of identical

bases, the above technique is applied using a consecutive seed of length equal to the

threshold for the stretch’s length which is by default 15. In addition, after finding the

hits, all oligo positions containing those stretches will be eliminated and considered

as non-candidates.
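The fast phase for one spaced seed can be sketched as follows. Several simplifications are assumptions of this sketch rather than details of the proposed algorithm: a Python dictionary replaces the linear-probing hash table, the input is taken to contain only A, C, G and T, the stored position of a key is always updated to the most recently seen one, and the extension is modelled by trying every window of the oligo length that contains the seed hit.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def seed_key(vec, pos, seed):
    """Integer value of the bases under the 1 positions of the seed at pos."""
    key = 0
    for k, c in enumerate(seed):
        if c == "1":
            key = (key << 2) | CODE[vec[pos + k]]
    return key


def identity(vec, i, j, length):
    """Fraction of equal bases between the windows starting at i and j."""
    return sum(vec[i + k] == vec[j + k] for k in range(length)) / length


def fast_scan(vec, seed, oligo_len, threshold=0.75):
    """Return the oligo start positions rejected by one seed."""
    last_seen = {}                      # key -> last seen position
    rejected = set()
    slack = oligo_len - len(seed)
    for pos in range(len(vec) - len(seed), -1, -1):   # right to left
        key = seed_key(vec, pos, seed)
        other = last_seen.get(key)
        if other is not None:           # equal integer values: a hit
            for d in range(slack + 1):  # windows that contain the hit
                i, j = pos - d, other - d
                if i < 0 or j + oligo_len > len(vec):
                    continue
                if identity(vec, i, j, oligo_len) > threshold:
                    rejected.update((i, j))
        last_seen[key] = pos
    return rejected


SEED = "111*1**11*1*111"                # first seed of Figure 4.4
unit = "ACGGTCATGCAATCGGATCC"           # toy data, invented for this example
rejected = fast_scan(unit + "T" * 10 + unit, SEED, oligo_len=20)
```

In the toy vector the 20-base unit occurs at positions 0 and 30, so both start positions are rejected. For the consecutive-match check described above, the same scan could be run with a seed made of 15 consecutive 1’s.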

4.5.4 Intensive homology search

To perform the intensive homology search phase, first all hash tables are created based

on the set of eight spaced seeds of weight nine in Figure 4.6.

111*1***11*1*11
11***1**1****1***1*111
1*11**1*1***1****1**11
1*1*11****11******1*11
11*1******1***1*1***111
11***1*1*******1**1*1*11
11*1*****1**1******1****111
111**1******1****1*****1**11

Figure 4.6: The set of multiple spaced seeds used in intensive homology search phase.

The hash tables are created by starting from the right end of the input vector and scanning to the left end, calculating the integer value for each position based on the seed’s model, and inserting the integer value as a hash key, together with its corresponding position, into the hash table; if the hash key is already in the hash table, the algorithm only appends the corresponding position to the array of positions for that hash key.

After creating all hash tables, the algorithm thoroughly checks the specificity of each oligo candidate position against all other possible positions in the hash tables. In other words, the algorithm first computes the integer value of the oligo candidate position, and then searches for this value in the hash tables. If the value is present, all positions stored under that hash key are considered for extension and Kane’s similarity check.

The difference between the fast and intensive phases is that for the fast phase

we keep only one position for each hash key that is the last seen position of a hit in

screening from right to left, while in the intensive phase we keep all positions and

their corresponding hash values in the hash table.
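The difference between the two phases shows up in the data structure: the intensive phase stores every position under its hash key. The sketch below illustrates only the index and the lookup; `collections.defaultdict` stands in for the hash table, the function names are assumptions, and the extension and similarity check that follow a lookup are omitted.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def seed_key(vec, pos, seed):
    """Integer value of the bases under the 1 positions of the seed at pos."""
    key = 0
    for k, c in enumerate(seed):
        if c == "1":
            key = (key << 2) | CODE[vec[pos + k]]
    return key


def build_index(vec, seed):
    """Map every hash key to the list of all positions that produce it."""
    index = defaultdict(list)
    for pos in range(len(vec) - len(seed), -1, -1):   # right to left
        index[seed_key(vec, pos, seed)].append(pos)
    return index


def positions_to_check(vec, index, seed, pos):
    """All other positions sharing the hash key of the candidate position."""
    key = seed_key(vec, pos, seed)
    return [q for q in index.get(key, []) if q != pos]


vec = "ACGTACGTACGT"        # toy input with a short seed for readability
index = build_index(vec, "1*1")
hits = positions_to_check(vec, index, "1*1", 0)
```

For the toy input, the word read at position 0 (A, then G) also occurs at positions 4 and 8, so `hits` contains both; each of them would then be extended and checked.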



4.6 GC-content evaluation

GC-content is the percentage of guanine (G) and cytosine (C) in a DNA sequence. For

the GC-content evaluation, the proposed algorithm considers a user-defined range that

is determined by minimum and maximum GC-content thresholds. In the proposed

algorithm, the default values for minimum and maximum GC-content thresholds are

30% and 70%, respectively.

[Figure 4.7: flow diagram. A window slides over the input vector; at each oligo position the test 30 ≤ GC-content ≤ 70 is applied. NO: eliminate (✗). YES: candidate (✓). The result is one mark (✓ or ✗) per position.]

Figure 4.7: GC-content evaluation process.

As shown in Figure 4.7, GC-content evaluation is carried out by starting from the right end of the input vector, scanning to the left end, checking the percentage of G and C, and eliminating the oligo candidate positions whose GC-content percentage is not in the predefined range.
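A minimal sketch of this scan is given below. The incremental update of the G and C count as the window moves is an implementation choice of the sketch, not something stated above, and the function name and toy input are assumptions.

```python
def gc_candidates(vec, oligo_len, gc_min=30.0, gc_max=70.0):
    """Return the start positions whose window GC percentage lies in
    [gc_min, gc_max], scanning from the right end to the left end."""
    keep = []
    if len(vec) < oligo_len:
        return keep
    start = len(vec) - oligo_len
    gc = sum(1 for base in vec[start:] if base in "GC")
    while True:
        percent = 100.0 * gc / oligo_len
        if gc_min <= percent <= gc_max:
            keep.append(start)
        if start == 0:
            break
        # Slide one position to the left: one base enters on the left,
        # one base leaves on the right.
        start -= 1
        gc += vec[start] in "GC"
        gc -= vec[start + oligo_len] in "GC"
    return keep


kept = gc_candidates("GGGGATATAT", oligo_len=4)
```

With the toy input and a window of four bases, only the window GGAT at position 2 has a GC-content (50%) inside the default 30% to 70% range.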



4.7 Melting temperature management

The melting temperature, Tm, of an oligo duplex is the temperature at which the

oligo is 50% bound to its complement. This means that 50% of the molecules are

single-stranded while 50% of the molecules are in the double-stranded form.

To achieve maximum uniformity for the set of oligonucleotides, a small difference

between melting temperatures of oligonucleotides is required. Several approaches are

available for computing the melting temperature. The most commonly used approach

is applying the Nearest Neighbour model with either the parameters from SantaLucia

(1998) or Rychlik et al. (1990).

The proposed algorithm employs the following formula to calculate the Tm:

$$T_m = \frac{\Delta H}{\Delta S + R \ln(C/4)} - 273.15 + 12 \log[\mathrm{Na}^+] \tag{4.5}$$

where ∆H (kcal/mol), or enthalpy, is the total energy exchange between the system

and its surrounding environment, ∆S (cal/mol.K), or entropy, denotes the energy

spent by the system to organize itself, R = 1.987 (cal/mol.K) is the ideal gas constant,

C is the molar concentration of the oligo, and [Na+] is the molar concentration of

salt.

∆H and ∆S are obtained from the Nearest Neighbor model and the thermodynamic parameters summarized in Table 4.3. ∆H and ∆S are the sums over all nearest neighbor stacks, or doublets, including the end interactions, that is:

$$\Delta H = \Delta H_{end} + \sum_{ij} N_{ij}\,\Delta H_{ij}, \qquad \Delta S = \Delta S_{end} + \sum_{ij} N_{ij}\,\Delta S_{ij} \qquad (4.6)$$



where Nij is the number of times the specific nearest-neighbor stack occurs in the

duplex sequence.

Table 4.3: Nearest-Neighbor parameters for DNA/DNA duplexes (SantaLucia, 1998).

| Stack (5′3′/3′5′) | ∆H (kcal/mol) | ∆S (cal/mol.K) |
|---|---|---|
| AA/TT | -7.9 | -22.2 |
| AT/TA | -7.2 | -20.4 |
| TA/AT | -7.2 | -21.3 |
| CA/GT | -8.5 | -22.7 |
| GT/CA | -8.4 | -22.4 |
| CT/GA | -7.8 | -21.0 |
| GA/CT | -8.2 | -22.2 |
| CG/GC | -10.6 | -27.2 |
| GC/CG | -9.8 | -24.4 |
| GG/CC | -8.0 | -19.9 |
| Init. w/term. G/C | 0.1 | -2.8 |
| Init. w/term. A/T | 2.3 | 4.1 |
| Symmetry correction | 0 | -1.4 |
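As an illustration, the melting temperature calculation can be sketched as follows, using the parameters of Table 4.3 and the default oligo and salt concentrations (1 µM and 75 mM). Several points are assumptions rather than statements of the thesis: ∆H is converted from kcal/mol to cal/mol so that its units agree with ∆S, the logarithm of the salt term is taken to base 10, one initiation term is added for each end of the oligo, and the symmetry correction is applied to self-complementary oligos. All function names are hypothetical.

```python
import math

# Nearest-neighbour parameters copied from Table 4.3:
# stack -> (dH in kcal/mol, dS in cal/(mol K)).
NN = {
    "AA": (-7.9, -22.2), "AT": (-7.2, -20.4), "TA": (-7.2, -21.3),
    "CA": (-8.5, -22.7), "GT": (-8.4, -22.4), "CT": (-7.8, -21.0),
    "GA": (-8.2, -22.2), "CG": (-10.6, -27.2), "GC": (-9.8, -24.4),
    "GG": (-8.0, -19.9),
}
INIT_GC = (0.1, -2.8)     # initiation with terminal G/C
INIT_AT = (2.3, 4.1)      # initiation with terminal A/T
SYMMETRY_DS = -1.4        # symmetry correction (entropy only)
R = 1.987                 # gas constant, cal/(mol K)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def nn_thermo(oligo):
    """Sum dH (kcal/mol) and dS (cal/(mol K)) over ends and stacks, as in Eq. 4.6."""
    oligo = oligo.upper()
    dh = ds = 0.0
    for end in (oligo[0], oligo[-1]):   # assumption: one initiation term per end
        h, s = INIT_GC if end in "GC" else INIT_AT
        dh, ds = dh + h, ds + s
    for i in range(len(oligo) - 1):
        stack = oligo[i:i + 2]
        # A stack and its reverse complement describe the same doublet.
        h, s = NN[stack] if stack in NN else NN[reverse_complement(stack)]
        dh, ds = dh + h, ds + s
    if oligo == reverse_complement(oligo):   # assumption: self-complementary case
        ds += SYMMETRY_DS
    return dh, ds


def melting_temperature(oligo, conc=1e-6, na=0.075):
    """Tm in degrees Celsius following Eq. 4.5."""
    dh, ds = nn_thermo(oligo)
    # Assumptions: dH converted to cal/mol, base-10 logarithm for the salt term.
    return 1000.0 * dh / (ds + R * math.log(conc / 4)) - 273.15 + 12 * math.log10(na)
```

For the toy sequence `CGT`, the sums are ∆H = 0.1 + 2.3 − 10.6 − 8.4 = −16.6 kcal/mol and ∆S = −2.8 + 4.1 − 27.2 − 22.4 = −48.3 cal/mol.K.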

After calculating the melting temperature for the remaining oligo candidate positions, the proposed algorithm takes the following approach to maximize the uniformity of the set of oligonucleotides. Given a user-defined interval length for the oligos' Tm (the default is 10 degrees), the algorithm looks for the optimum interval, that is, the interval that covers the maximum number of target sequences. To do this, the algorithm records, for each target sequence, the melting temperatures of its candidates in all possible intervals of the predefined length and then picks the interval that includes the highest number of target sequences. Finally, oligo positions whose melting temperatures are not in this interval are eliminated as non-candidate positions.
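The interval selection can be sketched as below. This is a simple quadratic-time illustration rather than the thesis implementation; it relies on the observation that some optimal interval can always be shifted so that its lower end coincides with an observed Tm value. The function name and the dictionary layout are hypothetical.

```python
def best_tm_interval(tms_by_target, width=10.0):
    """Return (lower bound, number of covered targets) for the best Tm interval.

    tms_by_target maps each target sequence to the Tm values of its
    remaining candidate positions. A target is covered when at least one
    of its candidates has a Tm inside [lower, lower + width].
    """
    starts = sorted({t for tms in tms_by_target.values() for t in tms})
    best_lo, best_cover = None, -1
    for lo in starts:
        hi = lo + width
        cover = sum(1 for tms in tms_by_target.values()
                    if any(lo <= t <= hi for t in tms))
        if cover > best_cover:            # ties keep the lowest interval
            best_lo, best_cover = lo, cover
    return best_lo, best_cover
```

For example, with candidates `{"g1": [60, 75], "g2": [62, 90], "g3": [68, 40]}` and a width of 10, the interval [60, 70] covers all three targets.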



4.8 Secondary structure assessment

Secondary structures can reduce the sensitivity of oligonucleotide probes by dramatically decreasing their ability to anneal with target sequences. Examples of secondary structures, such as hairpins and dimers, are shown in Figure 4.8.

[Figure: panel a shows a hairpin with its stem and loop labelled; panel b shows a dimer formed by two base-paired oligos.]

Figure 4.8: Examples of secondary structures: a. hairpin, b. dimer.

The secondary structure assessment can be performed in two ways. The first is self-complementarity (self-annealing) checking, which aligns the oligonucleotide with its reverse-complement sequence; the second is a thermodynamic calculation that determines the stability of potential secondary structures. The proposed algorithm uses the first approach, because the goal is to determine whether any secondary structure is probable, not to identify the best secondary structure.

Because the probability of forming secondary structures is very high in regions of significant self-complementarity, the secondary structure assessment step is used to avoid designing oligos that form stable secondary structures. Oligo candidates are examined for potential stem-loop structures by searching for short stretches of complementary sequence, called stems, that are separated by a few bases, called the loop. In addition, all oligo candidates are checked for their ability to form dimers by counting the base-pairing nucleotides within a specific window size (Figure 4.8). All parameters related to hairpins and dimers are specified by the user. The default stem length is six bases, and the default minimum and maximum loop lengths are one and three bases, respectively. For dimers, the default values for window length and stringency are 15 and 13, respectively.
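The stem-loop search can be illustrated with the following sketch, which looks for a stem whose reverse complement occurs again after a loop of the permitted length. It assumes exact stem matches and covers only the hairpin check, not the dimer check; the function name is hypothetical and the code is not the thesis implementation.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def has_hairpin(oligo, stem=6, min_loop=1, max_loop=3):
    """True if a stem of the given length can pair across a short loop."""
    oligo = oligo.upper()
    for i in range(len(oligo) - 2 * stem - min_loop + 1):
        left = oligo[i:i + stem]
        # The right arm of the stem must be the reverse complement of the left arm.
        right = left.translate(COMPLEMENT)[::-1]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if oligo[j:j + stem] == right:
                return True
    return False
```

For instance, `AAGATCCATTTGGATCAA` contains the stem `GATCCA`, a two-base loop `TT`, and the reverse complement `TGGATC`, so it is reported as a potential hairpin.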

After all eliminations from the previous steps are done, the proposed algorithm picks the “safest position” for each target sequence. This position is the middle of the largest interval of consecutive candidate positions, and it is considered first when choosing the final oligo candidate. The final oligo candidate can also be selected according to its position in the target sequence (proximity to the 5′ or 3′ end). Then, the secondary structure assessment is performed, and all surviving oligo candidates for each target sequence are considered for the final intensive homology search and selection. Figure 4.9 shows a summarized version of the proposed algorithm.
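The choice of the safest position can be sketched as a search for the longest run of consecutive candidate positions. The function name is hypothetical, and taking the lower middle position for runs of even length is an assumption.

```python
def safest_position(is_candidate):
    """Middle index of the longest run of True values, or None if there is none."""
    best_start, best_len = -1, 0
    run_start = None
    # A trailing False closes a run that reaches the end of the list.
    for i, ok in enumerate(list(is_candidate) + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_len == 0:
        return None
    return best_start + (best_len - 1) // 2
```

For the candidate vector `[False, True, True, True, False, True, True]`, the longest run spans positions 1 to 3, so position 2 is returned.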



Algorithm OligoDesign(G)
- given: the set G of k input target sequences: G = {g1, g2, . . . , gk}
- returns: a set O of m optimal oligonucleotides: O = {o1, o2, . . . , om}

1. V = Merge(G)
2. C = Code(V) // A:00, C:01, G:10, T:11
3. FastHomology(C, 1^maxMatch) // checking for max consecutive match
4. for i = 1 to |Seed1| do
5.     FastHomology(C, Seed1[i]) // checking for sequence identity
6. GCcontent(C, minGC, maxGC)
7. MeltTemp(C, |range|)
8. for i = 1 to |Seed2| do
9.     hi = Hash(C, Seed2[i])
10. O = ∅
11. for i = 1 to k do
12.     oi = ε
13.     while (oi = ε and ∃ candidate ∈ [1..|gi|]) do
14.         SecStruct(candidate)
15.         for j = 1 to |Seed2| do
16.             IntensiveHomology(candidate, hj)
17.         O = O + (oi = candidate)
18.     end while
19. end for
20. return (O)

Figure 4.9: The OligoDesign algorithm.


Chapter 5

Experimental results and

evaluation

This chapter presents the experimental results as well as the comparison of the pro-

posed method with other oligo probe design algorithms.

5.1 Experimental results

The proposed algorithm is implemented entirely in C/C++ and runs on many plat-

forms such as Linux/Unix, Mac OS, and Windows.

The input data sets used for the experimental results are listed in Table 5.1. All input data sets are in FASTA format, in which the beginning of a new sequence is denoted by a ‘>’ sign followed by a title that identifies the sequence. The input DNA sequences may contain ambiguity characters, denoted by N, but these will not be included in any designed oligo probe. In other words, all possible oligo probes containing ambiguity characters are masked and considered non-candidate oligo positions, so only oligo probes that consist of the characters A, C, G, and T are selected (uppercase and lowercase are accepted).
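The masking of ambiguity characters can be sketched as follows: every oligo window that overlaps a character other than A, C, G, or T is marked as a non-candidate. The function name is hypothetical and the sketch is not the thesis implementation.

```python
def mask_ambiguous(seq, oligo_len=50):
    """Return one boolean per oligo start position: False if the window
    contains any character other than A, C, G, T (case-insensitive)."""
    seq = seq.upper()
    n = len(seq) - oligo_len + 1
    ok = [True] * max(n, 0)
    for p, base in enumerate(seq):
        if base not in "ACGT":
            # Every window that starts in [p - oligo_len + 1, p] covers position p.
            for start in range(max(0, p - oligo_len + 1), min(n, p + 1)):
                ok[start] = False
    return ok
```

With a toy oligo length of 3, the sequence `ACGNTACG` keeps only the windows `ACG`, `TAC`, and `ACG` at positions 0, 4, and 5.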

Table 5.1: Input data sets used for experimental results.

| Species model | Gene set size (genes) | Gene set size (bp) | Description |
|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1421 genes known to be involved in the development of the mouse nervous system |
| ecoli | 5,317 | 4,843,471 | Escherichia coli O157:H7 (E. coli) gene sequences |
| bee | 11,324 | 6,010,949 | Apis (bee) EST sequences |
| yeast | 6,702 | 9,074,997 | Yeast CDS sequences |
| plasmodium | 9,518 | 10,739,506 | Plasmodium falciparum sequences |
| zebrafish | 12,238 | 23,003,650 | Zebrafish mRNA sequences |
| drosophila | 18,962 | 32,198,758 | Drosophila melanogaster complete CDS sequences |
| chicken | 26,236 | 32,732,911 | Gallus gallus mRNA for DNA topoisomerase I, complete cds |
| celegans | 30,935 | 34,753,016 | C. elegans EST sequences |
| arabidopsis | 28,952 | 36,298,530 | Arabidopsis thaliana complete CDS sequences |
| maize | 58,579 | 38,963,590 | Maize EST sequences (release 15 data set) |
| mouse | 35,284 | 68,604,317 | Mouse cDNA sequences |
| human | 28,205 | 72,720,516 | Homo sapiens small muscle protein, X-linked (SMPX), mRNA |
| mouserna | 36,598 | 93,830,285 | Mus musculus transcript sequence file (release 21) |
| rice | 66,710 | 113,204,455 | Rice cDNA gene set |

The default parameters used in the implementation are listed below.

• oligo length: 50 bases (b)

• maximum consecutive match: 15 b

• maximum sequence similarity: 75%



• minimum GC-content: 30%, maximum GC-content: 70%

• oligo concentration: 1 µM, salt concentration: 75 mM

• melting temperature interval length: 10

• hairpin stem: 6 b, maximum hairpin loop: 3 b, minimum hairpin loop: 1 b

• dimer window length: 15 b, stringency: 13 b

The system used for running the proposed algorithm has the following specifications:

• CPU: Intel® Core™ i7-2600 CPU @ 3.40GHz

– Number of Cores = 4

– Number of Threads = 8

– Clock Speed = 3.4 GHz

– Cache Size = 8 MB

• Physical Memory: 16 GB

• OS: GNU Linux version 2.6.38.8-desktop-69mib

• Compiler: gcc version 4.4.3

The proposed algorithm has also been parallelized using OpenMP, an Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.

Tables 5.2-5.8 show the results of running the proposed algorithm with the default parameters listed above and with the various sets of multiple spaced seeds of different weights, shown in Figures 5.1-5.7, for the fast and intensive homology searches. For most scenarios, we have employed two different sets of spaced seeds for the fast and intensive homology search steps. However, the proposed algorithm can also be run with only a fast homology search, or a double fast homology search, for cross-hybridization assessment, which results in a shorter running time but also lower oligo quality. For the fast homology search step, three sets of spaced seeds have been tested: the set of eight multiple spaced seeds of weight 10, denoted by s8w10; the set of eight multiple spaced seeds of weight 11, denoted by s8w11; and the set of eight multiple spaced seeds of weight 8, denoted by s8w8. For the intensive homology search step, three different sets of spaced seeds have also been tested: the set of eight multiple spaced seeds of weight 9, denoted by s8w9; the set of eight multiple spaced seeds of weight 8, denoted by s8w8; and the set of 16 multiple spaced seeds of weight 9, denoted by s16w9. As Tables 5.2-5.8 show, different seed models for the fast and intensive homology search steps generate different numbers of oligonucleotide probes and have different running times and memory usage. The best running time and memory usage are obtained when only the fast homology search is applied, but the quality of the generated oligo probes is not good, and some of them do not satisfy the sequence similarity parameter. The combination of eight seeds of weight 10 (s8w10) for the fast homology search step and eight seeds of weight nine (s8w9) for the intensive homology search step is selected as the best combination for the comparison with other oligo design algorithms, which is presented in the next section. Moreover, the implementation allows the algorithm to run on any system by changing the seed specification; that is, if the physical memory of the system is not enough for the s8w10-s8w9 combination, then smaller seed sets will be selected to perform the intensive homology search.

111*1***11*1*11 11***1**1****1***1*111 1*11**1*1***1****1**11 1*1*11****11******1*11 11*1******1***1*1***111 11***1*1*******1**1*1*11 11*1*****1**1******1****111 111**1******1****1*****1**11

111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11

Fast Homology Search Intensive Homology Search

Figure 5.1: Seeds used for fast homology search (s8w10) and for intensive homology search (s8w9).
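To illustrate how a spaced seed such as 111*1**11*1*111 is applied, the following sketch builds a hash key from the bases under the 1 positions of the seed, using the two-bit code A:00, C:01, G:10, T:11, and reports the position pairs at which two sequences agree on all of those positions. The function names are hypothetical, and a plain dictionary stands in for the hash tables of the actual implementation.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # two-bit code: A:00, C:01, G:10, T:11


def seed_key(window, seed):
    """Pack the bases under the '1' positions of the seed into one integer."""
    key = 0
    for base, s in zip(window, seed):
        if s == "1":
            key = (key << 2) | CODE[base]
    return key


def seed_hits(a, b, seed):
    """Return pairs (i, j) such that a[i:] and b[j:] match at every '1' of the seed."""
    span = len(seed)
    table = {}
    for i in range(len(a) - span + 1):
        table.setdefault(seed_key(a[i:i + span], seed), []).append(i)
    hits = []
    for j in range(len(b) - span + 1):
        for i in table.get(seed_key(b[j:j + span], seed), []):
            hits.append((i, j))
    return hits
```

With the toy seed `1*1`, the windows `ACG` and `ATG` produce the same key because the middle position is a "don't care" position, so `seed_hits("ACGT", "TTATGA", "1*1")` reports the pair (0, 2).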

1*1****1***1*****1*111 1*11*1******1**1***1*1 11***1*1*****1**1*1**1 11**1*****1****1**1****11 1*1***1******11***1****11 111******1*****1******1***11 11*1*******1********1***1*11 11***1******1*******1**1*1*1

111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11



5.2 Results from other algorithms

To compare our algorithm with other algorithms for oligo probe design, a comprehensive survey of the available software programs has been performed and the best algorithms



Table 5.2: Results by employing s8w10-s8w9.

| Data set | Target sequences | Length (bp) | Number of oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,418 | 4 | 353 |
| ecoli | 5,317 | 4,843,471 | 4,647 | 4 | 374 |
| bee | 11,324 | 6,010,949 | 10,675 | 6 | 428 |
| yeast | 6,702 | 9,074,997 | 6,178 | 9 | 569 |
| plasmodium | 9,518 | 10,739,506 | 4,527 | 14 | 658 |
| zebrafish | 12,238 | 23,003,650 | 7,989 | 31 | 1,211 |
| drosophila | 18,962 | 32,198,758 | 11,826 | 34 | 1,636 |
| chicken | 26,236 | 32,732,911 | 16,692 | 38 | 1,661 |
| celegans | 30,935 | 34,753,016 | 21,724 | 44 | 1,755 |
| arabidopsis | 28,952 | 36,298,530 | 21,326 | 45 | 1,826 |
| maize | 58,579 | 38,963,590 | 43,614 | 59 | 1,952 |
| mouse | 35,284 | 68,604,317 | 20,200 | 99 | 3,321 |
| human | 28,205 | 72,720,516 | 18,781 | 109 | 3,511 |
| mouserna | 36,598 | 93,830,285 | 17,963 | 151 | 4,477 |
| rice | 66,710 | 113,204,455 | 28,552 | 215 | 5,375 |


| Data set | Target sequences | Length (bp) | Number of oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,418 | 4 | 753 |
| ecoli | 5,317 | 4,843,471 | 4,647 | 4 | 753 |
| bee | 11,324 | 6,010,949 | 10,675 | 9 | 756 |
| yeast | 6,702 | 9,074,997 | 6,178 | 12 | 763 |
| plasmodium | 9,518 | 10,739,506 | 4,527 | 28 | 767 |
| zebrafish | 12,238 | 23,003,650 | 7,989 | 40 | 1,217 |
| drosophila | 18,962 | 32,198,758 | 11,826 | 61 | 1,639 |
| chicken | 26,236 | 32,732,911 | 16,692 | 70 | 1,665 |
| celegans | 30,935 | 34,753,016 | 21,724 | 90 | 1,758 |
| arabidopsis | 28,952 | 36,298,530 | 21,326 | 90 | 1,830 |
| maize | 58,579 | 38,963,590 | 43,614 | 134 | 1,961 |
| mouse | 35,284 | 68,604,317 | 20,200 | 218 | 3,335 |
| human | 28,205 | 72,720,516 | 18,781 | 238 | 3,522 |
| mouserna | 36,598 | 93,830,285 | 17,963 | 336 | 4,466 |
| rice | 66,710 | 113,204,455 | 28,552 | 523 | 5,373 |



11**11*1****11**11 111***1**1**11*1*1 11**11****11**11*1 1*1*1*11***1**1*11 1*11*1****1*1*1*11 11*1***1*1*1***111 1*1*1**1*11****1*11 11*11*****1**1*1**11 111****1***1***1**111 1*1**1**1***11****111 111**1****1**1*****1*11 11*1*****1******1**1*111 111***1*******1***1**11*1 11**1*1********1****1**111 11*1***1***1********1*1*11 111*1****1*********1*****111

111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11




| Data set | Target sequences | Length (bp) | Number of oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,418 | 6 | 697 |
| ecoli | 5,317 | 4,843,471 | 4,647 | 7 | 739 |
| bee | 11,324 | 6,010,949 | 10,675 | 11 | 844 |
| yeast | 6,702 | 9,074,997 | 6,178 | 14 | 1,119 |
| plasmodium | 9,518 | 10,739,506 | 4,527 | 27 | 1,294 |
| zebrafish | 12,238 | 23,003,650 | 7,989 | 53 | 2,375 |
| drosophila | 18,962 | 32,198,758 | 11,826 | 63 | 3,204 |
| chicken | 26,236 | 32,732,911 | 16,692 | 71 | 3,252 |
| celegans | 30,935 | 34,753,016 | 21,724 | 84 | 3,435 |
| arabidopsis | 28,952 | 36,298,530 | 21,326 | 85 | 3,574 |
| maize | 58,579 | 38,963,590 | 43,614 | 115 | 3,820 |
| mouse | 35,284 | 68,604,317 | 20,200 | 193 | 6,492 |
| human | 28,205 | 72,720,516 | 18,781 | 215 | 6,864 |
| mouserna | 36,598 | 93,830,285 | 17,963 | 294 | 8,754 |
| rice | 66,710 | 113,204,455 | 28,552 | 422 | 10,509 |



1*1*1*1****11***11 11****11***1**11*1 1*1**1**1****111*1 1**11**11******1*11 111****1**1***1*1*1 11*1**1***1**1****11 111******1***1*****1*11 11*1***1*************1*1*11

111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11

Fast Homology Search Fast Homology Search

Figure 5.4: Seeds used for double fast homology search (s8w10) and (s8w8).


| Data set | Target sequences | Length (bp) | Number of oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,418 | 2 | 309 |
| ecoli | 5,317 | 4,843,471 | 4,649 | 3 | 310 |
| bee | 11,324 | 6,010,949 | 10,678 | 4 | 313 |
| yeast | 6,702 | 9,074,997 | 6,198 | 6 | 319 |
| plasmodium | 9,518 | 10,739,506 | 4,554 | 4 | 323 |
| zebrafish | 12,238 | 23,003,650 | 8,052 | 12 | 349 |
| drosophila | 18,962 | 32,198,758 | 11,840 | 16 | 369 |
| chicken | 26,236 | 32,732,911 | 16,737 | 22 | 370 |
| celegans | 30,935 | 34,753,016 | 22,041 | 17 | 940 |
| arabidopsis | 28,952 | 36,298,530 | 22,171 | 17 | 943 |
| maize | 58,579 | 38,963,590 | 44,646 | 21 | 949 |
| mouse | 35,284 | 68,604,317 | 20,516 | 28 | 1,013 |
| human | 28,205 | 72,720,516 | 19,075 | 31 | 1,022 |
| mouserna | 36,598 | 93,830,285 | 18,210 | 41 | 1,067 |
| rice | 66,710 | 113,204,455 | 29,567 | 47 | 1,109 |



111*1**11*1*111 11**1*1**1***1***11*11 1111***1***1****11*1*1 11**11****1*1***1*11*1 111**1***1**1**1****111 11*1**1*******1*1**1***111 11*1***1****1*****1***1**111 11*1*1*****1*****1*****11*11

Figure 5.5: Seeds used for fast homology search (s8w10).

Table 5.6: Results by employing s8w10.

| Data set | Target sequences | Length (bp) | Number of oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,418 | 2 | 309 |
| ecoli | 5,317 | 4,843,471 | 4,649 | 2 | 310 |
| bee | 11,324 | 6,010,949 | 10,678 | 2 | 312 |
| yeast | 6,702 | 9,074,997 | 6,198 | 4 | 319 |
| plasmodium | 9,518 | 10,739,506 | 4,554 | 2 | 323 |
| zebrafish | 12,238 | 23,003,650 | 8,053 | 10 | 349 |
| drosophila | 18,962 | 32,198,758 | 11,841 | 12 | 369 |
| chicken | 26,236 | 32,732,911 | 16,738 | 13 | 936 |
| celegans | 30,935 | 34,753,016 | 22,046 | 13 | 940 |
| arabidopsis | 28,952 | 36,298,530 | 22,180 | 14 | 943 |
| maize | 58,579 | 38,963,590 | 44,656 | 17 | 949 |
| mouse | 35,284 | 68,604,317 | 20,535 | 23 | 1,013 |
| human | 28,205 | 72,720,516 | 19,081 | 25 | 1,022 |
| mouserna | 36,598 | 93,830,285 | 18,210 | 34 | 1,067 |
| rice | 66,710 | 113,204,455 | 29,604 | 39 | 1,109 |



111*1***11*1*11 11***1**1****1***1*111 1*11**1*1***1****1**11 1*1*11****11******1*11 11*1******1***1*1***111 11***1*1*******1**1*1*11 11*1*****1**1******1****111 111**1******1****1*****1**11

1*111*1*1***111*11 11*1***11*111**111 1111**11**1**1*1*11 111**1*1***1**1**1111 11**1*1*11*1*****1*111 111*1***1****11***1*111 111*1**1*******1***1**1*111 11*11*1*****1*******1*1**111




| Data set | Target sequences | Length (bp) | Number of oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,418 | 3 | 753 |
| ecoli | 5,317 | 4,843,471 | 4,646 | 3 | 754 |
| bee | 11,324 | 6,010,949 | 10,677 | 6 | 756 |
| yeast | 6,702 | 9,074,997 | 6,178 | 11 | 763 |
| plasmodium | 9,518 | 10,739,506 | 4,527 | 16 | 767 |
| zebrafish | 12,238 | 23,003,650 | 7,989 | 42 | 1,211 |
| drosophila | 18,962 | 32,198,758 | 11,826 | 55 | 1,636 |
| chicken | 26,236 | 32,732,911 | 16,692 | 57 | 1,661 |
| celegans | 30,935 | 34,753,016 | 21,724 | 64 | 1,755 |
| arabidopsis | 28,952 | 36,298,530 | 21,331 | 69 | 1,826 |
| maize | 58,579 | 38,963,590 | 43,614 | 94 | 1,952 |
| mouse | 35,284 | 68,604,317 | 20,200 | 149 | 3,321 |
| human | 28,205 | 72,720,516 | 18,781 | 160 | 3,511 |
| mouserna | 36,598 | 93,830,285 | 17,963 | 218 | 4,477 |
| rice | 66,710 | 113,204,455 | 28,556 | 294 | 5,375 |



11**11*1****11**11 111***1**1**11*1*1 11**11****11**11*1 1*1*1*11***1**1*11 1*11*1****1*1*1*11 11*1***1*1*1***111 1*1*1**1*11****1*11 11*11*****1**1*1**11 111****1***1***1**111 1*1**1**1***11****111 111**1****1**1*****1*11 11*1*****1******1**1*111 111***1*******1***1**11*1 11**1*1********1****1**111 11*1***1***1********1*1*11 111*1****1*********1*****111


1*111*1*1***111*11 11*1***11*111**111 1111**11**1**1*1*11 111**1*1***1**1**1111 11**1*1*11*1*****1*111 111*1***1****11***1*111 111*1**1*******1***1**1*111 11*11*1*****1*******1*1**111



| Data set | Target sequences | Length (bp) | Number of oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,418 | 6 | 753 |
| ecoli | 5,317 | 4,843,471 | 4,646 | 7 | 754 |
| bee | 11,324 | 6,010,949 | 10,675 | 10 | 844 |
| yeast | 6,702 | 9,074,997 | 6,178 | 15 | 1,119 |
| plasmodium | 9,518 | 10,739,506 | 4,527 | 28 | 1,294 |
| zebrafish | 12,238 | 23,003,650 | 7,989 | 61 | 2,175 |
| drosophila | 18,962 | 32,198,758 | 11,826 | 81 | 3,204 |
| chicken | 26,236 | 32,732,911 | 16,692 | 88 | 3,552 |
| celegans | 30,935 | 34,753,016 | 21,724 | 95 | 3,435 |
| arabidopsis | 28,952 | 36,298,530 | 21,331 | 106 | 3,574 |
| maize | 58,579 | 38,963,590 | 43,614 | 145 | 3,820 |
| mouse | 35,284 | 68,604,317 | 20,200 | 231 | 6,492 |
| human | 28,205 | 72,720,516 | 18,781 | 259 | 6,864 |
| mouserna | 36,598 | 93,830,285 | 17,963 | 352 | 8,754 |
| rice | 66,710 | 113,204,455 | 28,556 | 488 | 10,509 |



have been selected for this purpose. Of the algorithms in Chapter 4, those freely available and executable for comparison purposes are ArrayOligoSelector, OligoArray, OligoPicker, OligoWiz, PICKY, and YODA. All programs were run with the same parameters so that the comparison is fair: the length of the generated oligo probes is set to 50 for all algorithms; the maximum consecutive match and the maximum sequence similarity percentage are set to 15 and 75, respectively; and the GC-content range is set to [30,70]. Table 5.9 describes the software programs used in the evaluation and comparison process. The results of these algorithms are shown in Tables 5.10-5.15. All programs have been run on the same machine, with the specifications given at the beginning of this chapter.

Table 5.9: Description of the oligo design software programs used in comparison.

| Algorithm | Organism | Specificity bank | Availability | User interface | Platform | Programming language |
|---|---|---|---|---|---|---|
| AOS | No limit | fasta file | Free | Command line | L | Python |
| OAR | No limit | fasta file | Free | Command line and GUI | L/W/M | Java |
| OPR | No limit | fasta file | Free | Command line | L | Perl |
| OWZ | Limited | fasta file | Free | Command line and GUI | W/L/M | Perl, Java |
| PKY | No limit | fasta file | By request | GUI | L/W/M | C++ |
| YDA | No limit | fasta file | Free | Command line and GUI | W/L/M | Java |

Abbreviations used for the algorithms: AOS: ArrayOligoSelector, OAR: OligoArray, OPR: OligoPicker, OWZ: OligoWiz, PKY: PICKY, YDA: YODA. In the Platform column, L: Linux, W: Windows, M: Macintosh.



Table 5.10: Results for ArrayOligoSelector.

| Data set | Target sequences | Length (bp) | Number of oligos | Time (h:m:s) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,421 | 00:12:56 | NA |
| ecoli | 5,317 | 4,843,471 | 5,161 | 00:20:13 | NA |
| bee | 11,324 | 6,010,949 | 11,317 | 00:52:12 | NA |
| yeast | 6,702 | 9,074,997 | 6,645 | 00:40:23 | NA |
| plasmodium | 9,518 | 10,739,506 | 8,991 | 01:14:50 | NA |
| zebrafish | 12,238 | 23,003,650 | 8,481 | 01:52:44 | NA |
| drosophila | 18,962 | 32,198,758 | 16,501 | 02:13:42 | NA |
| chicken | 26,236 | 32,732,911 | 26,036 | 02:50:22 | NA |
| celegans | 30,935 | 34,753,016 | 30,788 | 03:59:10 | NA |
| arabidopsis | 28,952 | 36,298,530 | 27,918 | 03:03:37 | NA |
| maize | 58,579 | 38,963,590 | 58,522 | 06:33:49 | NA |
| mouse | 35,284 | 68,604,317 | 34,491 | 07:44:29 | NA |
| human | 28,205 | 72,720,516 | 27,923 | 06:08:34 | NA |
| mouserna | 36,598 | 93,830,285 | 34,856 | 10:21:06 | NA |
| rice | 66,710 | 113,204,455 | 66,520 | 21:09:40 | NA |

Table 5.11: Results for OligoArray.

| Data set | Target sequences | Length (bp) | Number of oligos | Time (h:m:s) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,410 | 00:05:55 | 1,300 |
| ecoli | 5,317 | 4,843,471 | 4,503 | 00:42:49 | 1,400 |
| bee | 11,324 | 6,010,949 | 10,575 | 01:01:02 | 3,200 |
| yeast | 6,702 | 9,074,997 | 6,156 | 01:26:00 | 1,300 |
| plasmodium | 9,518 | 10,739,506 | 5,370 | 13:39:46 | 3,000 |
| zebrafish | 12,238 | 23,003,650 | 7,573 | 09:15:09 | 3,800 |
| drosophila | 18,962 | 32,198,758 | 10,034 | 14:32:05 | 3,300 |
| chicken | 26,236 | 32,732,911 | 15,984 | 17:16:42 | 2,400 |
| celegans | 30,935 | 34,753,016 | 23,142 | 21:14:35 | 3,000 |
| arabidopsis | 28,952 | 36,298,530 | 19,340 | 14:43:50 | 3,200 |
| maize | 58,579 | 38,963,590 | 42,423 | 16:50:12 | 2,500 |
| mouse | 35,284 | 68,604,317 | 20,353 | 86:10:59 | 8,000 |
| human | 28,205 | 72,720,516 | 18,083 | 24:12:12 | 6,000 |
| mouserna | 36,598 | 93,830,285 | 18,343 | 103:49:30 | 7,900 |
| rice | 66,710 | 113,204,455 | 19,814 | 110:48:35 | 10,000 |



Table 5.12: Results for OligoPicker.

| Data set | Target sequences | Length (bp) | Number of oligos | Time (h:m:s) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,421 | 00:01:06 | NA |
| ecoli | 5,317 | 4,843,471 | 4,672 | 00:04:20 | NA |
| bee | 11,324 | 6,010,949 | 10,823 | 00:09:54 | NA |
| yeast | 6,702 | 9,074,997 | 6,249 | 00:09:00 | NA |
| plasmodium | 9,518 | 10,739,506 | 5,964 | 00:15:25 | NA |
| zebrafish | 12,238 | 23,003,650 | 8,226 | 00:24:38 | NA |
| drosophila | 18,962 | 32,198,758 | 12,245 | 00:39:03 | NA |
| chicken | 26,236 | 32,732,911 | 17,485 | 00:49:55 | NA |
| celegans | 30,935 | 34,753,016 | 24,086 | 01:23:47 | NA |
| arabidopsis | 28,952 | 36,298,530 | 23,687 | 02:19:47 | NA |
| maize | 58,579 | 38,963,590 | 49,475 | 04:28:43 | NA |
| mouse | 35,284 | 68,604,317 | 23,779 | 02:46:37 | NA |
| human | 28,205 | 72,720,516 | 21,410 | 02:18:27 | NA |
| mouserna | 36,598 | 93,830,285 | 21,757 | 02:49:55 | NA |
| rice | 66,710 | 113,204,455 | 38,692 | 08:38:26 | 1,754 |

Table 5.13: Results for OligoWiz.

| Data set | Target sequences | Length (bp) | Number of oligos | Time (h:m:s) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,421 | 00:54:54 | NA |
| ecoli | 5,317 | 4,843,471 | 5,317 | 00:32:29 | NA |
| bee | 11,324 | 6,010,949 | 11,324 | 01:46:36 | NA |
| yeast | 6,702 | 9,074,997 | 6,702 | 01:01:01 | NA |
| plasmodium | 9,518 | 10,739,506 | 9,517 | 11:37:08 | NA |
| zebrafish | 12,238 | 23,003,650 | 12,238 | 13:04:35 | NA |
| drosophila | 18,962 | 32,198,758 | 18,962 | 06:19:40 | NA |
| chicken | 26,236 | 32,732,911 | 26,235 | 05:34:33 | NA |
| celegans | 30,935 | 34,753,016 | 30,935 | 05:42:52 | NA |
| arabidopsis | 28,952 | 36,298,530 | 28,952 | 05:17:28 | NA |
| maize | 58,579 | 38,963,590 | 58,579 | 08:36:26 | NA |
| mouse | 35,284 | 68,604,317 | 35,283 | 28:53:23 | NA |
| human | 28,205 | 72,720,516 | 28,205 | 24:08:28 | NA |
| mouserna | 36,598 | 93,830,285 | 36,585 | 28:53:07 | 416 |
| rice | 66,710 | 113,204,455 | 66,710 | 69:26:16 | 467 |



Table 5.14: Results for PICKY.

| Data set | Target sequences | Length (bp) | Number of oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,421 | 30 | 101 |
| ecoli | 5,317 | 4,843,471 | 4,557 | 47 | 106 |
| bee | 11,324 | 6,010,949 | 10,442 | 62 | 133 |
| yeast | 6,702 | 9,074,997 | 5,850 | 121 | 194 |
| plasmodium | 9,518 | 10,739,506 | 4,138 | 43 | 230 |
| zebrafish | 12,238 | 23,003,650 | 7,428 | 345 | 506 |
| drosophila | 18,962 | 32,198,758 | 10,484 | 493 | 671 |
| chicken | 26,236 | 32,732,911 | 13,781 | 422 | 685 |
| celegans | 30,935 | 34,753,016 | 16,807 | 393 | 725 |
| arabidopsis | 28,952 | 36,298,530 | 18,584 | 436 | 756 |
| maize | 58,579 | 38,963,590 | 26,506 | 442 | 814 |
| mouse | 35,284 | 68,604,317 | 12,473 | 457 | 1,419 |
| human | 28,205 | 72,720,516 | 10,807 | 392 | 1,507 |
| mouserna | 36,598 | 93,830,285 | 9,483 | 445 | 1,938 |
| rice | 66,710 | 113,204,455 | 13,365 | 617 | 2,298 |

Table 5.15: Results for YODA.

| Data set | Target sequences | Length (bp) | Number of oligos | Time (h:m:s) | Space (MB) |
|---|---|---|---|---|---|
| mousenervous | 1,421 | 4,354,947 | 1,418 | 00:20:06 | 819 |
| ecoli | 5,317 | 4,843,471 | 4,590 | 00:33:54 | 918 |
| bee | 11,324 | 6,010,949 | 10,526 | 01:08:53 | 768 |
| yeast | 6,702 | 9,074,997 | 6,128 | 02:30:23 | 1,300 |
| plasmodium | 9,518 | 10,739,506 | 3,885 | 01:02:25 | 515 |
| zebrafish | 12,238 | 23,003,650 | 7,855 | 08:19:16 | 770 |
| drosophila | 18,962 | 32,198,758 | 11,613 | 10:32:47 | 1,500 |
| chicken | 26,236 | 32,732,911 | 16,306 | 12:20:47 | 1,100 |
| celegans | 30,935 | 34,753,016 | 20,941 | 25:25:43 | 1,000 |
| arabidopsis | 28,952 | 36,298,530 | 20,318 | 80:50:12 | 683 |
| maize | 58,579 | 38,963,590 | 41,198 | 57:03:36 | 1,200 |
| mouse | 35,284 | 68,604,317 | 19,399 | 37:40:05 | 933 |
| human | 28,205 | 72,720,516 | 17,997 | 33:58:01 | 1,200 |
| mouserna | 36,598 | 93,830,285 | 17,267 | 39:13:10 | 1,700 |
| rice | 66,710 | 113,204,455 | 25,891 | 127:10:27 | 1,900 |



5.3 Evaluation and comparison

To evaluate and compare all of the mentioned algorithms with our proposed algorithm, we have developed a separate program. This program employs eight highly sensitive multiple spaced seeds of weight six, shown in Figure 5.8.

111*1*11 11**1**111 11*1*****1*11 1*1***1****111 1*1**1**1***1*1 11***1****1**1*1 11*1******1***1*1 11******1****1**11

Figure 5.8: Eight spaced seeds of weight six used in the evaluation program.

The final set of oligo probes designed by each algorithm is the input of the evaluation program. The program works as follows. For each oligo probe, using eight hash tables built from the seeds, it looks for regions similar to the oligo probe among all possible regions in the whole data set (that is, the vector of joined input target sequences) and then checks them against the sequence similarity percentage parameter, i.e. 75%. Oligo probes that have more than 75% similarity are considered bad oligos. The evaluation program is slow because the multiple spaced seeds of weight six cause many hits that must be checked against the sequence similarity parameter. All evaluations were performed on SHARCNET (http://www.sharcnet.ca). We have used the following clusters for our evaluation task: kraken, orca, hound, bramble, and silky. The comparison of the algorithms on the data sets is shown in Tables 5.16-5.30.
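The similarity test applied to each seed hit can be sketched as follows. The sketch assumes an ungapped, position-by-position comparison between the oligo and a region of the same length; whether the evaluation program compares regions in exactly this way is an assumption, and both function names are hypothetical.

```python
def identity_percent(oligo, region):
    """Percentage of positions at which two equal-length sequences agree."""
    matches = sum(x == y for x, y in zip(oligo, region))
    return 100.0 * matches / len(oligo)


def is_bad_oligo(oligo, region, max_similarity=75.0):
    """An oligo is bad if it is more than max_similarity percent similar to the region."""
    return identity_percent(oligo, region) > max_similarity
```

Under the default threshold, a 50-base oligo is flagged when it agrees with a non-target region at 38 or more positions, since 38/50 = 76% exceeds 75% while 37/50 = 74% does not.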



Table 5.16: Evaluation for mousenervous.

| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 1,421 | 174 | 1,247 | 87.76 | 776 | NA |
| OligoArray | 1,410 | 315 | 1,095 | 77.66 | 355 | 1,300 |
| OligoPicker | 1,421 | 17 | 1,404 | 98.80 | 66 | NA |
| OligoWiz | 1,421 | 144 | 1,277 | 89.87 | 3,294 | NA |
| PICKY | 1,421 | 36 | 1,385 | 97.47 | 30 | 101 |
| YODA | 1,418 | 0 | 1,418 | 100 | 1,206 | 819 |
| Proposed method | 1,418 | 0 | 1,418 | 100 | 4 | 353 |

Table 5.17: Evaluation for ecoli.


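| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 5,161 | 647 | 4,514 | 87.46 | 1,213 | NA |
| OligoArray | 4,503 | 1,124 | 3,379 | 75.04 | 2,569 | 1,400 |
| OligoPicker | 4,672 | 145 | 4,527 | 96.90 | 260 | NA |
| OligoWiz | 5,317 | 2,309 | 3,008 | 56.57 | 1,949 | NA |
| PICKY | 4,557 | 65 | 4,492 | 98.57 | 47 | 106 |
| YODA | 4,590 | 16 | 4,574 | 99.65 | 2,034 | 918 |
| Proposed method | 4,647 | 0 | 4,647 | 100 | 4 | 374 |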

Table 5.18: Evaluation for bee.


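| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 11,317 | 1,359 | 9,958 | 87.99 | 3,132 | NA |
| OligoArray | 10,575 | 4,136 | 6,439 | 60.89 | 3,662 | 3,200 |
| OligoPicker | 10,823 | 354 | 10,469 | 96.73 | 594 | NA |
| OligoWiz | 11,324 | 1,544 | 9,780 | 86.37 | 6,396 | NA |
| PICKY | 10,442 | 110 | 10,332 | 98.95 | 62 | 133 |
| YODA | 10,526 | 136 | 10,390 | 98.71 | 4,133 | 768 |
| Proposed method | 10,675 | 0 | 10,675 | 100 | 6 | 428 |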



Table 5.19: Evaluation for yeast.


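| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 6,645 | 1,922 | 4,723 | 71.08 | 2,423 | NA |
| OligoArray | 6,156 | 3,016 | 3,140 | 51.01 | 5,160 | 1,300 |
| OligoPicker | 6,249 | 208 | 6,041 | 96.67 | 540 | NA |
| OligoWiz | 6,702 | 875 | 5,827 | 86.94 | 3,661 | NA |
| PICKY | 5,850 | 138 | 5,712 | 97.64 | 121 | 194 |
| YODA | 6,128 | 34 | 6,094 | 99.45 | 9,023 | 1,300 |
| Proposed method | 6,178 | 0 | 6,178 | 100 | 9 | 569 |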

Table 5.20: Evaluation for plasmodium.


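| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 8,991 | 7,455 | 1,536 | 17.08 | 4,490 | NA |
| OligoArray | 5,370 | 4,225 | 1,145 | 21.32 | 49,186 | 3,000 |
| OligoPicker | 5,964 | 1,671 | 4,293 | 71.98 | 925 | NA |
| OligoWiz | 9,517 | 7,811 | 1,706 | 17.93 | 41,828 | NA |
| PICKY | 4,138 | 131 | 4,007 | 96.83 | 43 | 230 |
| YODA | 3,885 | 55 | 3,830 | 98.58 | 3,745 | 515 |
| Proposed method | 4,527 | 0 | 4,527 | 100 | 14 | 658 |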

Table 5.21: Evaluation for zebrafish.


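| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 8,481 | 6,215 | 2,266 | 26.72 | 6,764 | NA |
| OligoArray | 7,573 | 5,233 | 2,340 | 30.90 | 33,309 | 3,800 |
| OligoPicker | 8,226 | 579 | 7,647 | 92.96 | 1,478 | NA |
| OligoWiz | 12,238 | 8,810 | 3,428 | 28.01 | 47,075 | NA |
| PICKY | 7,428 | 382 | 7,046 | 94.86 | 345 | 506 |
| YODA | 7,855 | 76 | 7,779 | 99.03 | 29,956 | 770 |
| Proposed method | 7,989 | 0 | 7,989 | 100 | 31 | 1,211 |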



Table 5.22: Evaluation for drosophila.


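| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 16,501 | 10,590 | 5,911 | 35.82 | 8,022 | NA |
| OligoArray | 10,034 | 5,803 | 4,231 | 42.17 | 52,325 | 3,300 |
| OligoPicker | 12,245 | 802 | 11,443 | 93.45 | 2,343 | NA |
| OligoWiz | 18,962 | 10,701 | 8,261 | 43.57 | 22,780 | NA |
| PICKY | 10,484 | 304 | 10,180 | 97.10 | 493 | 671 |
| YODA | 11,613 | 163 | 11,450 | 98.60 | 37,967 | 1,500 |
| Proposed method | 11,826 | 0 | 11,826 | 100 | 34 | 1,636 |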

Table 5.23: Evaluation for chicken.


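| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 26,036 | 20,357 | 5,679 | 21.81 | 10,222 | NA |
| OligoArray | 15,984 | 12,136 | 3,848 | 24.07 | 62,202 | 2,400 |
| OligoPicker | 17,485 | 1,366 | 16,119 | 92.19 | 2,995 | NA |
| OligoWiz | 26,235 | 15,466 | 10,769 | 41.05 | 20,073 | NA |
| PICKY | 13,781 | 315 | 13,466 | 97.71 | 422 | 685 |
| YODA | 16,306 | 236 | 16,070 | 98.55 | 44,447 | 1,100 |
| Proposed method | 16,692 | 0 | 16,692 | 100 | 38 | 1,661 |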

Table 5.24: Evaluation for celegans.


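| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 30,788 | 25,057 | 5,731 | 18.61 | 14,350 | NA |
| OligoArray | 23,142 | 19,503 | 3,639 | 15.72 | 76,475 | 3,000 |
| OligoPicker | 24,086 | 4,116 | 19,970 | 82.91 | 5,027 | NA |
| OligoWiz | 30,935 | 18,617 | 12,318 | 39.82 | 20,572 | NA |
| PICKY | 16,807 | 1,012 | 15,795 | 93.98 | 393 | 725 |
| YODA | 20,941 | 406 | 20,535 | 98.06 | 91,543 | 1,000 |
| Proposed method | 21,724 | 0 | 21,724 | 100 | 44 | 1,755 |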



Table 5.25: Evaluation for arabidopsis.


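| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 27,918 | 22,641 | 5,277 | 18.90 | 11,017 | NA |
| OligoArray | 19,340 | 16,771 | 2,569 | 13.28 | 53,030 | 3,200 |
| OligoPicker | 23,687 | 6,107 | 17,580 | 74.22 | 8,387 | NA |
| OligoWiz | 28,952 | 16,635 | 12,317 | 42.54 | 19,048 | NA |
| PICKY | 18,584 | 3,447 | 15,137 | 81.45 | 436 | 756 |
| YODA | 20,318 | 271 | 20,047 | 98.67 | 291,012 | 683 |
| Proposed method | 21,326 | 0 | 21,326 | 100 | 45 | 1,826 |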

Table 5.26: Evaluation for maize.


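| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 58,522 | 38,224 | 20,298 | 34.68 | 23,629 | NA |
| OligoArray | 42,423 | 30,736 | 11,687 | 27.55 | 60,612 | 2,500 |
| OligoPicker | 49,475 | 9,835 | 39,640 | 80.12 | 16,123 | NA |
| OligoWiz | 58,579 | 48,243 | 10,336 | 17.64 | 30,986 | NA |
| PICKY | 26,506 | 1,757 | 24,749 | 93.37 | 442 | 814 |
| YODA | 41,198 | 1,285 | 39,913 | 96.88 | 205,416 | 1,200 |
| Proposed method | 43,614 | 0 | 43,614 | 100 | 59 | 1,952 |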

Table 5.27: Evaluation for mouse.


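| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 34,491 | 31,848 | 2,643 | 7.66 | 27,869 | NA |
| OligoArray | 20,353 | 18,164 | 2,189 | 10.76 | 310,259 | 8,000 |
| OligoPicker | 23,779 | 5,275 | 18,504 | 77.82 | 9,997 | NA |
| OligoWiz | 35,283 | 29,799 | 5,484 | 15.54 | 104,003 | NA |
| PICKY | 12,473 | 631 | 11,842 | 94.94 | 457 | 1,419 |
| YODA | 19,399 | 410 | 18,989 | 97.89 | 135,605 | 933 |
| Proposed method | 20,200 | 0 | 20,200 | 100 | 99 | 3,321 |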



Table 5.28: Evaluation for human.


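| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 27,923 | 26,288 | 1,635 | 5.86 | 22,114 | NA |
| OligoArray | 18,083 | 16,589 | 1,494 | 8.26 | 87,132 | 6,000 |
| OligoPicker | 21,410 | 4,056 | 17,354 | 81.06 | 8,307 | NA |
| OligoWiz | 28,205 | 25,119 | 3,086 | 10.94 | 86,908 | NA |
| PICKY | 10,807 | 495 | 10,312 | 95.42 | 392 | 1,507 |
| YODA | 17,997 | 163 | 17,834 | 99.09 | 122,281 | 1,200 |
| Proposed method | 18,781 | 0 | 18,781 | 100 | 109 | 3,511 |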

Table 5.29: Evaluation for mouserna.


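| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 34,856 | 33,529 | 1,327 | 3.81 | 37,266 | NA |
| OligoArray | 18,343 | 17,209 | 1,134 | 6.18 | 373,770 | 7,900 |
| OligoPicker | 21,757 | 5,428 | 16,329 | 75.05 | 10,195 | NA |
| OligoWiz | 36,585 | 34,456 | 2,129 | 5.82 | 103,987 | 416 |
| PICKY | 9,483 | 406 | 9,077 | 95.72 | 445 | 1,938 |
| YODA | 17,267 | 235 | 17,032 | 98.64 | 141,190 | 1,700 |
| Proposed method | 17,963 | 0 | 17,963 | 100 | 151 | 4,477 |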

Table 5.30: Evaluation for rice.


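| Algorithm | Total oligos | Bad oligos | Good oligos | % good oligos | Time (sec) | Space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 66,520 | 62,797 | 3,723 | 5.60 | 76,180 | NA |
| OligoArray | 19,814 | 17,857 | 1,957 | 9.88 | 398,915 | 10,000 |
| OligoPicker | 38,692 | 14,756 | 23,936 | 61.86 | 31,106 | 1,754 |
| OligoWiz | 66,710 | 60,551 | 6,159 | 9.23 | 249,976 | 467 |
| PICKY | 13,365 | 1,468 | 11,897 | 89.02 | 617 | 2,298 |
| YODA | 25,891 | 704 | 25,187 | 97.28 | 457,827 | 1,900 |
| Proposed method | 28,552 | 0 | 28,552 | 100 | 215 | 5,375 |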



Table 5.31 shows the total results of all algorithms on all data sets.

Table 5.31: Total comparison of the algorithms for all data sets.

| Algorithm | Total oligos generated | Total bad | Total good | % total good | Total time (sec) | Total space (MB) |
|---|---|---|---|---|---|---|
| ArrayOligoSelector | 365,571 | 289,103 | 76,468 | 20.92 | 249,467 | NA |
| OligoArray | 223,103 | 172,817 | 50,286 | 22.54 | 1,568,961 | 60,300 |
| OligoPicker | 269,971 | 54,715 | 215,256 | 79.73 | 98,343 | NA |
| OligoWiz | 376,965 | 281,080 | 95,885 | 25.44 | 762,536 | NA |
| PICKY | 166,126 | 10,697 | 155,429 | 93.56 | 4,745 | 12,083 |
| YODA | 225,332 | 4,190 | 221,142 | 98.14 | 1,577,385 | 16,306 |
| Proposed method | 236,112 | 0 | 236,112 | 100 | 862 | 29,107 |

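Tables 5.16-5.30, together with the totals in Table 5.31, show that the proposed algorithm for designing the optimal set of oligonucleotide probes selects more unique and efficient oligos than the best available algorithms proposed for the same task. Over all data sets it produced 236,112 good oligos and no bad ones, the largest number of good oligos among the compared methods. It was also the fastest, taking 862 seconds in total: PICKY, the next fastest, needed 4,745 seconds (about 5.5 times longer), and the remaining programs were slower by two to three orders of magnitude.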
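The totals in Table 5.31 can be checked with a few lines of arithmetic. The following Python sketch is illustrative only and is not part of the thesis software: the dictionary copies the totals from Table 5.31, and the function names `percent_good` and `time_ratio` are assumptions introduced here.

```python
# Totals copied from Table 5.31:
# (oligos generated, bad oligos, good oligos, running time in seconds).
TOTALS = {
    "ArrayOligoSelector": (365_571, 289_103, 76_468, 249_467),
    "OligoArray": (223_103, 172_817, 50_286, 1_568_961),
    "OligoPicker": (269_971, 54_715, 215_256, 98_343),
    "OligoWiz": (376_965, 281_080, 95_885, 762_536),
    "PICKY": (166_126, 10_697, 155_429, 4_745),
    "YODA": (225_332, 4_190, 221_142, 1_577_385),
    "Proposed method": (236_112, 0, 236_112, 862),
}


def percent_good(name: str) -> float:
    """Percentage of generated oligos that were classified as good."""
    generated, _bad, good, _time = TOTALS[name]
    return 100.0 * good / generated


def time_ratio(name: str, reference: str = "Proposed method") -> float:
    """How many times longer `name` ran than `reference`."""
    return TOTALS[name][3] / TOTALS[reference][3]


if __name__ == "__main__":
    for name in TOTALS:
        print(f"{name:20s} {percent_good(name):6.2f}% good, "
              f"{time_ratio(name):8.1f}x the time of the proposed method")
```

Running the script prints, for each algorithm, the percentage of good oligos and its total running time relative to the proposed method.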


Chapter 6

Summary and conclusion

The algorithm proposed in this work gives researchers in biology and medicine a tool for the accurate, rapid, easy, flexible, and free design of signature oligonucleotides. The oligonucleotides it designs can be used in a wide range of applications in biology and medicine, such as microarray design, PCR amplification, gene identification, diagnostic tests for genetic diseases like breast cancer or cystic fibrosis, diagnostic tests for infectious diseases like hepatitis or AIDS, and the discovery of new drugs or treatments for a variety of diseases.

To maximize the specificity, sensitivity, and uniformity of the set of oligonucleotides, the proposed algorithm addresses the parameters that govern each criterion. For specificity, it employs the most sensitive sets of multiple spaced seeds to perform the homology search, which is considered the most important parameter affecting the specificity of the designed oligo probes. The other parameter affecting specificity is GC-content, which is handled by filtering out oligonucleotides whose GC-content lies outside a predefined range. For sensitivity, the algorithm avoids designing oligonucleotides that form stable secondary structures such as dimers and hairpins, by checking all potential regions of an oligonucleotide for self-complementarity. For uniformity, it identifies the optimal interval of a predefined small length within which the melting temperatures of the oligonucleotides are allowed to lie.
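The three filters summarized above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the minimum loop length, and the example thresholds are assumptions made here for illustration. The self-complementarity check only looks for hairpin-like stems, and the interval search assumes that "optimal" means the interval containing the most melting temperatures. The melting temperatures are taken as given, since no formula is restated in this chapter.

```python
from typing import List, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(oligo: str) -> float:
    """Fraction of G and C bases in the oligo."""
    oligo = oligo.upper()
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def has_hairpin_stem(oligo: str, stem_len: int, min_loop: int = 3) -> bool:
    """True if some stem_len-mer is followed, at least min_loop bases
    later, by its reverse complement (a hairpin-like stem)."""
    oligo = oligo.upper()
    for i in range(len(oligo) - stem_len + 1):
        partner = reverse_complement(oligo[i:i + stem_len])
        if oligo.find(partner, i + stem_len + min_loop) != -1:
            return True
    return False


def best_tm_window(tms: List[float],
                   width: float) -> Tuple[Optional[float], Optional[float], int]:
    """Closed interval [low, low + width] holding the most Tm values.

    Returns (low, high, count); (None, None, 0) for an empty input.
    """
    values = sorted(tms)
    best: Tuple[Optional[float], Optional[float], int] = (None, None, 0)
    left = 0
    for right, current in enumerate(values):
        # Shrink the window from the left until it fits within `width`.
        while current - values[left] > width:
            left += 1
        count = right - left + 1
        if count > best[2]:
            best = (values[left], values[left] + width, count)
    return best


if __name__ == "__main__":
    # Hypothetical thresholds, chosen only for this example.
    print(0.4 <= gc_content("GGCCAATT") <= 0.6)
    print(has_hairpin_stem("GGGGGAAAACCCCC", stem_len=5))
    print(best_tm_window([60.0, 71.5, 72.0, 73.0, 74.5, 80.0], width=3.0))
```

A candidate oligonucleotide would be kept only if its GC-content lies in the chosen range, no stem is found, and its melting temperature falls inside the selected interval.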

The results of the proposed algorithm were compared with those of the best-known algorithms and software programs for the same task. To make the comparison independent and fair, a separate program was written that uses highly sensitive multiple spaced seeds to identify bad oligos, that is, oligos designed by any of the algorithms whose similarity to non-target sequences exceeds the allowed level. The comparison showed that the proposed algorithm finds and selects unique and more efficient oligonucleotides while also running faster than every other algorithm compared, in most cases by orders of magnitude.
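The idea of detecting similarity with multiple spaced seeds can be sketched in a few lines of Python. This is a simplified illustration, not the evaluation program described above: the seed patterns are hypothetical (they are not the seeds used in the thesis), the function names are assumptions, and the sketch reports only whether a seed hit exists, without the follow-up check of how similar the two sequences are. In a seed, `1` marks a position that must match and `*` marks a position that is ignored.

```python
from typing import Iterable, Set


def seed_key(window: str, seed: str) -> str:
    """Letters of `window` at the positions where `seed` has a '1'."""
    return "".join(base for base, s in zip(window, seed) if s == "1")


def seed_keys(sequence: str, seed: str) -> Set[str]:
    """Keys of all windows of `sequence` that have the span of `seed`."""
    span = len(seed)
    return {seed_key(sequence[i:i + span], seed)
            for i in range(len(sequence) - span + 1)}


def shares_seed_hit(oligo: str, non_target: str, seeds: Iterable[str]) -> bool:
    """True if, for at least one seed, some window of the oligo and some
    window of the non-target agree at every '1' position of that seed."""
    return any(seed_keys(oligo, seed) & seed_keys(non_target, seed)
               for seed in seeds)


if __name__ == "__main__":
    oligo = "ACGTA"
    non_target = "GGACTTAGG"  # contains ACTTA: one mismatch against ACGTA
    print(shares_seed_hit(oligo, non_target, ["11111"]))           # contiguous seed
    print(shares_seed_hit(oligo, non_target, ["11111", "11*11"]))  # with a spaced seed
```

In this example the contiguous seed misses the similar region because of the single mismatch, while the spaced seed finds it, which is why spaced seeds are more sensitive for this kind of search.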

As future work, it would be worthwhile to study the thermodynamics of DNA oligonucleotides in more depth and to find more precise formulas for predicting the melting temperature of DNA sequences. Moreover, finding a good parameter for assessing low-complexity regions would be an interesting research direction.


Bibliography

Alberts, B., Johnson, A., Lewis, J., Raff, M., Bray, D., Hopkin, K., Roberts, K., and

Walter, P. (2003). Essential Cell Biology, Second Edition. Garland Science/Taylor

& Francis Group.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.

Bozdech, Z., Zhu, J., Joachimiak, M., Cohen, F., Pulliam, B., and DeRisi, J. (2003). Expression profiling of the schizont and trophozoite stages of Plasmodium falciparum with a long-oligonucleotide microarray. Genome Biology, 4(2), R9.

Chou, H.-H., Hsia, A.-P., Mooney, D. L., and Schnable, P. S. (2004). Picky: oligo

microarray design for large genomes. Bioinformatics, 20(17), 2893–2902.

Farabee, M. J. (2007). On-Line Biology Book. Estrella Mountain Community College.

Feng, S. and Tillier, E. R. (2007). A fast and flexible approach to oligonucleotide

probe design for genomes and gene families. Bioinformatics, 23(10), 1195–1202.

Godfrey-Smith, P. and Sterelny, K. (2008). Biological information. In E. N. Zalta,

editor, The Stanford Encyclopedia of Philosophy. Fall 2008 edition.



Golding, B., Morton, D., and Haerty, W. (2011). Elementary Sequence Analysis. E-Book text for the course Biology 3S03, Department of Biology, McMaster University, http://helix.biology.mcmaster.ca/3S03.pdf.

Hancock, J. M. and Armstrong, J. S. (1994). SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Computer Applications in the Biosciences, 10(1), 67–70.

Horspool, D. (2008). An overview of the central dogma of molecular biochemistry

with all unusual flows of information included (in green). In Wikipedia, the Free

Encyclopedia.

Ilie, L. and Ilie, S. (2007). Multiple spaced seeds for homology search. Bioinformatics,

23(22), 2969–2977.

Ilie, L., Ilie, S., and Bigvand, A. M. (2011). SpEED: fast computation of sensitive spaced seeds. Bioinformatics.

Kaderali, L. and Schliep, A. (2002). Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics, 18(10), 1340–1349.

Kane, M. D., Jatkoe, T. A., Stumpf, C. R., Lu, J., Thomas, J. D., and Madore, S. J.

(2000). Assessment of the sensitivity and specificity of oligonucleotide (50mer)

microarrays. Nucleic Acids Research, 28(22), 4552–4557.

Li, F. and Stormo, G. D. (2001). Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17(11), 1067–1076.



Li, M., Ma, B., Kisman, D., and Tromp, J. (2004). PatternHunter II: highly sensitive and fast homology search. Journal of Bioinformatics and Computational Biology, 2(3), 417–439.

Lipman, D. and Pearson, W. (1985). Rapid and sensitive protein similarity searches.

Science, 227(4693), 1435–1441.

Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology, 14, 1675–1680.

Ma, B., Tromp, J., and Li, M. (2002). PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3), 440–445.

Manber, U. and Myers, G. (1990). Suffix arrays: a new method for on-line string searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’90, pages 319–327, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.

Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based

on dynamic programming. J. ACM, 46, 395–415.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443–453.

Nielsen, H. B., Wernersson, R., and Knudsen, S. (2003). Design of oligonucleotides for microarrays and perspectives for design of multi-transcriptome arrays. Nucleic Acids Research, 31(13), 3491–3496.

Noe, L. and Kucherov, G. (2005). YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Research, 33(suppl 2), 540–543.

Nordberg, E. K. (2004). YODA: selecting signature oligonucleotides. Bioinformatics, 21(8), 1365–1370.

Owczarzy, R., Moreira, B. G., You, Y., Behlke, M. A., and Walder, J. A. (2008). Predicting stability of DNA duplexes in solutions containing magnesium and monovalent cations. Biochemistry, 47(19), 5336–5353. PMID: 18422348.

Puglisi, S. J., Smyth, W. F., and Turpin, A. H. (2007). A taxonomy of suffix array

construction algorithms. ACM Comput. Surv., 39.

Rahmann, S. (2003). Fast large scale oligonucleotide selection using the longest common factor approach. Journal of Bioinformatics and Computational Biology, 1(2), 343–361.

Reymond, N., Charles, H., Duret, L., Calevro, F., Beslon, G., and Fayard, J.-M. (2004). ROSO: optimizing oligonucleotide probes for microarrays. Bioinformatics, 20(2), 271–273.

Rimour, S., Hill, D., Militon, C., and Peyret, P. (2005). GoArrays: highly dynamic and efficient microarray probe design. Bioinformatics, 21(7), 1094–1103.

Rouillard, J., Zuker, M., and Gulari, E. (2003). OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Research, 31(12), 3057–3062.



Rychlik, W., Spencer, W., and Rhoads, R. (1990). Optimization of the annealing temperature for DNA amplification in vitro. Nucleic Acids Research, 18(21), 6409–6412.

SantaLucia, J. (1998). A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proceedings of the National Academy of Sciences, 95(4), 1460–1465.

Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195–197.

Smyth, W. F. (2003). Computing Patterns in Strings. Pearson Addison-Wesley.

Villarreal, M. R. (2008). Main protein structures levels. In Wikipedia, the Free

Encyclopedia.

Wang, X. and Seed, B. (2003). Selection of oligonucleotide probes for protein coding

sequences. Bioinformatics, 19(7), 796–802.

Ziv, J. and Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3), 337–343.

