1 the art of dna strings: sixteen years of dna coding theorythe art of dna strings: sixteen years of...
TRANSCRIPT
1
The Art of DNA Strings: Sixteen Years of DNACoding Theory
Dixita Limbachiya, Bansari Rao and Manish K. GuptaDhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar, Gujarat, 382007 IndiaEmail:[email protected], bansari s [email protected] and [email protected]
Abstract
The idea of computing with DNA was given by Tom Head in 1987, however in 1994 in a seminal paper, the actual successfulexperiment for DNA computing was performed by Adleman. The heart of the DNA computing is the DNA hybridization, however,it is also the source of errors. Thus the success of the DNA computing depends on the error control techniques. The classicalcoding theory techniques have provided foundation for the current information and communication technology (ICT). Thus itis natural to expect that coding theory will be the foundational subject for the DNA computing paradigm. For the successfulexperiments with DNA computing usually we design DNA strings which are sufficiently dissimilar. This leads to the constructionof a large set of DNA strings which satisfy certain combinatorial and thermodynamic constraints. Over the last 16 years, manyapproaches such as combinatorial, algebraic, computational have been used to construct such DNA strings. In this work, we surveythis interesting area of DNA coding theory by providing key ideas of the area and current known results.
I. INTRODUCTION
Figure 1: DNA structure withits four nucleotides A-Adenine,G-Guanine, C-Cytosine and T -Thymine. These are the basic build-ing blocks of DNA which are heldby Hydrogen bonds in a doublehelical manner. Each DNA baseis paired with the complementarybases such that the yellow band isA which is connected to its comple-mentary base T (green band) whileC (blue band) is paired with red G(red band)
Information and Communication Technology (ICT) has come a long way in the last 60years. We have now connected world of devices talking to each other or performing thecomputation. This is possible due to the advancement of coding and information theory. In1994, two new directions in ICT emerged viz. quantum computing and DNA computing.The heart of any computing or communications is coding theory. Thus two new directionsin coding theory also emerged such as quantum coding theory and DNA coding theory. Thispaper focuses on the later area giving a comprehensive overview of techniques and challengesof DNA coding theory from last 16 years. In 1994, L.Adleman performed the computationusing DNA strands to solve an instance of the Hamiltonian path problem giving birth toDNA computing [1]–[4]. DNA is a double strand built by the pairing the four basic buildingunits A-(Adenine), C-(Cytosine), G-(Guanine), T-(Thymine) which are called nucleotides.The DNA strand is held by the important feature called complementary base pairing whichconnects the Watson Crick complementary bases with each other denoted by AC = T andGC = C. The backbone of DNA is an alternating chains of sugar and phosphate. DNA isan ideal source of computing due to its stable, dense and its self replicating property [5].After the successful experiment by Adleman, area of DNA computing [6] flourished intodifferent directions like DNA tile assembly [7] [8], building of DNA nano-structures [9][10], studying error correcting properties in DNA sequences [11] [12] and DNA based datastorage system [13]. But it was only with the identification of mathematical properties inDNA sequences [14] [15], it inspired many researchers to explore this new amalgamation ofbiology and coding theory [16]. DNA computing [17] promises to solve many NP-completeproblems like Hamilton path problem, satisfiability problem (SAT) [18] and many others with massive parallelization [19]. Toperform computation using DNA strands, a specific set of DNA sequences are required with particular properties. Such set ofDNA strands satisfying various constraints [20], [21] play an important role in biomolecular computation [22] [23].
The aim of this manuscript is to elucidate the current state of art of DNA codes as shown in Figure 2, DNA codes constraintsand especially the new methods contributing to construct DNA codes from algebraic coding. It summarizes the research onDNA codes which may help the researcher to get a broad picture of the area. Also, it gives brief overview on the applicationsof DNA codes.
The Paper is organized as follows. Section II and III discusses the DNA code design problem and constraints on DNA codes.Section IV describes various approaches for the construction of DNA codes. Section V gives a brief description on softwaretools for DNA sequence design. Section VI includes description on bounds of DNA codes. Section VII gives a brief note onapplications of DNA codes. Section VIII shows results on DNA codes for 0 ≤ n ≤ 36 and 0 ≤ d ≤ 20. Section IX concludesthe paper with general remarks.
arX
iv:1
607.
0026
6v1
[cs
.IT
] 1
Jul
201
6
II. DNA CODE DESIGN PROBLEM
To perform the DNA computation, DNA strands react with each other by Watson Crick base pairing and form perfectmatch. But in some situation, DNA strands may not form perfect base pairing and react in undesirable manner. One situationis formation of secondary structure in which first half strand of the DNA strand forms complementary with its own otherhalf forming the hair pin like structure. This kind of structure interrupt the desired computation. Such secondary structureshave to be avoided by designing the DNA strands carefully. Also DNA strands may bind to another DNA strands forming thecomplementary base pairing with few base pairs creating an error. One way to avoid this is to ensure that every two DNAstrands differ in more than d locations where d depends on the application of DNA computation. This property can be obtainedby defining different distance like the Hamming distance, Lee distance, Edit distance, Deletion distance etc. between two DNAstrands.
DNA codes design problem [24]–[26] is to develop a set of the DNA codewords of length n over DNA alphabets ΣDNA ={A, T,G,C} with predefined distance d [27] [28] and satisfying maximum set of constraints [29]. The main objective is tofind the largest possible set of codewords M of length n over alphabet ΣDNA = {A,C,G, T} feasible with respect to set ofconstraints [30] with distance d that enable the construction of better error correcting codes [31]–[33].
Figure 2: DNA codes Time line: The state of art of DNA codes is showcased from 1994 to 2016. Different work mentionedwith authors in chronological order.
Definition (DNA code). A DNA code CDNA(n,M, d) ⊂ ΣDNA = {A, T,G,C} with each DNA codeword of length n andsize M and minimum distance d. Here A denotes Adenine, T denotes Thymine, G denotes Guanine and C denotes Cytosineas the nucleotides in DNA.
III. CONSTRAINTS ON DNA CODES
There are different types of constraints that DNA codes must satisfy. Three categories in which these constraints can beclassified are as follows:
1) Combinatorial constraints2) Thermodynamic constraints3) Application Oriented constraintsThe constraints which should be followed by DNA codes are listed below :1) Hamming distance constraint (n, d, w) -
The Hamming distance constraint can be defined as dH(xDNA, yDNA) ≥ d ∀ xDNA, yDNA ∈ CDNA for some Hammingdistance d [14]. A set of codewords with length n, size M and minimum Hamming distance d satisfying Hammingconstraint is denoted by CDNA(n,M, d). Hamming distance is calculated by the total number of places at which twoDNA codewords differ. For CDNA(n,M, d), the minimum of the distances is considered as the Hamming distance.Aq(n, d) denotes the maximum size of a code with codewords of length n and distance d over alphabet size q, in caseof DNA codewords q = 4.Example 1. Let xDNA = ATGACT and yDNA = ACTAGC, then dH(xDNA, yDNA) = 4 where xDNA, yDNA ∈ CDNA
following the Hamming distance constraint with dH(xDNA, yDNA) ≥ 3.2) Reverse constraint(n, d) - The reverse constraint is HDNA(xR
DNA, yDNA) ≥ d ∀ xDNA, yDNA ∈ CDNA. A code satisfyingreverse constraint is called reverse code. AR
q (n, d) denotes the maximum size of a reverse code with length n andminimum Hamming distance d.
Page 2 of 19
Figure 3: DNA codes constraints classified as three categories combinatorial, thermodynamics and application orientedconstraints.
Example 2. Let xDNA = ATGACT , xRDNA = TCAGTA and yDNA = ATACAT . For n = 6 and d = 3, HDNA(xR
DNA, yDNA) ≥3. xDNA, yDNA are reverse code.
3) Reverse Complement constraint (n, d) - This RC-constraint is HDNA(xRDNA, yC
DNA) ≥ d ∀ xDNA, yDNA ∈ CDNA. DNAcode satisfying RC-constraint is called a reverse complement code. ARC
q (n, d) denotes the maximum size of a reversecomplement code with length n and minimum Hamming distance d [14].Example 3. Let xDNA = ATGACT , xR
DNA = TCAGTA and yDNA = GTACAC, yCDNA = CATGTG. For n = 6 and
d = 3, HDNA(xRDNA, yC
DNA) = 4 ≥ 3. xDNA, yDNA are reverse complement code.4) GC-content constraint (n, d, w) - The set of codewords with length n, distance d and GC weight w ,where w is total
number of Gs and Cs present in the DNA strand viz. wxDNA= |{xi : xDNA = (xi) , xi ∈ {C,G}}| [34]–[36]. Generally
w = bn/2c.Example 4. Let xDNA = ATTGCT then xDNA /∈ CDNA for n = 6 and w = 3.
5) Melting temperature constraint (n, d, w)- Melting temperature Tm is a temperature at which half of the DNA strands arehybridized and half are not. Melting is opposite of hybridization in which two strands get separated. It is advantageousto have the codewords with the similar melting temperatures as it will enable the hybridization of multiple DNA strandssimultaneously. Hence we select the codewords with similar melting temperature calculated by Nussinov’s algorithm[25]. For each xDNA ∈ CDNA have identical melting temperature [29] .
6) Thermodynamic constraint (n, d, w) - For the DNA stability, it is necessary for all the DNA codes should have comparablefree energy ∆G◦ [37]–[39] above some threshold [40], [41]. As DNA with minimum free energy is more stable and hencewe consider only those DNA codewords in a DNA code that have approximately comparable free energy in the set ofDNA codewords. Free Energy ∆G◦ for the given DNA codewords xDNA, yDNA ∈ CDNA can be obtained by the equation,
|∆G◦(xDNA)−∆G◦(yDNA)| ≤ δ
where δ > 0 is a constant.7) Uncorrelated-correlated constraint (n, d, w) - A codeword ∈ CDNA if shifted by x units where x ≤ n should not match
with any of the other codeword ∈ CDNA [42].
Example 5. Let XDNA = CATCATC and YDNA = ATCATCGG. X ◦ Y = 0100100, as depicted below.
X = C A T C A T CY = A T C A T C G G 0Y = A T C A T C G G 1Y = A T C A T C G G 0Y = A T C A T C G G 0Y = A T C A T C G G 1Y = A T C A T C G G 0Y = A T C A T C G G 0
Page 3 of 19
The constraints (1) to (3) are used to avoid undesirable hybridization between different DNA strands and (4) to (6) arethe DNA constraints which ensures that all the codewords have similar thermodynamic characteristics to perform uniformcomputation.
IV. VARIOUS APPROACHES FOR THE DNA CODES CONSTRUCTIONS
There are different approaches [43] for the construction of the DNA codes with finite length n, defined distance d and set ofconstraints with respect to application. The set of constraints essential for the DNA codes is subject to the application. Thereexist algorithmic, theoretic and software simulation method [44] [45] [46] approaches to design the DNA codes. Optimality ofthe DNA code can be obtained by construction of the DNA codes in a way that every codeword in the set follows maximumnumber of constraints for a large value of n and large minimum distance d with minimum errors in DNA computation [47].
Figure 4: DNA codes construction Methods are classified as Search algorithms, Algebraic Coding methods and Softwaresimulations. Search algorithms include Variable neighborhood search, Simulated Annealing, Stochastic Local Search. TemplateBased method uses a template to design a DNA codeword. Algebraic coding method is construction of DNA codes by usingAlgebraic structures like Fields and Rings. Altruistic Approach is construction of DNA codewords in altruistic way in thiswork.
1) Variable Neighborhood Search Approach: This approach employs different local search algorithms to search the DNAcodes [48]–[51]. Some of the search algorithms are described here.
a) Seed Building(SB) : Seed Building (SB) algorithm examines all the possible codewords randomly with respect toseed codeword where seed codewords are the initial set of codewords with the required constraint. For more detailson the algorithm, [52] can be referred. The drawback of the approach is that it is inefficient for larger values of nbecause of the computational and time complexity invested for the development of feasible set of codewords.Example 6. Consider the n = 3, d = 1 and GC-content w = 2 and initial seed be codewords with d = 1 andw = 2 are AGG and AGC then the resulting codewords at first iteration with HD and GC-content constraints areGGC,GCA,GTC,CCG,CGC,CTC, TGG. Note that ACC,GCT,CAG are also codewords with GC-content2 but at distance d = 3 so will be added in the next iteration for seed d = 3.
b) Clique Search : In this method a random subset of the codewords of a given code is removed, leaving a partialcode. All the codewords are generated and those compatible with the ones left in the code are identified and agraph is built where the identified codewords are nodes, and an edge exists between compatible codewords.Example 7. Let n = 4, d = 3 and w = 2 then the partial code be CTTC,CGAA, TGGT,GTGA with HD, RCand GC constraints. The clique search will result in codewords CACT,GCTT,AGTG and AAGC.
c) Hybrid Search: This method combines two approaches of seed building and clique search. It uses seed building togenerate the partial code and clique search to search for the best codewords [53].
d) Greedy Approach: These types of algorithms removes the worst DNA code at each iteration from a set of codesevery time the algorithm is iterated. But the problem with this approach is that it doesn’t always produce the bestresults. In the process of removing the worst code at each stage it may remove a potentially good code at earlierstages and may not remove bad code at the later stages. So it doesn’t always produce the best optimal result [54].Example 8. Let n = 3, d = 2 and w = 2 then codewords from greedy search AGG,ACC,GGC,GCA,CCG,CTC.
e) Lexicographic Approach: These types of algorithms take into consideration a particular arrangement of the DNAcodewords. It may be an alphabetical order or ascending order or any such kind of ordered arrangement of codessuch that the arrangement of the DNA codewords satisfying the DNA constraints are ordered in a specific manner[55].
2) Simulated Annealing Approach: Simulated Annealing is a meta heuristic algorithm derived from thermodynamicprinciples [52]. These types of algorithms work on the set of codewords in which not all the codewords in the given set
Page 4 of 19
satisfies the constraints specified. These algorithms attempts to change the codewords with the objective of reducing theconstraint violations to the constraints and try to make them feasible. The set with a feasible solution is derived whenthere is no violation of constraints.
3) Stochastic Local Search Approach: We search for the codewords with parameters (n, d, w = n/2) such that the lengthof the codewords is n and the GC-content is bn/2c and the minimum Hamming distance is atleast d. These types ofalgorithms work on random set of codewords while solving the problem. It generally considers an initial set of random kcodewords and then removes the one which doesn’t satisfy the constraints. For more details on the the algorithm, readercan refer to [56].Example 9. Let random codewords k = 16 then set is AA,AT,AG,AC,GA,GT,GC,GG,CT,CA,CC,CG, TA, TC, TG, TT . Codewords with n = 2, d = 2 and w = 1 are GC,AG,CA, TC.
4) Genetic Algorithm: In this work, genetic algorithm is used to search efficient and reliable codes by minimizing themis-hybridization error. This codewords generated were unique in terms of Hamming distance that satisfy the Hammingbound [57].
5) Template Based Method: This method was introduced in [58]–[60] that involves two step process. Initially a templateis designed which is mapped to binary error correcting codes and combination of template and codeword results intoDNA codewords. This method is not optimal because the code size is limited depending of the size of error correctingcode used and selection of template. Mapping of the template to the codewords is subjected to specific application.Example 10. Let template t = 1001101 and a codeword c = 0010111. Suppose the map define 11 → A, 10 → T ,01 → G and 00 → C for position of 1 and 0 in t is 1 = [AT ] and 0 = [GC] followed by position in codewords thenthe DNA codes will be TCGTAGA.
6) Algebraic Coding Approach: By using the algebraic coding, DNA codes are constructed from fields and rings bymapping the elements of the field and rings to the DNA nucleotides [61].
a) Codes over Fields:i) Linear codes over GF (4) : In this approach, DNA codewords are constructed from GF (4) by using different
one-to-one mapping the elements of GF (4) to DNA nucleotides [62]. The mapping is preferred from {0,1,ω, ω2} to {A,C,G,T} with respect to the codes used. There linear code [35] and additive codes.The methodused have improved the lower bounds on GC constraints and extended the result on the length of DNA code ton ≤ 30. Researchers extended this construction for non linear codes and cyclic codes [63]. Also DNA codes overGF (4) were constructed using BCH codes in [64] in which the protein and targeting sequences are identifiedas codewords of error-correcting BCH codes.
ii) Linear and Additive codes over GF (4): In this paper, DNA codes are constructed considering linear codesand additive codes over GF (4) pf odd length following Hamming distance constraint and reverse complementconstraint. In [65], DNA codes of length 7, 9, 11 and 13 have been considered. DNA nucleotides {A,C,G, T}have been mapped to 0, ω, ω and 1 respectively with ω = ω2 and ω2 + ω + 1 = 0. Each codeword has beenmapped to a polynomial and the Trace map Tr : GF (4)→ GF (2) is stated as :
Tr(x) = x+ x2
iii) Extended, Additive, Additive Extended Cyclic codes over GF (4): In referred work, DNA codes satisfying GC-content constraint and a minimum Hamming distance constraint were constructed using computer algebra systemsMagma [66] and Maple [67]. Longer codes of higher length 4 ≤ n ≤ 30 were derived from GF (4), additivecodes over GF (4) and Z4 (see Figure 5). Moreover it was claimed that by using different mapping from fields orrings to DNA codewords can result into different lower bounds. Further the bounds on the DNA codes satisfyingset of constraints were ameliorated by shortening and puncturing of obtained codes [68].
b) Codes over Rings : Algebraic construction of DNA codes was further extended to codes over rings. Different ringsare used to construct DNA codes by mapping rings elements to DNA nucleotides.i) DNA sequences generated by Z4 linear codes: In [69], a biological coding system which modeled the existence
of error correcting codes in the DNA structure. Model consists of an encoder and modulator. The encoder consistof a mapper that converts the DNA nucleotides to elements of and BCH codes over Z4) and a modulator consistof a genetic code, tRNA (transfer RNA that serves as the connecting link between the mRNA (messenger RNA)to the amino acid sequence of proteins) and ribosome (protein synthesizer of the cell) which is associated withsignals that convert the genetic codons to protein. The DNA and protein coding sequences from different specieshave been identified as the codewords over linear codes over Z4. A class of error correcting-code BCH codeswith parameters (n, k, 3) have been used in the encoder to construct the DNA codes over Z4.
In [70] [71], the self dual codes over Z4 are used for construction for DNA codes. Additionally, GC weightenumerator of the DNA codes that determines the number of Gs and Cs in the codeword. GC wright enumeratorhelps in the construction of DNA codes satisfying GC- content constraint. Self dual DNA codes over Z4 are
Page 5 of 19
Cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 30.
Extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 30.
Additive cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 19.
Additive extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 20.
Cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 24.
Extended cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 24.
Cosets of cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 20.
Cosets of extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 20.
Cosets of additive cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 14.
Cosets of additive extended cyclic DNA codes over GF (4) have been computed for 4 ≤ n ≤ 15.
Cosets of cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 20.
Cosets of extended cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 30.
Figure 5: DNA codes construction methods using Computer Algebra Systems such as Maple and Magma [68] .
developed by using mapping as A → 0 , C → 1, T → 2 and G → 3. The following generator matrix G overZ4 is considered and let K4 denote the codeword over Z4 formed from G. There are 16 codewords and wa(c)where a ∈ Z4 and k ∈ K4 as shown in Table I.
G =
1 1 1 10 2 0 20 0 2 2
K4 w0 w1 w2 w3 K4 w0 w1 w2 w3
(0000) 4 0 0 0 (1111) 0 4 0 0(2222) 0 0 4 0 (3333) 0 0 0 4(0202) 2 0 2 0 (1313) 0 2 0 2(2020) 2 0 2 0 (3131) 0 2 0 2(0022) 2 0 2 0 (1133) 0 2 0 2(2200) 2 0 2 0 (3311) 0 2 0 2(0220) 2 0 2 0 (1331) 0 2 0 2(2002) 2 0 2 0 (3113) 0 2 0 2
Table I: (4, 16, 3) code is generated by G where wa(c) where a ∈ Z4 and k ∈ K4 [70].
ii) Lifted Polynomials over F16: In [72], reversible codes by using special family of polynomials denoted as liftedpolynomials over F4 which generates the reversible codes of odd length over F16 are constructed. 4-liftedpolynomial is used to generate the DNA code of even length by using the correspondence between pair ofDNA nucleotides to elements of the ring F16. Table II preserves the property that if DNA pair is mapped to anelement of F16 then reverse of that DNA pair is mapped to fourth power of the element of F16. For example,α2 → GC then (α2)4 → CG.
iii) DNA codes over F2[u]/u4 − 1: In [73], cyclic DNA codes of odd length are obtained from F2[u]/(u4 − 1) ={a + bu + cu2 + du3 | a, b, c, d ∈ F2} where u4 = 1 commutative ring is considered. All its ideals are listedhere 〈0〉 = 〈(1 + u)4〉 ⊂ 〈(1 + u)3〉 ⊂ 〈(1 + u)2〉 ⊂ 〈(1 + u)〉 ⊂ R. The 16 elements of ring R are mapped to2-length nucleotides. The following mapping shown in Table III was considered in the paper [73]. This mappingpreserved the complementary and reverse property by adding 1 + u+ u2 + u3 and multiplying u2 respectively.To find complement of AA, we add 1 + u+ u2 + u3 to 0, it will give 1 + u+ u2 + u3 = TT . To find reverseAA, multiply 0 to u2 will result in 0 =AA.In [74] DNA cyclic codes of arbitrary length satisfying the reverse complement constraint are constructed byusing additive stem distance. The correspondence between the ring elements and DNA is established by followingmapping mentioned in Table IV. this preserves the reverse complement property of the DNA codewords by the
Page 6 of 19
Sr.No DNA Pair a Multiplicative(F16) Additive1. AA 0 -2. TT α0 13. AT α1 α4. GC α2 α2
5. AG α3 α3
6. TA α4 1 + α7. CC α5 α+ α2
8. AC α6 α2 + α3
9. GT α7 1 + α+ α3
10. CG α8 1 + α2
11. CA α9 α+ α3
12. GG α10 1 + α+ α2
13. CT α11 α+ α2 + α3
14. GA α12 1 + α+ α2 + α3
15. TG α13 1 + α2 + α3
16. TC α14 1 + α3
Table II: Mapping from DNA nucleotide Pair to element of F16. [72]
Element Map Element Map Element Map Element MapAA 0 AT 1 + u GT 1 CT 1+ u + u2
TT 1 + u + u2 + u3 TA u2 + u3 TG u2 TC 1 + u2 + u3
GG 1 + u2 GC u+ u2 AC 1 + u+ u3 AG uCC u+ u3 CG 1 + u3 CA u+ u2 + u3 GA u3
Table III: Mapping from DNA Nucleotide Pair to elements of the Ring F2 + uF2 + u2F2 + u3F2 where u4 = 1 [73].
x+ xc = u3 + u2 + u+ 1. The reverse of the DNA code is obtained by multiplying u2 to any element x of thering R.
Element Map Element Map Element Map Element MapGG 0 AT 1 + u GT 1 CT 1+ u + u2
CC 1 + u + u2 + u3 TA u2 + u3 TG u2 TC 1 + u2 + u3
GC 1 + u2 AA u+ u2 AC 1 + u+ u3 AG uCG u+ u3 TT 1 + u3 CA u+ u2 + u3 GA u3
Table IV: Mapping from DNA Nucleotide Pair to elements of the Ring F2 + uF2 + u2F2 + u3F2 where u4 = 1 [73].
iv) DNA cyclic codes over F2 + uF2 where u2 = 0: DNA codes of even length following reverse and reversecomplement constraints have been studied in [75]. The field F2 is a subring of R. A linear code C of lengthn over R is defined to be an additive submodule of the R-module Rn. A cyclic code of length n over Ris a linear code with the property that if (c0, c1, . . . , cn−1) ∈ C then (cn−1, c0, . . . , cn−2) ∈ C. An n-tuplec = (c0, c1, . . . , cn−1) ∈ Rn is identified with the polynomial c0 + c1x + . . . + cn−1x
n−1 in the ring Rn =R[x]/ (xn − 1), which is called the polynomial representation of c = (c0, c1, ..., cn−1). Here R = F2 +uF2 with{0, u, 1, 1 + u} elements are in one to one correspondence with nucleotides A, T,G and C such that 0 → A,u → T , u+ 1 → C and 1 → G. The DNA codes of length 8 and 10 are obtained from C = g6 + u
(x5 + x
)and C = g1g
22 + ug2, ug1g2. Necessary and sufficient conditions for cyclic codes to follow the reverse and
reverse-complement properties have also have been studied. To preserve the reverse complement constraint ofind complement of A, we add u to 0, it will give u = T.
v) DNA codes over Z4+uZ4: DNA cyclic codes of odd lengths following reverse and reverse complement constraintare constructed in [76]. Here ring Z4+uZ4 = a+ ub | a, b ∈ Z4 with u2 = 0 is considered. Reversible and cyclicreversible complement codes are discovered in this paper. In this work defined a Gray map that allows them totranslate the properties of DNA codes to binary codewords i.e. φ : Z4+uZ4 → Z2
4 such that φ(a+ub) = (b, a+b)where a, b ∈ Z4. In this paper, 16 pairs of nucleotides which are mapped as shown in Table V.
Element Map Element Map Element Map Element MapAA 0 TT 1+u GG 1 CC uAT 2 TA 3+u GC 3 CG 2+uGT 2u CA 1+3u AC 3u TG 1+2uCT 2+3u GA 3+2u AG 2+2u TC 3+3u
Table V: Mapping from DNA nucleotide pair to element of the Ring Z4 + uZ4 where u2 = 0 [76].
vi) F4[u]/ < u2 + 1 > where u2 = 1: In [77] self-reciprocal complement cyclic codes from R with F4 ={0, 1, α, α+ 1} where α is the root of primitive polynomial x2 + x+ 1 over F2 are studied. The DNA code ofspecific length 6 over the ring is considered. One to one correspondence is establish between pairs of nucleotides
Page 7 of 19
and 16 elements of the ring. DNA cyclic codes constructed followed reverse complement, GC content andHamming distance constraints. The basis is {1, u+ 1} and then every element of R is expressed in the form ofa+ b(u+ 1) where a,b ∈ F4. The mapping is done as in Table VI
Element Map Element Map Element Map Element MapAA 0 TT α+ 1(α+ 1)u GT 1 CA α+ (α+ 1)uAG α TC 1 + (α+ 1)u AT α+ 1 TA (α+ 1)uTG u AC α+ 1 + αu GA αu CT 1 + α+ uGC 1 + αu CG α+ u CC 1 + u GG α+ αu
Table VI: Mapping between the pair of DNA nucleotides and Ring F4 [77].
vii) DNA codes from F2 + uF2 + vF2 + uvF2 with u2 = 0 and v2 = v: The structure of cyclic DNA codes of anarbitrary length over R2 = F2 +uF2 +vF2 +uvF2 was studied and the relation to codes over R1 = F2 +uF2 bydefining Gray map between R2 and R2
1 was established [78]. DNA codes following reverse, reverse complementconstraints are studied. The Gray map from R2 to R1 is defined as φ(a+ bv) = (a, a+ b). GC weight over thering was also introduced by using image of Gray map. One type of nontrivial automorphisms can be definedover R2 as follows : σ : F2 + uF2 + vF2 + uvF2 → F2 + uF2 + vF2 + uvF2,a+ bv → a+ (1 + v)b such that a, b ∈ F2 + uF2. Table is defined using : Φ (c) : C → S2n
D4,(a0 + b0v, a1 + b1v, . . . , an−1 + bn−1v) 7→ (a0, a1, . . . , an−1, a0 + b0, a1 + b1, . . . , an−1 + bn−1). Below is theTable VII for mapping elements of the ring to DNA described in the paper. For instance, (c0, c1, c2, c3) =(u+ v, u, v, 1) is mapped to TCTTAGTCGG.
Elements a Gray ImagesDoubleDNA Pairsζ(a)
Elements a Gray Images Double DNAPairs ζ(a)
0 (0,0) AA v (0,1) AGuv (0,u) AT v+uv (0,1+u) AC1 (1,1) GG 1+v (1,0) GA1+uv (1,1+u) GC 1+v+uv (1,u) GTu (u,u) TT u+v (u,1+u) TCu+uv (u,0) TA u+v+uv (u,1) TG1+u (1+u,1+u) CC 1+u+v (1+u,u) CT1+u+uv (1+u,1) CG 1+u+v+uv (1+u,0) CA
Table VII: Mapping between DNA pair and elements of the Ring F2 + uF2 + vF2 + uvF2 with u2 = 0 and v2 = v [78].
viii) codes over F4 + vF4: In linear, constacyclic and cyclic codes over the ring R = F4[v]/(v2− v) are constructedin [79]. The ring R = {a + vb|a, b ∈ F4} is non-chain finite semi-local Frobenius ring with 16 elements. The4 elements are F4 = {0, 1, ω, ω + 1} where ω2 = ω + 1. The Gray map from R to F4 × F4 is given by :
φ(c) = (a+ b, a).If a = (a1, a2, . . . , an) ∈ Rn, then the Hamming weight of a is the sum of the Hamming weights of itscomponents, i.e.w(a) =
∑ni=1 w(ai). The Hamming distance between a and b in R is d(a, b) = w(a− b). The
Lee weight of any element of R is the Gray image of its Hamming weight, i.e. wL(c) = wH(φ(c)). Below isthe Table VIII for mapping used in [80].
Elements Gray Images Double DNAPairs ξ(a) Elements Gray Images Double DNA
Pairs ξ(a)0 (0, 0) AA 1 (1, 1) TTω (ω, ω) CC 1 + ω (1 + ω, 1 + ω) GGv (1, 0) TA 1 + v (0, 1) ATv + ω (1 + ω, ω) GC 1 + v + ω (ω, 1 + ω) CGvω (ω, 0) CA 1 + vω (1 + ω, 1) GTω + vω (0, ω) AC 1 + ω + vω (1, 1 + ω) TGv + vω (1 + ω, 0) GA 1 + v + vω (ω, 1) CTω + v + vω (1, ω) TC 1 + v + ω + vω (0, 1 + ω) AG
Table VIII: Mapping between DNA Pair and elements of the Ring F4 + vF4 [79].
ix) R = F2[u]/(u6): In [81], DNA cyclic codes over a family R1 = F2[u]/(u6) and ring R2 = F2 + uF2 wherev2 = v satisfying reverse complement constraint have been constructed. In this a new family of DNA skewcyclic codes is introduced over ring R = F2 + vF2 = 0, 1, v, v + 1 where v2 = v. The ring R1 = F2[u]/(u6) ={a0 + a1u + a2u
2 + a3u3 + a4u
4 + a5u5; ai ∈ F2, u
6 = 0}. There is direct map between 64 elements of thering to 64 codons (three nucleotides) used in nature as a substrate for aminoacid synthesis shown in Table IX.
x) DNA codes over ring F2[u]/u2 − 1 : Ring R = F2[u]/u2 − 1 = {0, 1, u, 1 + u} where u2 = 1 was described[82] [83] . Elements of the ring were directly mapped to DNA nucleotides such that reverse complement
Page 8 of 19
DNACodons Ring Element DNA
Codons Ring Element DNACodons Ring Element DNA
Codons Ring Element
CCC u5+u4+u3+u2+u+1 GGG 0 ACT u3 + u2 + u+ 1 GTC u4 + u2 + u+ 1GGA u5 + u4 + u3 + u2 + u CCT 1 ACG u3 + u2 + u ACA u3 + u2 + u+ 1GGC u5 + u4 + u3 + u2 +1 CCG u TTT u4 + u2 + 1 GAC u5 + u3 + u2 + 1GGT u5 + u4 + u3 + u2 CCA u+ 1 TTG u4 + u2 + u AGG u5 + u3 + u+ 1AGG u5 + u4 + u3 + u+ 1 TCC u2 CTA u4 + u+ 1 GAT u5 + u3 + u2
CGG u5 + u4 + u2 + u+ 1 GCC u3 GTT u4 + u3 + 1 GTA u4 + u3 + u+ 1GAG u5 + u3 + u2 + u+ 1 CTC u4 GTG u4 + u3 + u ATT u4 + u3 + u2 + 1AGA u5 + u4 + u3 + u2 + u TCT u2 + 1 TCA u2 + u+ 1 ATA u4 + u3 + u2 + uAGC u5 + u4 + u3 + 1 TCG u2 + u CAA u5 + u2 + u ATC u4 + u3 + u2
ATG u4 + u3 + u2 + u+ 1 TAC u5 CAC u5 + u2 + 1 TGA u5 + u4 + uAGT u5 + u4 + u3 TAT u5 + 1 GCA u3 + u+ 1 AAT u5 + u2 + u+ 1CGA u5 + u4 + u2 + u GCT u3 + 1 TTA u4 + u3 AAA u5 + u3 + uCGC u5 + u4 + u2 + 1 GCG u3 + u ACC u3 + u2 TGC u5 + u4 + 1CGT u5 + u4 + u2 TAA u5 + u CAT u5 + u2 AAC u5 + u3 + 1TGG u5 + u4 + u+ 1 CTG u4 + u TGT u5 + u4 TCC u4 + u2
GAA u5 + u4 + u3 + u2 + u CTT u4 + 1 CAG u5 + u3 TAG u5 + u+ 1
Table IX: Mapping between 64 elements of the Ring F2 + uF2 + u2F2 + u3F2 + u4F2 + u5F2 and 64 codons [81].
constraint is conserved. By adding 1 + u to elements of rings, complement can be obtained. For example0→ A, u→ C, 1→ G, 1 + u→ T preserves the complement property: 0 = 1 + u and u = 1.The elements of the ring R can be mapped to the elements of F2 = {0, 1} via the map θ where θ(0) =θ(u + 1) = 0 and θ(1) = θ(u) = 1. Let C be a cyclic code in Rn. It can be extended to the map θ to a mapϕ : C → Z2[x]/(xn−1) defined by ϕ(a0+a1x+a2x
2+. . .+an−1xn−1) = θ(a0)+θ(a1)x+. . .+θ(an−1x
n−1).xi) DNA cyclic codes over F2 + uF2 : In [84] odd length codes over rings satisfying reverse complement, GC-
Content and thermodynamic constraints are studied. They are obtained from the cyclic complement reversiblecode. Infinite family of BCH DNA codes are constructed. The mapping Φ called Gray Map has been used tomap linear codes over R to binary linear codes. The Gray map Φ is the distance-preserving map (Rn, Leedistance) → (F 2n
2 ,Hamming distance).xii) Cyclic DNA codes over Z4 + wZ4 : Recently, cyclic DNA codes over R = Z4 + wZ4 where w2 = 2 and
S = Z4 + wZ4 + vZ4 + wvZ4 where v2 = vw have been discovered [85]. In this work, odd length DNAcyclic codes over R satisfying reverse and reverse complement constraint is studied . Also, a family of DNAskew cyclic codes with reverse complement property over R is constructed. Binary images of the cyclic DNAcodes over R and S is determined. The correspondence between elements of ring R and double DNA pairs areestablish as described in the Table X.
Elements Gray Images Double DNAPairs ξ(a) Elements Gray Images Double DNA
Pairs ξ(a)0 (0, 0) AA 1 (1, 0) CA2 (2, 0) GA 3 (3, 0) TAω (0, 1) AC 2ω (0, 2) AG3ω (0, 3) AT 1 + ω (1, 1) CC1 + 2ω (1, 2) CG 1 + 3ω (1, 3) CT2 + ω (2, 1) GC 2 + 2ω (2, 2) GG2 + 3ω (2, 3) GT 3 + ω (3, 1) TC3 + 2ω (3, 2) TG 3 + 3ω (3, 3) TT
Table X: Mapping between DNA Pair and elements of the Ring.
7) Algebraic Number Theory codes: Aforementioned all the methods include construction of DNA codes from classicalcoding theory and heuristic approaches works well for small length n. In this author constructed the DNA codes usingalgebraic number theory [86], making the first attempt, by using irreducible cyclic codes to built DNA codes with largen (< 1000) and number of codewords (M < 7000) satisfying the GC content constraint.
V. SOFTWARE TOOLS FOR DNA CODES GENERATION
There are different tools developed for designing the DNA codewords namely DNA sequence Generator [44] and evolutionaryalgorithm based program-PUNCH (Princeton University Nucleotide Computing Heuristic) [87] were used to find set ofdissimilar sequences. DNA sequence generator and compiler use graph based approach based on the overlapping sub-sequences.The GUI of the software allows the user to import the the DNA sequence to the sequence wizard with different parameters. It
Page 9 of 19
also calculate the melting temperature of the DNA sequences. It check for the reverse complement constraints and forbiddenDNA strands. However this software do not take care of secondary structure formation of the DNA strands. PUNCH is usedfor performing various DNA computing by bit set selection. It works on randomization by selecting three basic parameters N,B and V where N is number of bits in the problem, B is the number of nucleotides in each bit, and V is number of variationon each bit set.
Here the author has created a web application in which the user has to first select the mode which is either specific basedor range. For both modes, the user has to select the specific constraints he wants DNA codes to follow but for specific theinput has parameters n and d where n is the length of the codewords and d is the Hamming distance. For range, the inputparameters are n1 and n2 where n1 is the starting length and n2 is the ending length and also are d1 and d2 where d1 is thestarting Hamming distance and d2 is the ending Hamming distance. The web portal will display the number of codewords andalso those codewords that satisfy the constraint.
VI. BOUNDS ON DNA CODES
There are several bounds studied for DNA codes. This section comprehend the types of bounds obtained on set of constraints.In the Table XI, columns are ticked for which respective bounds on constraints are obtained. Note that n is the length, d isminimum Hamming distance and w is GC content of the DNA code. One can observe that almost all the methods haveobtained lower bounds on reverse constraint. But most important part is to investigate that there is no bounds on DNA codessatisfying the set of HD, GC, R and RC constraints altogether. Also there are very few attempts made to explore the boundson thermodynamic constraints. One can explore the methods which allows the formulation of bounds on the thermodynamicconstraints. More details on the bounds are described in Appendix A.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
AHD4 (n, d, w) X
AR4 (n, d, w) X X X X X X X X X X
ARC4 (n, d, w) X X
AGC4 (n, d, w) X X X X X
AR,RC4 (n, d, w)
ARC,GC4 (n, d, w) X X X X X X
AR,GC4 (n, d, w) X X
AR,RC,GC4 (n, d, w)
Table XI: Bounds on DNA codes: 1-Johnson type Bound [88] [55], 2-Halving Bounds [88] [55], 3-Gilbert-Type Bounds [55],4- Relation between AGC,RC
4 and AGC,R4 [55], 5- AGC
4 bound, 6- AGC,RC4 bound, 7- AGC
4 (n, d, w) bound, 8- Product Bounds[88] [55], 9- Relation between ARC
4 and AR4 [88], 10 to 16 - Relation between AGC
4 , ARC4 and AR
4 [14], 17-V. Phan bound[89] satisfying Hamming distance constraint.
VII. APPLICATIONS OF DNA CODES
DNA codes are used in various technologies like DNA computing [1], surface based DNA computation [90]–[93], DNAMicroarray technology [94], Molecular barcodes for chemical libraries [95], DNA nanostructures [96], [97], DNA origami
Page 10 of 19
[10], data encrytption [98], [99] [100]–[102], data storage [13], [103]–[106], signal processing [107] and DNA nanodevicesand circuits [108]. Recently, it has reported potential of DNA codes in phylogenetic studies [109]. Also it has contributionin understanding gene regulatory networks [110], protein coding genes [111], [112] and studying the structure of genes viacircular codes [113].
In DNA computing, DNA codes with specific properties are required to perform various parallel and logical operations [114].Molecular barcodes [115] generated from DNA codes are used as biomarkers for authentication of the products. DNA codesare used in creating DNA nanostructures that are used in potential applications like targeted drug delivery systems. DNA codeswith specific properties with high stability and robustness are required for nano structures which can be achieved by usingefficient encoding procedures for DNA codes. Recently, DNA is used in data hiding techniques for encryption of data moreeffectively. DNA codes used in this are designed as encryption or decryption keys. In last few years DNA based data storagesystems have [42] received attention by many researchers. DNA codes used for data storage must have feasible property thatachieve dense data storage capacity and better error correction capacity.
To use DNA codes for any application, fundamental constraints mentioned for DNA codes are unavoidable. DNA codedesign must follow the constraint for stability and robustness but to design the DNA codes for specific application, requiredconstraints must be added to DNA codes to make it more functional and practical. For instance, correlated and uncorrelatedconstraint was added to DNA codes for development random and re-writable DNA based data storage system. Looking atthe potential of DNA codes and advancement in the biotechnology methods, DNA codes promises application in emergingtechnologies.
VIII. DNA CODES TABLE
Many tables on lower bounds of DNA codes satisfying set of constraints are obtained. In Table VIII and XIII lower boundson DNA codes satisfying GC and reverse complement (RC) constraints are mentioned. In Table XIV lower bounds on DNAcodes satisfying Hamming distance and reverse complement constraint. These bounds are compiled from [68] [36] [52] [62][71] [86].
IX. FUTURE WORK AND CHALLENGES
DNA codes designing have received a great deal of attention by researchers in the last decade. In spite of different approachesproposed in the literature for the construction of DNA codes and constraints, it is still a challenge to design the optimal DNAcodes. Classifying the DNA codes for specific application has opportunities to use it for real applications. Though researchershave worked on improvement of the bounds of DNA codes satisfying the set of constraints, designing DNA codes satisfyingmaximum number of constraints achieving bound is still a huge challenge. Better codes with higher length and distance can bedesigned by using other algebraic methods or computational methods improving the bounds and obtaining the bounds for themissing n and d can be achieved. Defining DNA code as mathematical structure with the possible operation is interesting areato explore in which different operation can be defined on DNA nucleotides that satisfies the desired properties for computation.There are attempts to develop DNA codes satisfying the thermodynamic constraints though it is a challenge to develop theDNA codes that fits perfectly for the practical application of DNA strands. One of the important research problem is to workon the optimality condition of the DNA codes. To simulate the process of the DNA code designing and practical protocolsinvolved in the DNA computation, it can be automated by developing a platform where DNA codes can be designed andsimulated to check with the performance and accuracy. With emerging area of algebraic coding and biological coding theory,these challenges can be investigated and resolved.
REFERENCES
[1] L. M. Adleman, “Molecular computation of solutions to combinatorial problems,” Science, vol. 266, no. 5187, pp. 1021–1024, 1994.[2] C. C. Maley, “DNA computation: theory, practice, and prospects,” Evolutionary computation, vol. 6, no. 3, pp. 201–229, 1998.[3] C. Calude and G. Paun, Computing with cells and atoms: an introduction to quantum, DNA and membrane computing. CRC Press, 2000.[4] M. Garzon, P. Neathery, R. Deaton, R. C. Murphy, D. R. Franceschetti, and S. Stevens Jr, “A new metric for DNA computing,” in Proceedings of the
2nd Genetic Programming Conference, 1997, pp. 472–278.[5] R. Deaton, M. Garzon, R. Murphy, J. Rose, D. Franceschetti, and S. E. Stevens Jr, “Reliability and efficiency of a DNA-based computation,” Physical
Review Letters, vol. 80, no. 2, p. 417, 1998.[6] G. Rozenberg and A. Salomaa, “DNA computing: new ideas and paradigms,” in Automata, Languages and Programming. Springer, 1999, pp. 106–118.[7] E. Winfree, F. Liu, L. A. Wenzler, and N. C. Seeman, “Design and self-assembly of two-dimensional DNA crystals,” Nature, vol. 394, no. 6693, pp.
539–544, 1998.[8] E. Winfree, “Algorithmic self-assembly of DNA,” Ph.D. dissertation, California Institute of Technology, 1998.[9] N. C. Seeman, “DNA nanotechnology: novel DNA constructions,” Annual review of biophysics and biomolecular structure, vol. 27, no. 1, pp. 225–248,
1998.[10] P. W. Rothemund, “Folding DNA to create nanoscale shapes and patterns,” Nature, vol. 440, no. 7082, pp. 297–302, 2006.[11] M. Arita, “Writing information into DNA,” in Aspects of Molecular Computing. Springer, 2004, pp. 23–35.
Page 11 of 19
n/d
23
45
67
89
424
62
515
31
632
043
164
27
135
256
3511
21
852
812
828
222
29
1354
273
6519
82
110
6451
245
4286
021
054
178
211
1440
524
5747
711
737
145
1294
6176
5913
614
784
1848
924
8729
1213
1672
6327
376
3974
924
206
6223
1476
8768
1921
9211
878
3712
796
208
4915
1646
240
4118
2125
670
6648
1600
410
109
1613
1744
0032
9360
055
376
5542
413
856
6476
243
1726
3555
2065
8720
097
520
9745
012
864
6060
579
1844
9331
8411
2322
8869
9624
7387
7243
632
4363
226
9119
4710
2080
2364
7760
7387
7273
8772
9225
211
542
3678
2075
6760
576
1894
3206
411
8223
6811
8062
4073
8520
3685
0411
452
2190
2912
6418
8416
000
2257
3824
1412
068
1767
7245
112
1114
822
1060
2158
336
2650
4952
3222
6078
7256
4345
617
6772
3534
9688
424
2313
8451
3088
6702
2208
043
2646
4810
8166
2427
0369
467
6312
1691
8224
1772
7988
6336
4431
9794
176
3464
3654
443
3556
1621
6314
0054
0646
413
5161
625
2130
0369
664
1120
4500
480
1039
9676
1299
844
4160
0552
1039
9676
2599
688
2625
3215
7069
312
6330
3860
8384
3268
9356
861
8544
192
3865
6528
8320
4800
2080
1200
2715
8680
7889
9220
0574
4240
1148
8440
1148
8428
4206
1705
2487
6810
5154
2631
2192
5154
6808
3210
2623
4777
664
1396
736
1599
8771
262
7776
2926
2550
0086
272
6691
200
3853
632
3060
9973
8846
1056
015
2493
4612
6848
080
7665
6640
014
9011
4515
2093
1317
6480
2332
6094
4090
8001
6
Tabl
eX
II:
Low
erbo
unds
onD
NA
code
ssa
tisfy
ing
GC
and
Rev
erse
Com
plem
ent
Con
stra
ints
for
4≤n≤
30
and
2≤d≤
9
Page 12 of 19
n/d
1011
1213
1415
1617
1819
2021
2223
2425
2627
2829
304 5 6 7 8 9 10
211
21
124
22
1310
42
1421
84
22
1537
1820
32
116
8368
265
22
217
175
6230
124
22
118
407
133
4921
104
22
219
960
285
9939
188
42
21
2028
6876
617
977
3315
74
22
221
2926
847
364
8843
2211
63
22
122
2208
855
2217
474
3620
106
22
22
2342
968
1070
133
612
657
3116
84
22
21
2433
8016
8496
480
690
244
102
5127
147
42
22
225
6499
2216
2986
9614
0248
019
083
6523
126
42
22
126
5199
376
1299
844
848
2974
977
351
148
6738
2011
64
21
22
2710
0291
5025
0664
484
863
0819
2765
526
211
456
3218
95
41
12
128
1802
2633
620
0565
8415
3613
688
3987
1310
459
194
9350
2815
84
41
22
229
3877
7664
1938
8832
2929
282
4525
9989
835
315
577
4224
127
43
22
21
3091
1054
470
8168
7755
8760
6127
017
677
5426
1767
546
266
127
6536
2011
74
32
22
231
1502
6688
032
3005
3376
033
5833
9512
034
1166
8031
1035
2268
7716
7036
4537
5433
40
Tabl
eX
III:
Low
erbo
unds
onD
NA
code
ssa
tisfy
ing
GC
and
Rev
erse
Com
plem
ent
Con
stra
ints
for
4≤n≤
36
and
10≤d≤
30
Page 13 of 19
n/d
23
45
67
89
1011
1213
1415
1617
1819
204
326
25
116
324
26
512
6228
42
719
6819
642
122
28
8192
620
128
3016
22
928
87!
1952
346
8022
82
210
2786
!80
6420
1649
612
017
82
211
2677
!23
565
4832
607
136
4015
62
212
2734
!26
09!
3264
040
3220
1612
031
124
22
1365
536
6528
054
6920
1624
070
2410
42
214
>=
107
3264
052
3776
3264
081
9220
1651
212
032
84
22
1510
4755
210
4755
265
280
1638
440
3210
2411
864
176
32
216
>=
107
>=
107
8386
560
1305
6032
640
8192
480
120
120
325
22
217
>=
107
6528
016
384
6528
016
384
679
197
6864
124
22
218
>=
107
>=
107
5237
7620
9510
420
9150
416
384
8064
2016
143
120
2210
42
22
19>
=10
726
1888
1048
604
1310
7232
512
8128
1095
321
109
4219
84
22
220
>=
107
>=
107
>=
107
5237
7620
9510
410
4652
852
3776
3225
615
9820
1648
083
3516
74
22
2
Tabl
eX
IV:
Low
erbo
unds
onD
NA
code
ssa
tisfy
ing
Ham
min
gdi
stan
cean
dR
ever
seC
ompl
emen
tC
onst
rain
tsfo
r4≤n≤
20
and
2≤d≤
20
Page 14 of 19
[12] A. G. D’yachkov, P. L. Erdos, A. J. Macula, V. V. Rykov, D. C. Torney, C.-S. Tung, P. A. Vilenkin, and P. S. White, “Exordium for DNA codes,”Journal of Combinatorial Optimization, vol. 7, no. 4, pp. 369–379, 2003.
[13] D. Limbachiya and M. K. Gupta, “Natural data storage: A review on sending information from now to then via nature,” arXiv preprint arXiv:1505.04890,2015.
[14] A. Marathe, A. E. Condon, and R. M. Corn, “On combinatorial DNA word design,” Journal of Computational Biology, vol. 8, no. 3, pp. 201–219,2001.
[15] S. Hussini, L. Kari, and S. Konstantinidis, “Coding properties of DNA languages,” in DNA Computing. Springer, 2002, pp. 57–69.[16] M. K. Gupta, “The quest for error correction in biology,” Engineering in Medicine and Biology Magazine, IEEE, vol. 25, no. 1, pp. 46–53, 2006.[17] J. Watada et al., “DNA computing and its applications,” in Intelligent Systems Design and Applications, 2008. ISDA’08. Eighth International Conference
on, vol. 2. IEEE, 2008, pp. 288–294.[18] X. Wang, Z. Bao, J. Hu, S. Wang, and A. Zhan, “Solving the sat problem using a DNA computing algorithm based on ligase chain reaction,” Biosystems,
vol. 91, no. 1, pp. 117–125, 2008.[19] M. Darehmiraki, “A semi-general method to solve the combinatorial optimization problems based on nanocomputing,” International Journal of
Nanoscience, vol. 9, no. 05, pp. 391–398, 2010.[20] A. J. Hartemink, D. K. Gifford, and J. Khodor, “Automated constraint-based nucleotide sequence selection for DNA computation,” Biosystems, vol. 52,
no. 1, pp. 227–235, 1999.[21] J. Li, Q. Zhang, R. Li, and S. Zhou, “Optimization of DNA encoding based on combinatorial constraints,” ICIC Express Letters, vol. 2, no. 1, pp.
81–88, 2008.[22] R. Penchovsky and J. Ackermann, “DNA library design for molecular computation,” Journal of Computational Biology, vol. 10, no. 2, pp. 215–229,
2003.[23] E. B. Baum, “DNA sequences useful for computation,” in Proceedings of DNA-based Computers II, Princeton. In AMS DIMACS Series, vol. 44, 1999,
pp. 235–241.[24] A. G. D’yachkov, P. A. Vilenkin, I. K. Ismagilov, R. S. Sarbaev, A. Macula, D. Torney, and S. White, “On DNA codes,” Problems of Information
Transmission, vol. 41, no. 4, pp. 349–367, 2005.[25] O. Milenkovic and N. Kashyap, “On the design of codes for DNA computing,” in Coding and Cryptography. Springer, 2006, pp. 100–119.[26] M. H. Garzon, V. Phan, S. Roy, and A. J. Neel, “In search of optimal codes for DNA computing,” in DNA Computing. Springer, 2006, pp. 143–156.[27] L. P. Dinu and A. Sgarro, “A low-complexity distance for DNA strings,” Fundamenta Informaticae, vol. 73, no. 3, pp. 361–372, 2006.[28] A. D’yachkov, A. Macula, V. Rykov, and V. Ufimtsev, “DNA codes based on stem similarities between DNA sequences,” in DNA Computing. Springer,
2007, pp. 146–151.[29] J. Sager and D. Stefanovic, “Designing nucleotide sequences for computation: a survey of constraints,” in DNA Computing. Springer, 2006, pp.
275–289.[30] J. Sun, “Bounds on edit metric codes with combinatorial DNA constraints,” Master’s thesis, Brock University, 2010.[31] P. P. Debata, D. Mishra, K. Shaw, and S. Mishra, “A coding theoretic model for error-detecting in DNA sequences,” Procedia Engineering, vol. 38,
pp. 1773–1777, 2012.[32] D. Ashlock, S. K. Houghten, J. A. Brown, and J. Orth, “On the synthesis of DNA error correcting codes,” Biosystems, vol. 110, no. 1, pp. 1–8, 2012.[33] L. C. Faria, A. S. Rocha, J. H. Kleinschmidt, M. C. Silva-Filho, E. Bim, R. H. Herai, M. E. Yamagishi, and R. Palazzo Jr, “Is a genome a codeword
of an error-correcting code?” PloS one, vol. 7, no. 5, p. e36644, 2012.[34] Y. M. Chee and S. Ling, “Improved lower bounds for constant GC-content DNA codes,” Information Theory, IEEE Transactions on, vol. 54, no. 1,
pp. 391–394, 2008.[35] D. H. Smith, N. Aboluion, R. Montemanni, and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d and constant
GC-content,” Discrete Mathematics, vol. 311, no. 13, pp. 1207–1219, 2011.[36] D. Tulpan, D. H. Smith, and R. Montemanni, “Thermodynamic post-processing versus GC-content pre-processing for DNA codes satisfying the
Hamming distance and reverse-complement constraints,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 11,no. 2, pp. 441–452, 2014.
[37] M. A. Bishop, A. G. D’yachkov, A. J. Macula, T. E. Renz, and V. V. Rykov, “Free energy gap and statistical thermodynamic fidelity of DNA codes,”Journal of Computational Biology, vol. 14, no. 8, pp. 1088–1104, 2007.
[38] A. G. D’yachkov, A. J. Macula, W. K. Pogozelski, T. E. Renz, V. V. Rykov, and D. C. Torney, “A weighted insertion-deletion stacked pair thermodynamicmetric for DNA codes,” in DNA Computing. Springer, 2005, pp. 90–103.
[39] Q. Zhang, B. Wang, and X. Wei, “Evaluating the different combinatorial constraints in DNA computing based on minimum free energy,” MATCHCommun. Math. Comput. Chem, vol. 65, pp. 291–308, 2011.
[40] D. Tulpan, M. Andronescu, S. B. Chang, M. R. Shortreed, A. Condon, H. H. Hoos, and L. M. Smith, “Thermodynamically based DNA strand design,”Nucleic acids research, vol. 33, no. 15, pp. 4951–4964, 2005.
[41] Q. Zhang, B. Wang, X. Wei, X. Fang, and C. Zhou, “DNA word set design based on minimum free energy,” NanoBioscience, IEEE Transactions on,vol. 9, no. 4, pp. 273–277, 2010.
[42] S. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, “A rewritable, random-access DNA-based storage system,” arXiv preprint arXiv:1505.02199,2015.
[43] D. C. Tulpan, “Effective heuristic methods for DNA strand design,” Ph.D. dissertation, The University Of British Columbia, 2006.[44] U. Feldkamp, H. Rauhe, and W. Banzhaf, “Software tools for DNA sequence design,” Genetic Programming and Evolvable Machines, vol. 4, no. 2,
pp. 153–171, 2003.[45] U. Feldkamp, W. Banzhaf, H. Rauhe et al., “A DNA sequence compiler,” in Poster presented at Sixth International Meeting on DNA Based Computers,
Leiden, 2000.[46] U. Feldkamp, S. Saghafi, W. Banzhaf, and H. Rauhe, “DNAsequencegenerator: A program for the construction of DNA sequences,” in DNA Computing.
Springer, 2001, pp. 23–32.[47] J. M. Gray, T. G. Frutos, A. M. Berman, A. E. Condon, M. G. Lagally, L. M. Smith, and R. M. Corn, “Reducing errors in DNA computing by
appropriate word design,” University of Wisconsin, Department of Chemistry, 1996.[48] P. Hansen, N. Mladenovic, and J. A. M. Perez, “Variable neighbourhood search: methods and applications,” Annals of Operations Research, vol. 175,
no. 1, pp. 367–407, 2010.[49] S. Kawashimo, H. Ono, K. Sadakane, and M. Yamashita, “DNA sequence design by dynamic neighborhood searches,” in DNA Computing. Springer,
2006, pp. 157–171.[50] R. Montemanni and D. H. Smith, “Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm,” Journal of
Mathematical Modelling and Algorithms, vol. 7, no. 3, pp. 311–326, 2008.[51] S. Kawashimo, H. Ono, K. Sadakane, and M. Yamashita, “Dynamic neighborhood searches for thermodynamically designing DNA sequence,” in DNA
Computing. Springer, 2008, pp. 130–139.[52] R. Montemanni, D. Smith, and N. Koul, “Three metaheuristics for the construction of constant GC-content DNA codes,” Lecture Notes in Management
Science, vol. 6, pp. 167–175, 2014.[53] Q. Qiu, D. Burns, Q. Wu, and P. Mukre, “Hybrid architecture for accelerating DNA codeword library searching,” in Computational Intelligence and
Bioinformatics and Computational Biology, 2007. CIBCB’07. IEEE Symposium on. IEEE, 2007, pp. 323–330.
Page 15 of 19
[54] N. Bennenni, K. Guenda, and A. Gulliver, “Construction of codes for DNA computing by the greedy algorithm,” ACM Commun. Comput. Algebra,vol. 49, no. 1, pp. 14–14, Jun. 2015. [Online]. Available: http://doi.acm.org/10.1145/2768577.2768583
[55] O. D. King, “Bounds for DNA codes with constant GC-content,” Electron. J. Combin, vol. 10, no. 1, p. 33, 2003.[56] D. C. Tulpan, H. H. Hoos, and A. E. Condon, “Stochastic local search algorithms for DNA word design,” in DNA Computing. Springer, 2003, pp.
229–241.[57] R. Deaton, M. Garzon, R. Murphy, and J. R. Koza, “Genetic search of reliable encodings for DNA-based computation,” Late Breaking Papers at the
Genetic Programming 1996, pp. 9–15, 1996.[58] M. Arita and S. Kobayashi, “DNA sequence design using templates,” New Generation Computing, vol. 20, no. 3, pp. 263–277, 2002.[59] W. Liu, S. Wang, L. Gao, F. Zhang, and J. Xu, “DNA sequence design based on template strategy,” Journal of chemical information and computer
sciences, vol. 43, no. 6, pp. 2014–2018, 2003.[60] S. Kobayashi, T. Kondo, and M. Arita, “On template method for DNA sequence design,” in DNA Computing. Springer, 2003, pp. 205–214.[61] R. Selvakumar, “Unconventional construction of DNA codes: group homomorphism,” Journal of Discrete Mathematical Sciences and Cryptography,
vol. 17, no. 3, pp. 227–237, 2014.[62] P. Gaborit and O. D. King, “Linear constructions for DNA codes,” Theoretical Computer Science, vol. 334, no. 1, pp. 99–113, 2005.[63] N. Aboluion, D. H. Smith, and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d, constant GC-content and a
reverse-complement constraint,” Discrete Mathematics, vol. 312, no. 5, pp. 1062–1075, 2012.[64] L. Faria, A. Rocha, J. Kleinschmidt, R. Palazzo, and M. Silva-Filho, “DNA sequences generated by BCH codes over GF (4),” Electronics letters,
vol. 46, no. 3, pp. 202–203, 2010.[65] T. Abualrub, A. Ghrayeb, and X. N. Zeng, “Construction of cyclic codes over GF(4) for DNA computing,” Journal of the Franklin Institute, vol. 343,
no. 4, pp. 448–457, 2006.[66] J. Cannon, A. Steel et al., “The MAGMA computational algebra system,” Software available online (magma. maths. usyd. edu. au), 2005.[67] A. Heck and W. Koepf, Introduction to MAPLE. Springer-Verlag New York, 1993, vol. 1993.[68] A. Niema, “The construction of DNA codes using a computer algebra system,” Ph.D. dissertation, PhD Thesis, University of Glamorgan, 2011.[69] A. S. L. Rocha, L. C. B. Faria, J. H. Kleinschmidt, R. Palazzo Jr, and M. C. Silva-Filho, “DNA sequences generated by Z4 linear codes,” in Information
Theory Proceedings (ISIT), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1320–1324.[70] B. Feng, S. Bai, B. Chen, and X. Zhou, “The constructions of DNA codes from linear self-dual codes over Z4,” International Conference on Computer
Information Systems and Industrial Applications, 2015.[71] Z. Varbanov, T. Todorov, and M. Hristova, “A method for constructing DNA codes from additive self-dual codes over GF (4).” in Proc. CAIM conference,
Romania, vol. 40, 2014.[72] S. E. Oztas and I. Siap, “Lifted polynomials over F16 and their applications to DNA codes,” Filomat, vol. 27, no. 3, pp. 459–466, 2013.[73] B. Yildiz and I. Siap, “Cyclic codes over F2[u]/(u4 − 1) and applications to DNA codes,” Computers & Mathematics with Applications, vol. 63,
no. 7, pp. 1169–1176, 2012.[74] K. Guenda, T. A. Gulliver, and P. Sole, “On cyclic DNA codes,” in Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on.
IEEE, 2013, pp. 121–125.[75] J. Liang and L. Wang, “On cyclic DNA codes over F2 + uF2,” Journal of Applied Mathematics and Computing, pp. 1–11, 2015.[76] S. Pattanayak and A. K. Singh, “On cyclic DNA codes over the ring Z4 + uZ4,” arXiv preprint arXiv:1508.02015, 2015.[77] F. Ma, Y. Cao, and J. Gao, “On cyclic DNA codes over F4[u]/(u2 +1),” International Journal of Research and Reviews in Applied Sciences, vol. 24,
no. 3, p. 101, 2015.[78] S. Zhu and X. Chen, “Cyclic DNA codes over F2 + uF2 + vF2 + uvF2,” arXiv preprint arXiv:1508.07113, 2015.[79] A. Bayram, E. S. Oztas, and I. Siap, “Codes over F4 + vF4 and some DNA applications,” Designs, Codes and Cryptography, pp. 1–15, 2015.[80] B. Srinivasulu and M. Bhaintwal, “Reversible cyclic codes over F4 +uF4 and their applications to DNA codes,” in 2015 7th International Conference
on Information Technology and Electrical Engineering (ICITEE). IEEE, 2015, pp. 101–105.[81] N. Bennenni, K. Guenda, and S. Mesnager, “New DNA cyclic codes over rings,” arXiv preprint arXiv:1505.06263, 2015.[82] I. Siap, T. Abualrub, and A. Ghrayeb, “Cyclic DNA codes over the ring F2[u]/(u2 − 1) based on the deletion distance,” Journal of the Franklin
Institute, vol. 346, no. 8, pp. 731–740, 2009.[83] H. Mostafanasab and A. Y. Darani, “On cyclic DNA codes over F2 + uF2 + u2F2,” arXiv preprint arXiv:1603.05894, 2016.[84] K. Guenda and T. A. Gulliver, “Construction of cyclic codes over F2 + uF2 for DNA computing,” arXiv preprint arXiv:1207.3385, 2012.[85] A. Dertli and Y. Cengellenmis, “On cyclic DNA codes over the rings z {4}+ wz {4} and z {4}+ wz {4}+ vz {4}+ wvz {4},” arXiv preprint
arXiv:1605.02968, 2016.[86] H. Hong, L. Wang, H. Ahmad, J. Li, Y. Yang, and C. Wu, “Construction of DNA codes by using algebraic number theory,” Finite Fields and Their
Applications, vol. 37, pp. 328–343, 2016.[87] A. J. Ruben, S. J. Freeland, and L. F. Landweber, “Punch: An evolutionary algorithm for optimizing bit set selection,” in DNA Computing. Springer,
2001, pp. 150–160.[88] Z. Ignatova, I. Martinez-Perez, and K.-H. Zimmermann, DNA computing models. Springer Science & Business Media, 2008.[89] Q. Zhang and B. Wang, “On the bounds of DNA coding with h-distance,” MATCH Commun. Math. Comput. Chem, vol. 66, pp. 371–380, 2011.[90] L. M. Smith, R. M. Corn, A. E. Condon, M. G. Lagally, A. G. Frutos, Q. Liu, and A. J. Thiel, “A surface-based approach to DNA computation,”
Journal of computational biology, vol. 5, no. 2, pp. 255–267, 1998.[91] A. G. Frutos, Q. Liu, A. J. Thiel, A. M. W. Sanner, A. E. Condon, L. M. Smith, and R. M. Corn, “Demonstration of a word design strategy for DNA
computing on surfaces,” Nucleic Acids Research, vol. 25, no. 23, pp. 4748–4757, 1997.[92] Q. Liu, L. Wang, A. G. Frutos, A. E. Condon, R. M. Corn, and L. M. Smith, “DNA computing on surfaces,” Nature, vol. 403, no. 6766, pp. 175–179,
2000.[93] H. Wu, “An improved surface-based method for DNA computation,” Biosystems, vol. 59, no. 1, pp. 1–5, 2001.[94] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,”
Science, vol. 270, no. 5235, pp. 467–470, 1995.[95] S. Brenner and R. A. Lerner, “Encoded combinatorial chemistry.” Proceedings of the National Academy of Sciences, vol. 89, no. 12, pp. 5381–5383,
1992.[96] J. Reif, H. Chandran, N. Gopalkrishnan, and T. LaBean, “Self-assembled DNA nanostructures and DNA devices,” Nanofabrication Handbook, pp.
299–328, 2012.[97] Y. Yang, Y. Liu, and H. Yan, “DNA nanostructures as programmable biomolecular scaffolds,” Bioconjugate chemistry, 2015.[98] G. Jacob and A. Murugan, “DNA based cryptography: An overview and analysis,” Int J Emerg Sci, vol. 3, no. 1, pp. 27–36, 2013.[99] D. Tulpan, C. Regoui, G. Durand, L. Belliveau, and S. Leger, “Hyden: a hybrid steganocryptographic approach for data encryption using randomized
error-correcting DNA codes,” BioMed research international, vol. 2013, 2013.[100] A. Aich, A. Sen, S. R. Dash, and S. Dehuri, “A symmetric key cryptosystem using DNA sequence with OTP key,” in Information Systems Design and
Intelligent Applications. Springer, 2015, pp. 207–215.[101] X. Wei, L. Guo, Q. Zhang, J. Zhang, and S. Lian, “A novel color image encryption algorithm based on DNA sequence operation and hyper-chaotic
system,” Journal of Systems and Software, vol. 85, no. 2, pp. 290–299, 2012.
Page 16 of 19
[102] R. Enayatifar, A. H. Abdullah, and I. F. Isnin, “Chaos-based image encryption using a hybrid genetic algorithm and a DNA sequence,” Optics andLasers in Engineering, vol. 56, pp. 83–93, 2014.
[103] M. Arita, “Method for designing DNA codes used as information carrier,” May 27 2004, uS Patent App. 10/558,502.[104] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, pp. 1628–1628, 2012.[105] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information
storage in synthesized DNA,” Nature, vol. 494, no. 7435, pp. 77–80, 2013.[106] S. Yazdi, H. M. Kiah, E. R. Garcia, J. Ma, H. Zhao, and O. Milenkovic, “DNA-based storage: Trends and methods,” arXiv preprint arXiv:1507.01611,
2015.[107] S. Tsaftaris, A. K. Katsaggelos, T. N. Pappas, E. T. Papoutsakis et al., “DNA computing from a signal processing viewpoint,” Signal Processing
Magazine, IEEE, vol. 21, no. 5, pp. 100–106, 2004.[108] L. Qian and E. Winfree, “Scaling up digital circuit computation with DNA strand displacement cascades,” Science, vol. 332, no. 6034, pp. 1196–1201,
2011.[109] M. H. Garzon and T.-Y. Wong, “DNA chips for species identification and biological phylogenies,” Natural Computing, vol. 10, no. 1, pp. 375–389,
2011.[110] J. Dingel and O. Milenkovic, “Coding-theoretic methods for reverse engineering of gene regulatory networks,” in Information Theory Workshop, 2008.
ITW’08. IEEE. IEEE, 2008, pp. 114–118.[111] D. G. Arques and C. J. Michel, “A complementary circular code in the protein coding genes,” Journal of theoretical biology, vol. 182, no. 1, pp. 45–58,
1996.[112] D. G. Arques and C. J. Michel, “A code in the protein coding genes,” BioSystems, vol. 44, no. 2, pp. 107–134, 1997.[113] C. J. Michel, “A 2006 review of circular codes in genes,” Computers & Mathematics with Applications, vol. 55, no. 5, pp. 984–988, 2008.[114] M. H. Garzon, “Theory and applications of DNA codeword design,” in Theory and Practice of Natural Computing. Springer, 2012, pp. 11–26.[115] L. V. Bystrykh, “Generalized DNA barcode design based on Hamming codes,” PloS one, vol. 7, no. 5, p. e36852, 2012.
APPENDIX ABOUNDS ON DNA CODES
1) Johnson type Bound- This bound is derived by shortening the code to length n− 1. The code is modified by choosingthe codewords with a fix character b ∈ Zq at the ith position where i ∈ {1 . . . n} and deleting ith position from thecodewords.For 0 ≤ d ≤ n and 0 < w < n,
a) AGC4 (n, d, w) ≤ b 2nw A
GC4 (n− 1, d, w − 1)c [55].
b) AGC4 (n, d, w) ≤ b 2n
n−wAGC4 (n− 1, d, w)c [55].
c) AR4 (n, d) ≤ b 14A
R4 (n− 1, d)c [88].
2) Halving Bounds-a) This is motivated by the fact DNA code CDNA and reverse DNA code C R
DNA are disjoint. Let n ≥ 1 be an integer.
For each integer d with 0 < d ≤ n, AR4 (n, d) ≤ 1
2A4(n, d) [88].
b) For 0 < d ≤ n and 0 ≤ w ≤ n, AGC,RC4 (n, d, w) ≤ 1
2AGC4 (n, d, w) [88].
c) For 0 < d ≤ n and 0 ≤ w ≤ n,AGC,R4 (n, d, w) ≤ 1
2AGC4 (n, d, w) [88].
3) Gilbert-Type Bounds-AGC4 (n, d, w) is derived by dividing the total number of DNA strings of length n with GC-content
w by the number of DNA string with distance d− 1 from the fixed codeword.a) For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
AGC4 (n, d, w) ≥
(nw
)2w2n−w∑d−1
r=0
∑minbr/2c,w,n−wi=0
(wi
)(n−w
i
)(n−2ir−2i
)22i
[88].
b) For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
AGC4 (n, d, w) ≥
(nw
)2n∑d−1
r=0
∑minbr/2c,w,n−wi=0
(wi
)(n−w
i
)(n−2ir−2i
)22i
[55].
4) This bound is based on the aspect of GC-content of given DNA codeword is equal to reverse of DNA codeword [55].For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
a) AGC,RC4 (n, d, w) = AGC,R
4 (n, d, w) if n is even.
b) AGC,RC4 (n, d, w) ≤ AGC,R
4 (n+ 1, d+ 1, w) if n is odd.
c) AGC,R4 (n, d+ 1, w) ≤ AGC,R
4 (n, d, w) ≤ AGC,R4 (n, d− 1, w) if n is odd.
5) By using AGC4 (n, d, w) ≥ A2(n, d, w) ·A2(n, d) inequality, following bound is derived [55].
For 0 ≤ w ≤ n, AGC4 (n, 2, w) =
(n
w
)2n−1.
Page 17 of 19
6) By using Halving bound AGC,RC4 (n, d, w) ≤ 1
2AGC4 (n, d, w) for d = 2 one can calculate the bound [55].
For 0 ≤ w ≤ n and n is even,
AGC,RC4 (n, 2, w) =
(n
w
)2n−2.
7) Bound AGC4 (n, d, w) is computed by dividing the total number of words with GC-content w that are at distance at least
d from their reverse-complements by the number of these codewords that are at distance at most d− 1 from any fixedcodeword [55].
For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
AGC4 (n, d, w) ≥
∑nr=d V (n, d, r)
2∑d−1
r=0
∑minbr/2c,w,n−wi=0
(wi
)(n−w
i
)(n−2ir−2i
)22i
8) Product Bounds - This is based on the construction of the DNA code AGC4 (n, d, w) with length n, minimum Ham-
ming distance d and GC-content w from binary constant-weight codes A2(n, d, w) and ternary constant-weight codesA3(n, d, w) with length n, Hamming weight w and minimum Hamming distance d [55] [88].For 0 ≤ d ≤ n and 0 ≤ w ≤ n,
a) AGC4 (n, d, w) ≥ A2(n, d, w) ·A2(n, d)
b) AGC,R4 (n, d, w) ≥ AR
2 (n, d, w) ·A2(n, d)
c) AGC,R4 (n, d, w) ≥ A2(n, d, w) ·AR
2 (n, d)
d) AGC4 (n, d, w) ≥ A3(n, d, w) ·A2(n− w, d)
e) AGC,R4 (n, d, w) ≥ AR
3 (n, d, w) ·A2(n− w, d)
f) AGC,R4 (n, d, w) ≥ A3(n, d, w) ·AR
2 (n− w, d)
g) AR4 (n, d) ≥ AR
2 (n, d) ·A2(n, d).
9) Reverse and Reverse complement codes bounds - This can be simple observed by the reverse and reverse complementproperty of the DNA codeword [88] [14].Let n ≥ 1 be an integer,
a) If n is even, then ARC4 (n, d) = AR
4 (n, d) and
b) If n is odd, then ARC4 (n, d) ≤ AR
4 (n+ 1, d+ 1)
10) Special cases - Bounds are observed by considering different combination of length n and GC-content w [14].For n > 0, with 0 ≤ d ≤ n and 0 ≤ w ≤ n,
a) AGC4 (n, d, 0) = A2(n, d)
b) AGC4 (n, d, w) = AGC
4 (n, d, n− w)
c) AGC4 (n, n,w) = 4 if w = n/2
3 if n ≤ w < n/2 or n/2 < w ≤ 2n/32 if w < n/3 or w > 2n/3
d) AGC,RC
4 (n, n,w) = {2 if w = n/21 if w 6= n/2
}e) AGC
4 (n, 1, w) =
(n
w
)2n
Page 18 of 19
11) Bounds on reverse code for d = 3 - This bound is computed from the concept of sphere-packing bound for codes [14].
ARq (n, 3) ≤
qdn/2e∑i = 2bn/2c
(bn/2ci
)(q − 1)i
2(1 + 4(q − 2) + (n− 4)(q − 1)).
12) Bounds on reverse code of size S - By using Greedy approach to calculate size of the code, bound is derived [14].
Let V (s, d) be the number of words of S such that they have distance d from s where sεS.
a) ARq (n, d) ≤ |S|
2V −(b(d− 1)/2c)where V −(d) = min{V (s, d)|sεS}.
b) ARq (n, d) ≥ |S|
2V +d− 1where, V +(d) = max{V (s, d)|sεS}.
13) Bounds on reverse code for d = 2 - Bounds on reverse code for d = 2 is obtained by claims• Any two words from the same subset differ in at least two positions ie. dH(xDNA, yDNA) ≥ 2 ∀ xDNA, yDNA ∈ CDNA.• If a word belongs to a subset, its reversal is also in the same subset ie. if xDNA ∈ CDNA then xR
DNA ∈ CDNA
• All the qn/2 palindromes are in the same subset.
a) ARq (n, 2) =
qn−1
2, for even n and qε{2, 4}, and
b) ARq (n, 2) =
qn−1 − qbn/2c
2, for odd n and qε{2, 4}
14) Doubling Construction - This is motivated by the minimum Hamming distance between the DNA codeword, itsreverse codeword and revere complement. It is observed from the property of DNA code with dH(xDNA, yDNA) ≥ d,HDNA(xR
DNA, yDNA) ≥ d, HDNA(xDNA, yCDNA) ≥ d and HDNA(xDNA, yRC
DNA) ≥ d
For n ≥ 2, AR2 (2n, 2n−1) = 2n [14].
15) a) Bounds on even and odd length n reverse code - In this bounds, maximum size of reverse code of even and oddlength n and relationship between reverse code of even and odd length n− 1 is demonstrated [14].
ARq (n− 1, d) ≤ AR
q (n, d) ≤ ARq (n, d− 1) and
b) ARq (n− 1, d) ≥ AR
q (n, d)/q, for odd n.16) Bounds on Hamming distance constraint- V.Phan provided lower and upper bounds of DNA codeword sets which satisfy
the h-distance constraint [89].4n−d+1
d(
nd−1) ≤ |S| ≤ 4n
Page 19 of 19