

The Art of DNA Strings: Sixteen Years of DNA Coding Theory

Dixita Limbachiya, Bansari Rao and Manish K. Gupta
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar, Gujarat, 382007 India
Email: [email protected], bansari s [email protected] and [email protected]

Abstract

The idea of computing with DNA was proposed by Tom Head in 1987; however, the first successful DNA computing experiment was performed by Adleman and reported in a seminal paper in 1994. DNA hybridization is at the heart of DNA computing, but it is also the source of errors. The success of DNA computing therefore depends on error control techniques. Classical coding theory has provided the foundation for current information and communication technology (ICT), so it is natural to expect that coding theory will be a foundational subject for the DNA computing paradigm as well. Successful DNA computing experiments usually require DNA strings that are sufficiently dissimilar. This leads to the construction of large sets of DNA strings that satisfy certain combinatorial and thermodynamic constraints. Over the last 16 years, many approaches (combinatorial, algebraic and computational) have been used to construct such DNA strings. In this work, we survey this area of DNA coding theory by presenting its key ideas and the currently known results.

I. INTRODUCTION

Figure 1: DNA structure with its four nucleotides A-Adenine, G-Guanine, C-Cytosine and T-Thymine. These are the basic building blocks of DNA, which are held together by hydrogen bonds in a double helix. Each DNA base is paired with its complementary base: A (yellow band) is connected to its complementary base T (green band), while C (blue band) is paired with G (red band).

Information and Communication Technology (ICT) has come a long way in the last 60 years. We now have a connected world of devices talking to each other and performing computation. This is possible due to advances in coding and information theory. In 1994, two new directions in ICT emerged, viz. quantum computing and DNA computing. The heart of any computing or communication system is coding theory. Thus two new directions in coding theory also emerged: quantum coding theory and DNA coding theory. This paper focuses on the latter area, giving a comprehensive overview of the techniques and challenges of DNA coding theory from the last 16 years. In 1994, L. Adleman performed a computation using DNA strands to solve an instance of the Hamiltonian path problem, giving birth to DNA computing [1]–[4]. DNA is a double strand built by pairing the four basic building units A (Adenine), C (Cytosine), G (Guanine) and T (Thymine), which are called nucleotides. The DNA strand is held together by complementary base pairing, which connects the Watson-Crick complementary bases with each other, denoted by A^C = T and G^C = C. The backbone of DNA is an alternating chain of sugar and phosphate. DNA is an ideal medium for computing because it is stable, dense and self-replicating [5]. After the successful experiment by Adleman, the area of DNA computing [6] flourished in different directions such as DNA tile assembly [7] [8], building of DNA nano-structures [9] [10], studying error correcting properties in DNA sequences [11] [12] and DNA based data storage systems [13]. But it was only the identification of mathematical properties in DNA sequences [14] [15] that inspired many researchers to explore this new amalgamation of biology and coding theory [16]. DNA computing [17] promises to solve many NP-complete problems, such as the Hamiltonian path problem, the satisfiability problem (SAT) [18] and many others, with massive parallelization [19]. To perform computation using DNA strands, a specific set of DNA sequences with particular properties is required. Such sets of DNA strands satisfying various constraints [20], [21] play an important role in biomolecular computation [22] [23].

The aim of this manuscript is to elucidate the current state of the art of DNA codes (shown in Figure 2), the constraints on DNA codes and especially the new methods for constructing DNA codes from algebraic coding. It summarizes the research on DNA codes, which may help researchers get a broad picture of the area. It also gives a brief overview of the applications of DNA codes.

The paper is organized as follows. Sections II and III discuss the DNA code design problem and the constraints on DNA codes. Section IV describes various approaches for the construction of DNA codes. Section V gives a brief description of software tools for DNA sequence design. Section VI describes bounds on DNA codes. Section VII gives a brief note on applications of DNA codes. Section VIII shows results on DNA codes for 0 ≤ n ≤ 36 and 0 ≤ d ≤ 20. Section IX concludes the paper with general remarks.

arXiv:1607.00266v1 [cs.IT] 1 Jul 2016


II. DNA CODE DESIGN PROBLEM

To perform a DNA computation, DNA strands react with each other by Watson-Crick base pairing and form perfect matches. In some situations, however, DNA strands may not form perfect base pairs and may react in an undesirable manner. One such situation is the formation of a secondary structure, in which the first half of a DNA strand is complementary to its own second half, forming a hairpin-like structure. This kind of structure interrupts the desired computation. Such secondary structures have to be avoided by designing the DNA strands carefully. DNA strands may also bind to other DNA strands, forming complementary base pairs with only a few bases and thereby creating an error. One way to avoid this is to ensure that every two DNA strands differ in more than d locations, where d depends on the application of the DNA computation. This property can be obtained by defining different distances between two DNA strands, such as the Hamming distance, Lee distance, edit distance, deletion distance etc.

The DNA code design problem [24]–[26] is to develop a set of DNA codewords of length n over the DNA alphabet Σ_DNA = {A, T, G, C} with predefined distance d [27] [28] that satisfies the maximum set of constraints [29]. The main objective is to find the largest possible set of M codewords of length n over the alphabet Σ_DNA = {A, C, G, T} that is feasible with respect to the set of constraints [30] with distance d, which enables the construction of better error correcting codes [31]–[33].

Figure 2: DNA codes timeline: the state of the art of DNA codes from 1994 to 2016. Different works are listed with their authors in chronological order.

Definition (DNA code). A DNA code C_DNA(n, M, d) ⊂ Σ_DNA^n, where Σ_DNA = {A, T, G, C}, is a set of M DNA codewords, each of length n, with minimum distance d. Here A denotes Adenine, T denotes Thymine, G denotes Guanine and C denotes Cytosine, the nucleotides in DNA.

III. CONSTRAINTS ON DNA CODES

There are different types of constraints that DNA codes must satisfy. These constraints can be classified into three categories:

1) Combinatorial constraints
2) Thermodynamic constraints
3) Application oriented constraints

The constraints which DNA codes should satisfy are listed below:

1) Hamming distance constraint (n, d, w) - The Hamming distance constraint is d_H(x_DNA, y_DNA) ≥ d for all distinct x_DNA, y_DNA ∈ C_DNA, for some Hamming distance d [14]. A set of codewords of length n and size M with minimum Hamming distance d satisfying the Hamming constraint is denoted by C_DNA(n, M, d). The Hamming distance between two DNA codewords is the number of positions at which they differ. For C_DNA(n, M, d), the minimum of the pairwise distances is taken as the Hamming distance of the code. A_q(n, d) denotes the maximum size of a code with codewords of length n and distance d over an alphabet of size q; for DNA codewords q = 4.

Example 1. Let x_DNA = ATGACT and y_DNA = ACTAGC. Then d_H(x_DNA, y_DNA) = 4, so x_DNA, y_DNA ∈ C_DNA satisfy the Hamming distance constraint with d_H(x_DNA, y_DNA) ≥ 3.

2) Reverse constraint (n, d) - The reverse constraint is H_DNA(x^R_DNA, y_DNA) ≥ d ∀ x_DNA, y_DNA ∈ C_DNA, where x^R_DNA denotes the reverse of x_DNA and H_DNA the Hamming distance. A code satisfying the reverse constraint is called a reverse code. A^R_q(n, d) denotes the maximum size of a reverse code with length n and minimum Hamming distance d.


Figure 3: DNA code constraints classified into three categories: combinatorial, thermodynamic and application oriented constraints.

Example 2. Let x_DNA = ATGACT, x^R_DNA = TCAGTA and y_DNA = ATACAT. For n = 6 and d = 3, H_DNA(x^R_DNA, y_DNA) = 5 ≥ 3, so x_DNA and y_DNA belong to a reverse code.

3) Reverse complement constraint (n, d) - The RC-constraint is H_DNA(x^R_DNA, y^C_DNA) ≥ d ∀ x_DNA, y_DNA ∈ C_DNA, where y^C_DNA denotes the complement of y_DNA. A DNA code satisfying the RC-constraint is called a reverse complement code. A^RC_q(n, d) denotes the maximum size of a reverse complement code with length n and minimum Hamming distance d [14].

Example 3. Let x_DNA = ATGACT, x^R_DNA = TCAGTA and y_DNA = GTACAC, y^C_DNA = CATGTG. For n = 6 and d = 3, H_DNA(x^R_DNA, y^C_DNA) = 4 ≥ 3, so x_DNA and y_DNA belong to a reverse complement code.

4) GC-content constraint (n, d, w) - The set of codewords with length n, distance d and GC weight w, where w is the total number of Gs and Cs present in the DNA strand, viz. w_{x_DNA} = |{x_i : x_DNA = (x_i), x_i ∈ {C, G}}| [34]–[36]. Generally w = ⌊n/2⌋.

Example 4. Let x_DNA = ATTGCT. Its GC weight is 2, so x_DNA ∉ C_DNA for n = 6 and w = 3.

5) Melting temperature constraint (n, d, w) - The melting temperature T_m is the temperature at which half of the DNA strands are hybridized and half are not. Melting is the opposite of hybridization: the two strands separate. It is advantageous for the codewords to have similar melting temperatures, as this enables the simultaneous hybridization of multiple DNA strands. Hence we select codewords with similar melting temperature, calculated by Nussinov's algorithm [25]. All x_DNA ∈ C_DNA should have identical melting temperature [29].

6) Thermodynamic constraint (n, d, w) - For DNA stability, all the codewords of a DNA code should have comparable free energy ΔG° [37]–[39] above some threshold [40], [41]. DNA with minimum free energy is more stable, and hence we consider only those DNA codewords whose free energies are approximately equal within the code. For any DNA codewords x_DNA, y_DNA ∈ C_DNA, the free energies ΔG° must satisfy

|ΔG°(x_DNA) − ΔG°(y_DNA)| ≤ δ

where δ > 0 is a constant.

7) Uncorrelated-correlated constraint (n, d, w) - A codeword ∈ C_DNA, if shifted by x units where x ≤ n, should not match any other codeword ∈ C_DNA [42].

Example 5. Let X_DNA = CATCATC and Y_DNA = ATCATCGG. Then X ◦ Y = 0100100, as depicted below (each row shifts Y one position further to the right; the bit is 1 when the overlapping parts agree).

X = C A T C A T C
Y = A T C A T C G G               0
Y =   A T C A T C G G             1
Y =     A T C A T C G G           0
Y =       A T C A T C G G         0
Y =         A T C A T C G G       1
Y =           A T C A T C G G     0
Y =             A T C A T C G G   0
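The shift-and-compare procedure of Example 5 can be sketched in a few lines of Python. This is only an illustration; the function name `correlation` is an assumption and does not come from [42].

```python
def correlation(x: str, y: str) -> str:
    """Return X o Y as a bit string of length len(x).

    Bit i is 1 when y, shifted i positions to the right under x,
    agrees with x on every overlapping position.
    """
    bits = []
    for shift in range(len(x)):
        # zip stops at the shorter string, so only the overlap is compared
        overlap = zip(x[shift:], y)
        bits.append("1" if all(a == b for a, b in overlap) else "0")
    return "".join(bits)


print(correlation("CATCATC", "ATCATCGG"))  # 0100100, as in Example 5
```

A codeword pair violates the uncorrelated-correlated constraint whenever this bit string contains a 1 for a shift that the application does not tolerate.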


The constraints (1) to (3) are used to avoid undesirable hybridization between different DNA strands, and (4) to (6) are the constraints which ensure that all the codewords have similar thermodynamic characteristics, so that the computation is uniform.
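The combinatorial quantities used in Examples 1 to 4 are easy to reproduce. The following is a minimal sketch; the helper names (`hamming`, `reverse`, `complement`, `gc_content`) are illustrative and not taken from the surveyed papers.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))


def reverse(x: str) -> str:
    """x^R: the word read backwards."""
    return x[::-1]


def complement(x: str) -> str:
    """x^C: Watson-Crick complement of every base."""
    return "".join(COMPLEMENT[b] for b in x)


def gc_content(x: str) -> int:
    """GC weight: number of G and C symbols in the word."""
    return sum(b in "GC" for b in x)


print(hamming("ATGACT", "ACTAGC"))                       # 4 (Example 1)
print(hamming(reverse("ATGACT"), "ATACAT"))              # 5 (Example 2)
print(hamming(reverse("ATGACT"), complement("GTACAC")))  # 4 (Example 3)
print(gc_content("ATTGCT"))                              # 2 (Example 4)
```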

IV. VARIOUS APPROACHES FOR THE DNA CODES CONSTRUCTIONS

There are different approaches [43] for the construction of DNA codes with finite length n, defined distance d and a set of constraints chosen with respect to the application. The set of constraints essential for a DNA code depends on the application. There exist algorithmic, theoretical and software simulation [44] [45] [46] approaches to the design of DNA codes. A DNA code is considered optimal when it is constructed so that every codeword in the set satisfies the maximum number of constraints, for a large value of n and a large minimum distance d, with minimum errors in the DNA computation [47].

Figure 4: DNA code construction methods are classified as search algorithms, algebraic coding methods and software simulations. Search algorithms include variable neighborhood search, simulated annealing and stochastic local search. The template based method uses a template to design a DNA codeword. The algebraic coding method is the construction of DNA codes by using algebraic structures like fields and rings. The altruistic approach is the construction of DNA codewords in an altruistic way in this work.

1) Variable Neighborhood Search Approach: This approach employs different local search algorithms to search for DNA codes [48]–[51]. Some of the search algorithms are described here.

a) Seed Building (SB): The seed building algorithm examines all possible codewords in random order with respect to the seed codewords, where the seed codewords are an initial set of codewords satisfying the required constraints. For more details on the algorithm, see [52]. The drawback of the approach is that it is inefficient for larger values of n because of the computation and time needed to develop a feasible set of codewords.
Example 6. Consider n = 3, d = 1 and GC-content w = 2, and let the initial seed codewords with d = 1 and w = 2 be AGG and AGC. Then the resulting codewords at the first iteration under the HD and GC-content constraints are GGC, GCA, GTC, CCG, CGC, CTC, TGG. Note that ACC, GCT, CAG are also codewords with GC-content 2 but at distance d = 3, so they will be added in the next iteration for seed d = 3.

b) Clique Search: In this method a random subset of the codewords of a given code is removed, leaving a partial code. All codewords are generated and those compatible with the ones left in the code are identified. A graph is then built in which the identified codewords are nodes and an edge exists between compatible codewords.
Example 7. Let n = 4, d = 3 and w = 2, and let the partial code be CTTC, CGAA, TGGT, GTGA with the HD, RC and GC constraints. The clique search will result in the codewords CACT, GCTT, AGTG and AAGC.

c) Hybrid Search: This method combines the two approaches of seed building and clique search. It uses seed building to generate the partial code and clique search to search for the best codewords [53].

d) Greedy Approach: These algorithms remove the worst DNA codeword from the set at each iteration. The problem with this approach is that it does not always produce the best result: in the process of removing the worst codeword at each stage, it may remove a potentially good codeword at an early stage and may fail to remove a bad codeword at a later stage [54].
Example 8. Let n = 3, d = 2 and w = 2. Then the codewords from the greedy search are AGG, ACC, GGC, GCA, CCG, CTC.

e) Lexicographic Approach: These algorithms take into consideration a particular ordering of the DNA codewords. It may be alphabetical order, ascending order or any other ordered arrangement, such that the DNA codewords satisfying the DNA constraints are visited in a specific order [55].
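To make the idea of an ordered search concrete, the sketch below scans all words in lexicographic order and keeps a word whenever it has the required GC-content and is far enough from every word already kept. This is a generic lexicode-style greedy construction written for illustration, not the specific algorithm of [54] or [55]; the function name `lexicode` and the alphabet order A < C < G < T are assumptions.

```python
from itertools import product


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def gc_content(x):
    return sum(b in "GC" for b in x)


def lexicode(n, d, w):
    """Greedy scan in lexicographic order over {A, C, G, T}^n.

    A word is accepted if its GC-content is exactly w and its Hamming
    distance to every previously accepted word is at least d.
    """
    code = []
    for letters in product("ACGT", repeat=n):
        word = "".join(letters)
        if gc_content(word) != w:
            continue
        if all(hamming(word, c) >= d for c in code):
            code.append(word)
    return code


code = lexicode(n=4, d=3, w=2)
print(len(code), code)
```

The scan is exhaustive over 4^n words, so it is only practical for small n. It also checks only the Hamming and GC-content constraints; reverse and reverse complement checks would be added inside the acceptance test.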

2) Simulated Annealing Approach: Simulated annealing is a metaheuristic algorithm derived from thermodynamic principles [52]. These algorithms work on a set of codewords in which not all the codewords satisfy the specified constraints. They attempt to change the codewords with the objective of reducing the constraint violations and making the set feasible. A feasible solution is obtained when there is no violation of the constraints.

3) Stochastic Local Search Approach: We search for codewords with parameters (n, d, w = n/2) such that the length of the codewords is n, the GC-content is ⌊n/2⌋ and the minimum Hamming distance is at least d. These algorithms work on a random set of codewords while solving the problem. They generally start from an initial set of k random codewords and then remove those which do not satisfy the constraints. For more details on the algorithm, the reader can refer to [56].
Example 9. Let there be k = 16 random codewords, so that the set is AA, AT, AG, AC, GA, GT, GC, GG, CT, CA, CC, CG, TA, TC, TG, TT. The codewords with n = 2, d = 2 and w = 1 are GC, AG, CA, TC.
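The following is a deliberately simplified sketch of a stochastic repair loop in this spirit. It is not the algorithm of [56]: instead of removing a violating codeword it replaces one word of a randomly chosen conflicting pair by a fresh random word, and it checks only the Hamming and GC-content constraints. All names and the default parameters (`max_steps`, `seed`) are assumptions made for the illustration.

```python
import random
from itertools import combinations


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def random_word(n, w, rng):
    """Random word of length n with exactly w symbols from {G, C}."""
    gc_positions = set(rng.sample(range(n), w))
    return "".join(
        rng.choice("GC") if i in gc_positions else rng.choice("AT")
        for i in range(n)
    )


def conflicts(code, d):
    """Index pairs (i, j), i < j, whose words are closer than d."""
    return [
        (i, j)
        for i, j in combinations(range(len(code)), 2)
        if hamming(code[i], code[j]) < d
    ]


def local_search(n, d, w, k, max_steps=10_000, seed=0):
    """Try to find k words of length n, GC-content w, pairwise distance >= d."""
    rng = random.Random(seed)
    code = [random_word(n, w, rng) for _ in range(k)]
    for _ in range(max_steps):
        bad = conflicts(code, d)
        if not bad:
            return code
        i = rng.choice(rng.choice(bad))   # one word of a random conflicting pair
        code[i] = random_word(n, w, rng)  # replace it by a fresh random word
    return code if not conflicts(code, d) else None


print(local_search(n=6, d=3, w=3, k=4))
```

The function returns `None` when no feasible set is found within `max_steps`, so a caller can retry with a different seed or a smaller k.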

4) Genetic Algorithm: In this work, a genetic algorithm is used to search for efficient and reliable codes by minimizing the mis-hybridization error. The codewords generated were unique in terms of Hamming distance and satisfy the Hamming bound [57].

5) Template Based Method: This method, introduced in [58]–[60], involves a two step process. First a template is designed, and then the template is combined with the codewords of a binary error correcting code to produce DNA codewords. This method is not optimal because the code size is limited by the size of the error correcting code used and by the selection of the template. The mapping of the template to the codewords depends on the specific application.
Example 10. Let the template be t = 1001101 and a codeword be c = 0010111. Suppose the map is defined by 11 → A, 10 → T, 01 → G and 00 → C, where the first bit comes from the template t (1 = [AT] and 0 = [GC]) and the second bit from the corresponding position of the codeword c. Then the resulting DNA codeword is TCGTAGA.
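The position-by-position combination used in Example 10 can be written out as follows. This is a small sketch of the mapping step only; the names `PAIR_TO_BASE` and `apply_template` are illustrative.

```python
# (template bit, codeword bit) -> nucleotide, as in Example 10
PAIR_TO_BASE = {
    ("1", "1"): "A",
    ("1", "0"): "T",
    ("0", "1"): "G",
    ("0", "0"): "C",
}


def apply_template(template: str, codeword: str) -> str:
    """Combine a binary template and a binary codeword into a DNA word."""
    if len(template) != len(codeword):
        raise ValueError("template and codeword must have the same length")
    return "".join(PAIR_TO_BASE[(t, c)] for t, c in zip(template, codeword))


print(apply_template("1001101", "0010111"))  # TCGTAGA
```

Because the template fixes which positions hold A/T and which hold G/C, every DNA word produced from the same template has the same GC-content, here the number of zeros in t.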

6) Algebraic Coding Approach: Using algebraic coding, DNA codes are constructed from fields and rings by mapping the elements of the field or ring to the DNA nucleotides [61].

a) Codes over Fields:

i) Linear codes over GF(4): In this approach, DNA codewords are constructed from GF(4) by using different one-to-one mappings from the elements of GF(4) to DNA nucleotides [62]. The mapping from {0, 1, ω, ω^2} to {A, C, G, T} is chosen with respect to the codes used. There are linear codes [35] and additive codes. The method used has improved the lower bounds under the GC constraint and extended the results on the length of DNA codes to n ≤ 30. Researchers extended this construction to non-linear codes and cyclic codes [63]. DNA codes over GF(4) were also constructed using BCH codes in [64], in which the protein and targeting sequences are identified as codewords of error-correcting BCH codes.

ii) Linear and Additive codes over GF(4): In [65], DNA codes are constructed from linear codes and additive codes over GF(4) of odd length satisfying the Hamming distance constraint and the reverse complement constraint. DNA codes of length 7, 9, 11 and 13 have been considered. The DNA nucleotides {A, C, G, T} have been mapped to 0, ω, ω̄ and 1 respectively, with ω̄ = ω^2 and ω^2 + ω + 1 = 0. Each codeword has been mapped to a polynomial, and the trace map Tr : GF(4) → GF(2) is defined as:

Tr(x) = x + x^2
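As a quick check of the trace map, the sketch below implements GF(4) arithmetic directly and evaluates Tr on all four elements. The integer encoding (0, 1, 2 for ω, 3 for ω^2 = ω + 1) is an assumption chosen so that addition becomes bitwise XOR; it is not notation from [65].

```python
# GF(4) = {0, 1, w, w^2} encoded as 0, 1, 2, 3, with w^2 = w + 1.
NAMES = ["0", "1", "w", "w^2"]

LOG = {1: 0, 2: 1, 3: 2}  # nonzero elements as powers of w
EXP = [1, 2, 3]           # w^0, w^1, w^2


def add(a, b):
    """Addition in characteristic 2 is coordinatewise XOR."""
    return a ^ b


def mul(a, b):
    """Multiply via discrete logarithms; w has multiplicative order 3."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 3]


def trace(x):
    """Tr(x) = x + x^2, which always lands in GF(2) = {0, 1}."""
    return add(x, mul(x, x))


for x in range(4):
    print(f"Tr({NAMES[x]}) = {trace(x)}")
```

The trace vanishes on the subfield {0, 1} and equals 1 on ω and ω^2, as expected from ω^2 + ω + 1 = 0.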

iii) Extended, Additive, Additive Extended Cyclic codes over GF(4): In the referred work, DNA codes satisfying the GC-content constraint and a minimum Hamming distance constraint were constructed using the computer algebra systems Magma [66] and Maple [67]. Longer codes, of length 4 ≤ n ≤ 30, were derived from codes over GF(4), additive codes over GF(4) and codes over Z4 (see Figure 5). Moreover, it was claimed that using different mappings from fields or rings to DNA codewords can result in different lower bounds. Further, the bounds on DNA codes satisfying the set of constraints were improved by shortening and puncturing the obtained codes [68].

b) Codes over Rings: The algebraic construction of DNA codes was further extended to codes over rings. Different rings are used to construct DNA codes by mapping ring elements to DNA nucleotides.

i) DNA sequences generated by Z4 linear codes: In [69], a biological coding system is proposed which models the existence of error correcting codes in the DNA structure. The model consists of an encoder and a modulator. The encoder consists of a mapper (which converts the DNA nucleotides to elements of Z4) and BCH codes over Z4. The modulator consists of the genetic code, tRNA (transfer RNA, which serves as the connecting link between the mRNA (messenger RNA) and the amino acid sequence of proteins) and the ribosome (the protein synthesizer of the cell), which is associated with signals that convert the genetic codons to protein. The DNA and protein coding sequences from different species have been identified as codewords of linear codes over Z4. A class of error-correcting BCH codes with parameters (n, k, 3) has been used in the encoder to construct the DNA codes over Z4.

Cyclic DNA codes over GF(4) have been computed for 4 ≤ n ≤ 30.
Extended cyclic DNA codes over GF(4) have been computed for 4 ≤ n ≤ 30.
Additive cyclic DNA codes over GF(4) have been computed for 4 ≤ n ≤ 19.
Additive extended cyclic DNA codes over GF(4) have been computed for 4 ≤ n ≤ 20.
Cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 24.
Extended cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 24.
Cosets of cyclic DNA codes over GF(4) have been computed for 4 ≤ n ≤ 20.
Cosets of extended cyclic DNA codes over GF(4) have been computed for 4 ≤ n ≤ 20.
Cosets of additive cyclic DNA codes over GF(4) have been computed for 4 ≤ n ≤ 14.
Cosets of additive extended cyclic DNA codes over GF(4) have been computed for 4 ≤ n ≤ 15.
Cosets of cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 20.
Cosets of extended cyclic DNA codes over Z4 have been computed for 4 ≤ n ≤ 30.

Figure 5: DNA code construction methods using computer algebra systems such as Maple and Magma [68].

In [70] [71], self-dual codes over Z4 are used for the construction of DNA codes. Additionally, the GC weight enumerator of the DNA codes, which determines the number of Gs and Cs in the codewords, is considered. The GC weight enumerator helps in the construction of DNA codes satisfying the GC-content constraint. Self-dual DNA codes over Z4 are developed by using the mapping A → 0, C → 1, T → 2 and G → 3. The following generator matrix G over Z4 is considered, and K4 denotes the code over Z4 generated by G. There are 16 codewords; the counts w_a(k) (the number of coordinates of k equal to a), where a ∈ Z4 and k ∈ K4, are shown in Table I.

\[
G = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & 2 & 0 & 2 \\ 0 & 0 & 2 & 2 \end{pmatrix}
\]

K4       w0  w1  w2  w3      K4       w0  w1  w2  w3
(0000)   4   0   0   0       (1111)   0   4   0   0
(2222)   0   0   4   0       (3333)   0   0   0   4
(0202)   2   0   2   0       (1313)   0   2   0   2
(2020)   2   0   2   0       (3131)   0   2   0   2
(0022)   2   0   2   0       (1133)   0   2   0   2
(2200)   2   0   2   0       (3311)   0   2   0   2
(0220)   2   0   2   0       (1331)   0   2   0   2
(2002)   2   0   2   0       (3113)   0   2   0   2

Table I: The (4, 16, 3) code generated by G, with the counts w_a(k) where a ∈ Z4 and k ∈ K4 [70].
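The 16 codewords and their symbol counts can be regenerated from G with a short script. This is a sketch for checking Table I; the names `span_z4`, `composition` and `to_dna` are illustrative, and the DNA mapping is the one stated in the text (A → 0, C → 1, T → 2, G → 3).

```python
from itertools import product

# Rows of the generator matrix G over Z4
G = [(1, 1, 1, 1), (0, 2, 0, 2), (0, 0, 2, 2)]


def span_z4(generators):
    """All Z4-linear combinations of the generator rows."""
    n = len(generators[0])
    words = set()
    for coeffs in product(range(4), repeat=len(generators)):
        word = tuple(
            sum(c * g[i] for c, g in zip(coeffs, generators)) % 4
            for i in range(n)
        )
        words.add(word)
    return sorted(words)


def composition(word):
    """(w0, w1, w2, w3): how often each element of Z4 occurs in the word."""
    return tuple(word.count(a) for a in range(4))


def to_dna(word):
    """Map Z4 symbols to nucleotides: 0 -> A, 1 -> C, 2 -> T, 3 -> G."""
    return "".join("ACTG"[a] for a in word)


K4 = span_z4(G)
print(len(K4))  # 16
for k in K4:
    print(k, composition(k), to_dna(k))
```

Under this mapping the GC-content of a codeword is w1 + w3, so every codeword in Table I has GC-content 0 or 4.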

ii) Lifted Polynomials over F16: In [72], reversible codes are constructed by using a special family of polynomials, called lifted polynomials, over F4, which generate reversible codes of odd length over F16. A 4-lifted polynomial is used to generate DNA codes of even length by using the correspondence between pairs of DNA nucleotides and elements of the field F16. Table II preserves the property that if a DNA pair is mapped to an element of F16, then the reverse of that DNA pair is mapped to the fourth power of that element. For example, α^2 → GC, so (α^2)^4 = α^8 → CG.
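The fourth-power property claimed for Table II can be verified by working with exponents of α only, since α^15 = 1 in F16. The sketch below hard-codes the multiplicative column of Table II; the names `PAIR_TO_EXP`, `fourth_power` and `reverse_property_holds` are illustrative.

```python
# Multiplicative column of Table II: DNA pair -> exponent e with pair = alpha^e.
# AA is mapped to 0, which is not a power of alpha, so it is stored as None.
PAIR_TO_EXP = {
    "AA": None, "TT": 0, "AT": 1, "GC": 2, "AG": 3, "TA": 4, "CC": 5,
    "AC": 6, "GT": 7, "CG": 8, "CA": 9, "GG": 10, "CT": 11, "GA": 12,
    "TG": 13, "TC": 14,
}
EXP_TO_PAIR = {e: p for p, e in PAIR_TO_EXP.items()}


def fourth_power(exp):
    """Exponent of (alpha^exp)^4, using alpha^15 = 1; 0^4 stays 0."""
    return None if exp is None else (4 * exp) % 15


def reverse_property_holds():
    """True if x -> x^4 sends every pair to its reversed pair."""
    return all(
        EXP_TO_PAIR[fourth_power(exp)] == pair[::-1]
        for pair, exp in PAIR_TO_EXP.items()
    )


print(reverse_property_holds())  # True
```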

Sr. No.   DNA pair   Multiplicative (F16)   Additive
1.        AA         0                      -
2.        TT         α^0                    1
3.        AT         α^1                    α
4.        GC         α^2                    α^2
5.        AG         α^3                    α^3
6.        TA         α^4                    1 + α
7.        CC         α^5                    α + α^2
8.        AC         α^6                    α^2 + α^3
9.        GT         α^7                    1 + α + α^3
10.       CG         α^8                    1 + α^2
11.       CA         α^9                    α + α^3
12.       GG         α^10                   1 + α + α^2
13.       CT         α^11                   α + α^2 + α^3
14.       GA         α^12                   1 + α + α^2 + α^3
15.       TG         α^13                   1 + α^2 + α^3
16.       TC         α^14                   1 + α^3

Table II: Mapping from DNA nucleotide pairs to elements of F16 [72].

iii) DNA codes over F2[u]/(u^4 − 1): In [73], cyclic DNA codes of odd length are obtained over the commutative ring R = F2[u]/(u^4 − 1) = {a + bu + cu^2 + du^3 | a, b, c, d ∈ F2}, where u^4 = 1. Its ideals form the chain ⟨0⟩ = ⟨(1 + u)^4⟩ ⊂ ⟨(1 + u)^3⟩ ⊂ ⟨(1 + u)^2⟩ ⊂ ⟨(1 + u)⟩ ⊂ R. The 16 elements of the ring R are mapped to pairs of nucleotides. The mapping shown in Table III was considered in [73]. This mapping realizes the complement and the reverse by adding 1 + u + u^2 + u^3 and by multiplying by u^2, respectively. To find the complement of AA, we add 1 + u + u^2 + u^3 to 0, which gives 1 + u + u^2 + u^3 = TT. To find the reverse of AA, we multiply 0 by u^2, which gives 0 = AA.

DNA pair  Ring element         DNA pair  Ring element  DNA pair  Ring element     DNA pair  Ring element
AA        0                    AT        1 + u         GT        1                CT        1 + u + u^2
TT        1 + u + u^2 + u^3    TA        u^2 + u^3     TG        u^2              TC        1 + u^2 + u^3
GG        1 + u^2              GC        u + u^2       AC        1 + u + u^3      AG        u
CC        u + u^3              CG        1 + u^3       CA        u + u^2 + u^3    GA        u^3

Table III: Mapping from DNA nucleotide pairs to elements of the ring F2 + uF2 + u^2F2 + u^3F2 where u^4 = 1 [73].

In [74], DNA cyclic codes of arbitrary length satisfying the reverse complement constraint are constructed by using the additive stem distance. The correspondence between the ring elements and DNA is established by the mapping given in Table IV. This preserves the reverse complement property of the DNA codewords by the relation x + x^c = u^3 + u^2 + u + 1. The reverse of the DNA code is obtained by multiplying any element x of the ring R by u^2.
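The two rules stated for Table III (complement by adding 1 + u + u^2 + u^3, reverse by multiplying by u^2) can be checked mechanically. In the sketch below a ring element is stored as a 4-bit integer whose bit i is the coefficient of u^i, so that addition is XOR and multiplication by u^2 is a rotation of the bits by two places because u^4 = 1. The encoding and all names are assumptions made for the illustration.

```python
def poly(*exponents):
    """Ring element given by the exponents of u with coefficient 1."""
    return sum(1 << e for e in exponents)


# Table III: DNA pair -> element of F2[u]/(u^4 - 1)
TABLE_III = {
    "AA": poly(),           "AT": poly(0, 1),    "GT": poly(0),       "CT": poly(0, 1, 2),
    "TT": poly(0, 1, 2, 3), "TA": poly(2, 3),    "TG": poly(2),       "TC": poly(0, 2, 3),
    "GG": poly(0, 2),       "GC": poly(1, 2),    "AC": poly(0, 1, 3), "AG": poly(1),
    "CC": poly(1, 3),       "CG": poly(0, 3),    "CA": poly(1, 2, 3), "GA": poly(3),
}
ELEMENT_TO_PAIR = {x: pair for pair, x in TABLE_III.items()}
ALL_ONES = poly(0, 1, 2, 3)  # 1 + u + u^2 + u^3
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def times_u2(x):
    """Multiply by u^2 modulo u^4 = 1: rotate the four coefficient bits by 2."""
    return ((x << 2) | (x >> 2)) & 0b1111


def check_table():
    """True if both rules hold for all 16 pairs of Table III."""
    for pair, x in TABLE_III.items():
        wc = "".join(COMPLEMENT[b] for b in pair)
        if ELEMENT_TO_PAIR[x ^ ALL_ONES] != wc:
            return False
        if ELEMENT_TO_PAIR[times_u2(x)] != pair[::-1]:
            return False
    return True


print(check_table())  # True
```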

DNA pair  Ring element         DNA pair  Ring element  DNA pair  Ring element     DNA pair  Ring element
GG        0                    AT        1 + u         GT        1                CT        1 + u + u^2
CC        1 + u + u^2 + u^3    TA        u^2 + u^3     TG        u^2              TC        1 + u^2 + u^3
GC        1 + u^2              AA        u + u^2       AC        1 + u + u^3      AG        u
CG        u + u^3              TT        1 + u^3       CA        u + u^2 + u^3    GA        u^3

Table IV: Mapping from DNA nucleotide pairs to elements of the ring F2 + uF2 + u^2F2 + u^3F2 where u^4 = 1 [73].

iv) DNA cyclic codes over $\mathbb{F}_2 + u\mathbb{F}_2$ where $u^2 = 0$: DNA codes of even length satisfying the reverse and reverse-complement constraints have been studied in [75]. Here $R = \mathbb{F}_2 + u\mathbb{F}_2 = \{0, 1, u, 1+u\}$, and the field $\mathbb{F}_2$ is a subring of $R$. A linear code $C$ of length $n$ over $R$ is defined to be an additive submodule of the $R$-module $R^n$. A cyclic code of length $n$ over $R$ is a linear code with the property that if $(c_0, c_1, \ldots, c_{n-1}) \in C$ then $(c_{n-1}, c_0, \ldots, c_{n-2}) \in C$. An $n$-tuple $c = (c_0, c_1, \ldots, c_{n-1}) \in R^n$ is identified with the polynomial $c_0 + c_1 x + \cdots + c_{n-1} x^{n-1}$ in the ring $R_n = R[x]/(x^n - 1)$, which is called the polynomial representation of $c$. The elements of $R$ are in one-to-one correspondence with the nucleotides A, T, G and C via $0 \to A$, $u \to T$, $1 + u \to C$ and $1 \to G$. DNA codes of length 8 and 10 are obtained from $C = \langle g^{6} + u(x^{5} + x) \rangle$ and $C = \langle g_1 g_2^{2} + u g_2,\ u g_1 g_2 \rangle$, respectively. Necessary and sufficient conditions for cyclic codes to satisfy the reverse and reverse-complement properties have also been studied. Complementation corresponds to adding $u$: for example, to find the complement of A we add $u$ to $0$, which gives $u$, that is, T.
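The correspondence in item iv) is small enough to write out directly. The sketch below assumes elements of $\mathbb{F}_2 + u\mathbb{F}_2$ are encoded as 2-bit integers (bit 0 for the constant term, bit 1 for the coefficient of $u$); the function names are illustrative and not taken from [75].

```python
# 0 -> A, u -> T, 1 + u -> C, 1 -> G, with bit 0 = constant term, bit 1 = u.
NUCLEOTIDE = {0b00: "A", 0b10: "T", 0b11: "C", 0b01: "G"}
U = 0b10


def to_dna(codeword: list) -> str:
    """Translate a word over F2 + uF2 into a DNA string."""
    return "".join(NUCLEOTIDE[c] for c in codeword)


def complement(codeword: list) -> list:
    """Add u to every coordinate; this swaps A<->T and G<->C."""
    return [c ^ U for c in codeword]


def cyclic_shift(codeword: list) -> list:
    """(c_0, ..., c_{n-1}) -> (c_{n-1}, c_0, ..., c_{n-2})."""
    return codeword[-1:] + codeword[:-1]


def reverse_complement(codeword: list) -> list:
    """Reverse the word, then complement each coordinate."""
    return complement(codeword[::-1])


if __name__ == "__main__":
    word = [0b00, 0b01, 0b10, 0b11]  # the word (0, 1, u, 1 + u)
    print(to_dna(word), to_dna(complement(word)), to_dna(reverse_complement(word)))
```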

v) DNA codes over $\mathbb{Z}_4 + u\mathbb{Z}_4$: DNA cyclic codes of odd length satisfying the reverse and reverse-complement constraints are constructed in [76] over the ring $\mathbb{Z}_4 + u\mathbb{Z}_4 = \{a + ub \mid a, b \in \mathbb{Z}_4\}$ with $u^2 = 0$. Reversible and cyclic reversible-complement codes are obtained in that paper. The authors define a Gray map that lets them translate the properties of DNA codes to binary codewords, namely $\varphi : \mathbb{Z}_4 + u\mathbb{Z}_4 \to \mathbb{Z}_4^{2}$ with $\varphi(a + ub) = (b, a + b)$ for $a, b \in \mathbb{Z}_4$. The 16 pairs of nucleotides are mapped to ring elements as shown in Table V.

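| DNA pair | Ring element | DNA pair | Ring element | DNA pair | Ring element | DNA pair | Ring element |
|---|---|---|---|---|---|---|---|
| AA | $0$ | TT | $1 + u$ | GG | $1$ | CC | $u$ |
| AT | $2$ | TA | $3 + u$ | GC | $3$ | CG | $2 + u$ |
| GT | $2u$ | CA | $1 + 3u$ | AC | $3u$ | TG | $1 + 2u$ |
| CT | $2 + 3u$ | GA | $3 + 2u$ | AG | $2 + 2u$ | TC | $3 + 3u$ |

Table V: Mapping from DNA nucleotide pairs to elements of the ring $\mathbb{Z}_4 + u\mathbb{Z}_4$ where $u^2 = 0$ [76].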
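The Gray map of item v), $\varphi(a + ub) = (b, a + b)$, can be sketched in a few lines. The representation of a ring element as a tuple `(a, b)` and the names `gray_map` and `gray_image` are assumptions made for illustration.

```python
from itertools import product


def gray_map(a: int, b: int) -> tuple:
    """phi(a + u*b) = (b, a + b), with all arithmetic in Z4."""
    return (b % 4, (a + b) % 4)


def gray_image(word: list) -> list:
    """Apply the Gray map coordinate-wise to a word over Z4 + uZ4.

    Each coordinate is a tuple (a, b) standing for a + u*b; a word of
    length n is sent to a word of length 2n over Z4.
    """
    image = []
    for a, b in word:
        image.extend(gray_map(a, b))
    return image


if __name__ == "__main__":
    print(gray_image([(2, 3), (0, 1)]))
    # The map is one-to-one on the 16 ring elements.
    print(len({gray_map(a, b) for a, b in product(range(4), repeat=2)}))
```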

vi) $\mathbb{F}_4[u]/\langle u^2 + 1 \rangle$ where $u^2 = 1$: In [77], self-reciprocal complement cyclic codes over $R = \mathbb{F}_4[u]/\langle u^2 + 1 \rangle$ are studied, where $\mathbb{F}_4 = \{0, 1, \alpha, \alpha + 1\}$ and $\alpha$ is a root of the primitive polynomial $x^2 + x + 1$ over $\mathbb{F}_2$. DNA codes of the specific length 6 over the ring are considered. A one-to-one correspondence is established between pairs of nucleotides and the 16 elements of the ring. The DNA cyclic codes constructed satisfy the reverse-complement, GC-content and Hamming distance constraints. With the basis $\{1, u + 1\}$, every element of $R$ is expressed in the form $a + b(u + 1)$ where $a, b \in \mathbb{F}_4$. The mapping is given in Table VI.

| DNA pair | Ring element | DNA pair | Ring element | DNA pair | Ring element | DNA pair | Ring element |
|---|---|---|---|---|---|---|---|
| AA | $0$ | TT | $\alpha + 1 + (\alpha + 1)u$ | GT | $1$ | CA | $\alpha + (\alpha + 1)u$ |
| AG | $\alpha$ | TC | $1 + (\alpha + 1)u$ | AT | $\alpha + 1$ | TA | $(\alpha + 1)u$ |
| TG | $u$ | AC | $\alpha + 1 + \alpha u$ | GA | $\alpha u$ | CT | $1 + \alpha + u$ |
| GC | $1 + \alpha u$ | CG | $\alpha + u$ | CC | $1 + u$ | GG | $\alpha + \alpha u$ |

Table VI: Mapping between pairs of DNA nucleotides and elements of the ring $\mathbb{F}_4[u]/\langle u^2 + 1 \rangle$ [77].

vii) DNA codes from $\mathbb{F}_2 + u\mathbb{F}_2 + v\mathbb{F}_2 + uv\mathbb{F}_2$ with $u^2 = 0$ and $v^2 = v$: The structure of cyclic DNA codes of arbitrary length over $R_2 = \mathbb{F}_2 + u\mathbb{F}_2 + v\mathbb{F}_2 + uv\mathbb{F}_2$ was studied in [78], and its relation to codes over $R_1 = \mathbb{F}_2 + u\mathbb{F}_2$ was established by defining a Gray map between $R_2$ and $R_1^{2}$. DNA codes satisfying the reverse and reverse-complement constraints are studied. The Gray map from $R_2$ to $R_1^{2}$ is defined as $\varphi(a + bv) = (a, a + b)$ with $a, b \in R_1$. The GC weight over the ring was also introduced using the image of the Gray map. A nontrivial automorphism can be defined over $R_2$ as follows: $\sigma : R_2 \to R_2$, $a + bv \mapsto a + (1 + v)b$, where $a, b \in \mathbb{F}_2 + u\mathbb{F}_2$. The map to DNA is defined using $\Phi : C \to S_{D_4}^{2n}$, $(a_0 + b_0 v, a_1 + b_1 v, \ldots, a_{n-1} + b_{n-1} v) \mapsto (a_0, a_1, \ldots, a_{n-1}, a_0 + b_0, a_1 + b_1, \ldots, a_{n-1} + b_{n-1})$. Table VII gives the mapping from elements of the ring to DNA pairs described in the paper. For instance, applying $\zeta$ coordinate by coordinate, $(c_0, c_1, c_2, c_3) = (u + v, u, v, 1)$ is mapped to TCTTAGGG.

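| Element $a$ | Gray image | Double DNA pair $\zeta(a)$ | Element $a$ | Gray image | Double DNA pair $\zeta(a)$ |
|---|---|---|---|---|---|
| $0$ | $(0, 0)$ | AA | $v$ | $(0, 1)$ | AG |
| $uv$ | $(0, u)$ | AT | $v + uv$ | $(0, 1 + u)$ | AC |
| $1$ | $(1, 1)$ | GG | $1 + v$ | $(1, 0)$ | GA |
| $1 + uv$ | $(1, 1 + u)$ | GC | $1 + v + uv$ | $(1, u)$ | GT |
| $u$ | $(u, u)$ | TT | $u + v$ | $(u, 1 + u)$ | TC |
| $u + uv$ | $(u, 0)$ | TA | $u + v + uv$ | $(u, 1)$ | TG |
| $1 + u$ | $(1 + u, 1 + u)$ | CC | $1 + u + v$ | $(1 + u, u)$ | CT |
| $1 + u + uv$ | $(1 + u, 1)$ | CG | $1 + u + v + uv$ | $(1 + u, 0)$ | CA |

Table VII: Mapping between DNA pairs and elements of the ring $\mathbb{F}_2 + u\mathbb{F}_2 + v\mathbb{F}_2 + uv\mathbb{F}_2$ with $u^2 = 0$ and $v^2 = v$ [78].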
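Table VII follows from the Gray map $\varphi(a + bv) = (a, a + b)$ combined with the single-nucleotide correspondence $0 \to A$, $1 \to G$, $u \to T$, $1 + u \to C$. A minimal sketch, assuming each element $a + bv$ is stored as a tuple `(a, b)` of 2-bit integers (bit 0 for the constant term, bit 1 for $u$); the names `gray`, `zeta` and `encode` are illustrative.

```python
# Elements of R1 = F2 + uF2 as 2-bit integers: 0, 1, u = 0b10, 1 + u = 0b11.
R1_TO_BASE = {0b00: "A", 0b01: "G", 0b10: "T", 0b11: "C"}
ZERO, ONE, U = 0b00, 0b01, 0b10


def gray(a: int, b: int) -> tuple:
    """phi(a + b*v) = (a, a + b); addition in R1 is XOR on the bit masks."""
    return (a, a ^ b)


def zeta(a: int, b: int) -> str:
    """Double DNA pair attached to the element a + b*v, as in Table VII."""
    first, second = gray(a, b)
    return R1_TO_BASE[first] + R1_TO_BASE[second]


def encode(word: list) -> str:
    """Apply zeta to each coordinate of a word over R2 and concatenate."""
    return "".join(zeta(a, b) for a, b in word)


if __name__ == "__main__":
    # The word (u + v, u, v, 1) written as (a, b) pairs.
    word = [(U, ONE), (U, ZERO), (ZERO, ONE), (ONE, ZERO)]
    print(encode(word))
```

Under Table VII the four coordinates of this word encode to TC, TT, AG and GG in turn.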

viii) Codes over $\mathbb{F}_4 + v\mathbb{F}_4$: Linear, constacyclic and cyclic codes over the ring $R = \mathbb{F}_4[v]/(v^2 - v)$ are constructed in [79]. The ring $R = \{a + vb \mid a, b \in \mathbb{F}_4\}$ is a non-chain, finite, semi-local Frobenius ring with 16 elements, where $\mathbb{F}_4 = \{0, 1, \omega, \omega + 1\}$ with $\omega^2 = \omega + 1$. The Gray map from $R$ to $\mathbb{F}_4 \times \mathbb{F}_4$ is given by $\varphi(c) = (a + b, a)$ for $c = a + vb$. If $a = (a_1, a_2, \ldots, a_n) \in R^n$, then the Hamming weight of $a$ is the sum of the Hamming weights of its components, i.e. $w(a) = \sum_{i=1}^{n} w(a_i)$. The Hamming distance between $a$ and $b$ in $R^n$ is $d(a, b) = w(a - b)$. The Lee weight of an element of $R$ is the Hamming weight of its Gray image, i.e. $w_L(c) = w_H(\varphi(c))$. Table VIII gives the mapping used in [80].

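| Element $a$ | Gray image | Double DNA pair $\xi(a)$ | Element $a$ | Gray image | Double DNA pair $\xi(a)$ |
|---|---|---|---|---|---|
| $0$ | $(0, 0)$ | AA | $1$ | $(1, 1)$ | TT |
| $\omega$ | $(\omega, \omega)$ | CC | $1 + \omega$ | $(1 + \omega, 1 + \omega)$ | GG |
| $v$ | $(1, 0)$ | TA | $1 + v$ | $(0, 1)$ | AT |
| $v + \omega$ | $(1 + \omega, \omega)$ | GC | $1 + v + \omega$ | $(\omega, 1 + \omega)$ | CG |
| $v\omega$ | $(\omega, 0)$ | CA | $1 + v\omega$ | $(1 + \omega, 1)$ | GT |
| $\omega + v\omega$ | $(0, \omega)$ | AC | $1 + \omega + v\omega$ | $(1, 1 + \omega)$ | TG |
| $v + v\omega$ | $(1 + \omega, 0)$ | GA | $1 + v + v\omega$ | $(\omega, 1)$ | CT |
| $\omega + v + v\omega$ | $(1, \omega)$ | TC | $1 + v + \omega + v\omega$ | $(0, 1 + \omega)$ | AG |

Table VIII: Mapping between DNA pairs and elements of the ring $\mathbb{F}_4 + v\mathbb{F}_4$ [79].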
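The Gray map and Lee weight of item viii) can be illustrated as follows. This is a sketch under the assumption that $\mathbb{F}_4$ is encoded as 2-bit integers ($0$, $1$, $\omega = 2$, $1 + \omega = 3$, addition by XOR) and that an element $a + vb$ is passed as the pair `(a, b)`; the nucleotide assignment $0 \to A$, $1 \to T$, $\omega \to C$, $1 + \omega \to G$ is read off Table VIII.

```python
# F4 = {0, 1, w, 1 + w} encoded as 0, 1, 2, 3; addition in F4 is XOR.
F4_TO_BASE = {0: "A", 1: "T", 2: "C", 3: "G"}


def gray(a: int, b: int) -> tuple:
    """phi(a + v*b) = (a + b, a)."""
    return (a ^ b, a)


def lee_weight(a: int, b: int) -> int:
    """Lee weight of a + v*b: Hamming weight of its Gray image."""
    return sum(1 for x in gray(a, b) if x != 0)


def xi(a: int, b: int) -> str:
    """Double DNA pair attached to a + v*b, as in Table VIII."""
    return "".join(F4_TO_BASE[x] for x in gray(a, b))


def lee_weight_of_word(word: list) -> int:
    """Lee weight of a word over R: sum of the coordinate Lee weights."""
    return sum(lee_weight(a, b) for a, b in word)


if __name__ == "__main__":
    print(xi(0, 1), xi(2, 1), xi(3, 3))        # v, v + w, 1 + v + w + vw
    print(lee_weight_of_word([(0, 1), (1, 0), (0, 0)]))
```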

ix) $R = \mathbb{F}_2[u]/(u^6)$: In [81], DNA cyclic codes over the ring $R_1 = \mathbb{F}_2[u]/(u^6)$ and the ring $R_2 = \mathbb{F}_2 + v\mathbb{F}_2$ where $v^2 = v$, satisfying the reverse-complement constraint, have been constructed. A new family of DNA skew cyclic codes is introduced over the ring $R = \mathbb{F}_2 + v\mathbb{F}_2 = \{0, 1, v, v + 1\}$ where $v^2 = v$. The ring $R_1 = \mathbb{F}_2[u]/(u^6) = \{a_0 + a_1 u + a_2 u^2 + a_3 u^3 + a_4 u^4 + a_5 u^5 ;\ a_i \in \mathbb{F}_2,\ u^6 = 0\}$. There is a direct map between the 64 elements of the ring and the 64 codons (triples of nucleotides) used in nature as a substrate for amino acid synthesis, shown in Table IX.

| DNA codon | Ring element | DNA codon | Ring element | DNA codon | Ring element | DNA codon | Ring element |
|---|---|---|---|---|---|---|---|
| CCC | $u^5 + u^4 + u^3 + u^2 + u + 1$ | GGG | $0$ | ACT | $u^3 + u^2 + u + 1$ | GTC | $u^4 + u^2 + u + 1$ |
| GGA | $u^5 + u^4 + u^3 + u^2 + u$ | CCT | $1$ | ACG | $u^3 + u^2 + u$ | ACA | $u^3 + u^2 + u + 1$ |
| GGC | $u^5 + u^4 + u^3 + u^2 + 1$ | CCG | $u$ | TTT | $u^4 + u^2 + 1$ | GAC | $u^5 + u^3 + u^2 + 1$ |
| GGT | $u^5 + u^4 + u^3 + u^2$ | CCA | $u + 1$ | TTG | $u^4 + u^2 + u$ | AGG | $u^5 + u^3 + u + 1$ |
| AGG | $u^5 + u^4 + u^3 + u + 1$ | TCC | $u^2$ | CTA | $u^4 + u + 1$ | GAT | $u^5 + u^3 + u^2$ |
| CGG | $u^5 + u^4 + u^2 + u + 1$ | GCC | $u^3$ | GTT | $u^4 + u^3 + 1$ | GTA | $u^4 + u^3 + u + 1$ |
| GAG | $u^5 + u^3 + u^2 + u + 1$ | CTC | $u^4$ | GTG | $u^4 + u^3 + u$ | ATT | $u^4 + u^3 + u^2 + 1$ |
| AGA | $u^5 + u^4 + u^3 + u^2 + u$ | TCT | $u^2 + 1$ | TCA | $u^2 + u + 1$ | ATA | $u^4 + u^3 + u^2 + u$ |
| AGC | $u^5 + u^4 + u^3 + 1$ | TCG | $u^2 + u$ | CAA | $u^5 + u^2 + u$ | ATC | $u^4 + u^3 + u^2$ |
| ATG | $u^4 + u^3 + u^2 + u + 1$ | TAC | $u^5$ | CAC | $u^5 + u^2 + 1$ | TGA | $u^5 + u^4 + u$ |
| AGT | $u^5 + u^4 + u^3$ | TAT | $u^5 + 1$ | GCA | $u^3 + u + 1$ | AAT | $u^5 + u^2 + u + 1$ |
| CGA | $u^5 + u^4 + u^2 + u$ | GCT | $u^3 + 1$ | TTA | $u^4 + u^3$ | AAA | $u^5 + u^3 + u$ |
| CGC | $u^5 + u^4 + u^2 + 1$ | GCG | $u^3 + u$ | ACC | $u^3 + u^2$ | TGC | $u^5 + u^4 + 1$ |
| CGT | $u^5 + u^4 + u^2$ | TAA | $u^5 + u$ | CAT | $u^5 + u^2$ | AAC | $u^5 + u^3 + 1$ |
| TGG | $u^5 + u^4 + u + 1$ | CTG | $u^4 + u$ | TGT | $u^5 + u^4$ | TCC | $u^4 + u^2$ |
| GAA | $u^5 + u^4 + u^3 + u^2 + u$ | CTT | $u^4 + 1$ | CAG | $u^5 + u^3$ | TAG | $u^5 + u + 1$ |

Table IX: Mapping between the 64 elements of the ring $\mathbb{F}_2 + u\mathbb{F}_2 + u^2\mathbb{F}_2 + u^3\mathbb{F}_2 + u^4\mathbb{F}_2 + u^5\mathbb{F}_2$ and the 64 codons [81].

x) DNA codes over the ring $\mathbb{F}_2[u]/(u^2 - 1)$: The ring $R = \mathbb{F}_2[u]/(u^2 - 1) = \{0, 1, u, 1 + u\}$ where $u^2 = 1$ was described in [82], [83]. Elements of the ring were directly mapped to DNA nucleotides such that the reverse-complement constraint is conserved. The complement is obtained by adding $1 + u$ to an element of the ring. For example, $0 \to A$, $u \to C$, $1 \to G$, $1 + u \to T$ preserves the complement property: $\overline{0} = 1 + u$ and $\overline{u} = 1$. The elements of the ring $R$ can be mapped to the elements of $\mathbb{F}_2 = \{0, 1\}$ via the map $\theta$ where $\theta(0) = \theta(u + 1) = 0$ and $\theta(1) = \theta(u) = 1$. Let $C$ be a cyclic code in $R_n$. The map $\theta$ can be extended to a map $\phi : C \to \mathbb{Z}_2[x]/(x^n - 1)$ defined by $\phi(a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}) = \theta(a_0) + \theta(a_1) x + \cdots + \theta(a_{n-1}) x^{n-1}$.

xi) DNA cyclic codes over $\mathbb{F}_2 + u\mathbb{F}_2$: In [84], odd-length codes over the ring satisfying the reverse-complement, GC-content and thermodynamic constraints are studied. They are obtained from cyclic complement-reversible codes. An infinite family of BCH DNA codes is constructed. A mapping $\Phi$, called the Gray map, is used to map linear codes over $R$ to binary linear codes. The Gray map $\Phi$ is a distance-preserving map $(R^n, \text{Lee distance}) \to (\mathbb{F}_2^{2n}, \text{Hamming distance})$.

xii) Cyclic DNA codes over $\mathbb{Z}_4 + w\mathbb{Z}_4$: Recently, cyclic DNA codes over $R = \mathbb{Z}_4 + w\mathbb{Z}_4$ where $w^2 = 2$ and $S = \mathbb{Z}_4 + w\mathbb{Z}_4 + v\mathbb{Z}_4 + wv\mathbb{Z}_4$ where $v^2 = vw$ have been studied [85]. In this work, odd-length DNA cyclic codes over $R$ satisfying the reverse and reverse-complement constraints are studied. A family of DNA skew cyclic codes with the reverse-complement property over $R$ is also constructed, and the binary images of the cyclic DNA codes over $R$ and $S$ are determined. The correspondence between elements of the ring $R$ and double DNA pairs is established as described in Table X.

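| Element $a$ | Gray image | Double DNA pair $\xi(a)$ | Element $a$ | Gray image | Double DNA pair $\xi(a)$ |
|---|---|---|---|---|---|
| $0$ | $(0, 0)$ | AA | $1$ | $(1, 0)$ | CA |
| $2$ | $(2, 0)$ | GA | $3$ | $(3, 0)$ | TA |
| $\omega$ | $(0, 1)$ | AC | $2\omega$ | $(0, 2)$ | AG |
| $3\omega$ | $(0, 3)$ | AT | $1 + \omega$ | $(1, 1)$ | CC |
| $1 + 2\omega$ | $(1, 2)$ | CG | $1 + 3\omega$ | $(1, 3)$ | CT |
| $2 + \omega$ | $(2, 1)$ | GC | $2 + 2\omega$ | $(2, 2)$ | GG |
| $2 + 3\omega$ | $(2, 3)$ | GT | $3 + \omega$ | $(3, 1)$ | TC |
| $3 + 2\omega$ | $(3, 2)$ | TG | $3 + 3\omega$ | $(3, 3)$ | TT |

Table X: Mapping between DNA pairs and elements of the ring $\mathbb{Z}_4 + w\mathbb{Z}_4$ (with $w$ written as $\omega$ in the table) [85].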
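Table X is regular enough to generate: the element $a + b\omega$ has image $(a, b)$, and each coordinate is read as a nucleotide via $0 \to A$, $1 \to C$, $2 \to G$, $3 \to T$. The sketch below also checks an observation drawn from the table itself (not a claim taken from [85]): the Watson-Crick complement of the pair for $(a, b)$ is the pair for $(3 - a, 3 - b)$. The function names are illustrative.

```python
Z4_TO_BASE = "ACGT"  # 0 -> A, 1 -> C, 2 -> G, 3 -> T, read off Table X
WATSON_CRICK = str.maketrans("ACGT", "TGCA")


def pair(a: int, b: int) -> str:
    """Double DNA pair attached to a + b*w, following Table X."""
    return Z4_TO_BASE[a % 4] + Z4_TO_BASE[b % 4]


def encode(word: list) -> str:
    """Concatenate the pairs of all coordinates of a word over Z4 + wZ4."""
    return "".join(pair(a, b) for a, b in word)


def complement_matches_table() -> bool:
    """Check that complementing a pair corresponds to (a, b) -> (3 - a, 3 - b)."""
    return all(
        pair(a, b).translate(WATSON_CRICK) == pair(3 - a, 3 - b)
        for a in range(4) for b in range(4)
    )


if __name__ == "__main__":
    print(encode([(1, 2), (3, 0)]))  # the word (1 + 2w, 3)
    print(complement_matches_table())
```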

7) Algebraic Number Theory codes: All of the aforementioned methods, which construct DNA codes from classical coding theory or by heuristic approaches, work well for small lengths $n$. In [86], the author constructed DNA codes using algebraic number theory, making a first attempt to use irreducible cyclic codes to build DNA codes with large length ($n < 1000$) and a large number of codewords ($M < 7000$) satisfying the GC-content constraint.

V. SOFTWARE TOOLS FOR DNA CODES GENERATION

Several tools have been developed for designing DNA codewords. DNA Sequence Generator [44] and the evolutionary-algorithm-based program PUNCH (Princeton University Nucleotide Computing Heuristic) [87] have been used to find sets of dissimilar sequences. DNA Sequence Generator and Compiler use a graph-based approach built on overlapping subsequences. The GUI of the software allows the user to import DNA sequences into the sequence wizard with different parameters. It also calculates the melting temperature of the DNA sequences, and it checks for the reverse-complement constraint and for forbidden DNA strands. However, this software does not take into account secondary structure formation of the DNA strands. PUNCH is used to select bit sets for DNA computing. It works by randomization over three basic parameters N, B and V, where N is the number of bits in the problem, B is the number of nucleotides in each bit, and V is the number of variations on each bit set.
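The kinds of checks listed above can be sketched in a few lines. The source does not say which melting-temperature model the tool uses; the example below assumes the simple Wallace rule for short oligonucleotides, $T_m = 2(\#A + \#T) + 4(\#G + \#C)$ in degrees Celsius, purely for illustration. All function names are hypothetical and unrelated to the actual software.

```python
WATSON_CRICK = str.maketrans("ACGT", "TGCA")


def wallace_tm(seq: str) -> int:
    """Melting temperature estimate (deg C) by the Wallace rule (an assumption)."""
    at = sum(seq.count(base) for base in "AT")
    gc = sum(seq.count(base) for base in "GC")
    return 2 * at + 4 * gc


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand."""
    return seq.translate(WATSON_CRICK)[::-1]


def has_forbidden(seq: str, forbidden: list) -> bool:
    """True if any forbidden substring occurs in the strand."""
    return any(pattern in seq for pattern in forbidden)


def is_self_reverse_complement(seq: str) -> bool:
    """True if the strand equals its own reverse complement."""
    return seq == reverse_complement(seq)


if __name__ == "__main__":
    strand = "ACGTACGT"
    print(wallace_tm(strand), reverse_complement("AACG"))
    print(has_forbidden(strand, ["GGGG", "TACG"]), is_self_reverse_complement(strand))
```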

Here, the author has created a web application in which the user first selects a mode, either specific or range. In both modes, the user selects the constraints that the DNA codes must satisfy. In specific mode, the input parameters are n and d, where n is the length of the codewords and d is the Hamming distance. In range mode, the input parameters are n1 and n2, where n1 is the starting length and n2 is the ending length, together with d1 and d2, where d1 is the starting Hamming distance and d2 is the ending Hamming distance. The web portal displays the number of codewords as well as the codewords that satisfy the constraints.
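The two modes can be imitated with a small greedy search. This is only a sketch: the source does not describe the algorithm behind the web application, so a lexicographic greedy construction is assumed here, and the constraint definitions (a fixed GC count per codeword, and Hamming distance at least d between every codeword and the reverse complement of every codeword, itself included) follow the usual conventions of the literature. The function names `greedy_dna_code` and `range_mode` are hypothetical.

```python
from itertools import product

WATSON_CRICK = str.maketrans("ACGT", "TGCA")


def hamming(x: str, y: str) -> int:
    """Number of positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def revcomp(x: str) -> str:
    """Reverse complement of a DNA string."""
    return x.translate(WATSON_CRICK)[::-1]


def gc_content(x: str) -> int:
    """Number of G and C symbols in the string."""
    return sum(c in "GC" for c in x)


def greedy_dna_code(n: int, d: int, gc=None, rc: bool = False) -> list:
    """'Specific' mode: scan all 4^n words, keeping each word compatible so far.

    gc: required GC count per codeword, or None to skip the GC constraint.
    rc: if True, enforce the reverse-complement constraint as well.
    """
    code = []
    for letters in product("ACGT", repeat=n):
        w = "".join(letters)
        if gc is not None and gc_content(w) != gc:
            continue
        if rc and hamming(w, revcomp(w)) < d:
            continue
        if all(hamming(w, c) >= d and (not rc or hamming(w, revcomp(c)) >= d)
               for c in code):
            code.append(w)
    return code


def range_mode(n1: int, n2: int, d1: int, d2: int, **constraints) -> dict:
    """'Range' mode: code sizes for every length n1..n2 and distance d1..d2."""
    return {(n, d): len(greedy_dna_code(n, d, **constraints))
            for n in range(n1, n2 + 1) for d in range(d1, d2 + 1) if d <= n}


if __name__ == "__main__":
    code = greedy_dna_code(4, 3, gc=2, rc=True)
    print(len(code), code)
    print(range_mode(3, 4, 2, 3, rc=True))
```

The exhaustive scan over $4^n$ words is practical only for small $n$; the size of any code found this way is a lower bound on the corresponding maximum code size, not necessarily the optimum.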

VI. BOUNDS ON DNA CODES

Several bounds have been studied for DNA codes. This section summarizes the types of bounds obtained for each set of constraints. Table XI records, for each set of constraints, the bounds that have been obtained. Note that n is the length, d is the minimum Hamming distance and w is the GC content of the DNA code. One can observe that most of the bounds concern the reverse constraint. More importantly, there are no bounds on DNA codes satisfying the HD, GC, R and RC constraints all together. There have also been very few attempts to explore bounds under thermodynamic constraints, so methods that allow the formulation of such bounds remain to be explored. More details on the bounds are given in Appendix A.

| Quantity | Number of bounds ticked (out of bounds 1 to 17) |
|---|---|
| $A_4^{HD}(n, d, w)$ | 1 |
| $A_4^{R}(n, d, w)$ | 10 |
| $A_4^{RC}(n, d, w)$ | 2 |
| $A_4^{GC}(n, d, w)$ | 5 |
| $A_4^{R,RC}(n, d, w)$ | 0 |
| $A_4^{RC,GC}(n, d, w)$ | 6 |
| $A_4^{R,GC}(n, d, w)$ | 2 |
| $A_4^{R,RC,GC}(n, d, w)$ | 0 |

Table XI: Bounds on DNA codes: 1 - Johnson-type bound [88], [55]; 2 - Halving bounds [88], [55]; 3 - Gilbert-type bounds [55]; 4 - Relation between $A_4^{GC,RC}$ and $A_4^{GC,R}$ [55]; 5 - $A_4^{GC}$ bound; 6 - $A_4^{GC,RC}$ bound; 7 - $A_4^{GC}(n, d, w)$ bound; 8 - Product bounds [88], [55]; 9 - Relation between $A_4^{RC}$ and $A_4^{R}$ [88]; 10 to 16 - Relations between $A_4^{GC}$, $A_4^{RC}$ and $A_4^{R}$ [14]; 17 - V. Phan bound [89] satisfying the Hamming distance constraint.

VII. APPLICATIONS OF DNA CODES

DNA codes are used in various technologies such as DNA computing [1], surface-based DNA computation [90]–[93], DNA microarray technology [94], molecular barcodes for chemical libraries [95], DNA nanostructures [96], [97], DNA origami [10], data encryption [98]–[102], data storage [13], [103]–[106], signal processing [107], and DNA nanodevices and circuits [108]. Recently, the potential of DNA codes in phylogenetic studies has been reported [109]. They have also contributed to the understanding of gene regulatory networks [110] and protein-coding genes [111], [112], and to the study of the structure of genes via circular codes [113].

In DNA computing, DNA codes with specific properties are required to perform various parallel and logical operations [114]. Molecular barcodes [115] generated from DNA codes are used as biomarkers for product authentication. DNA codes are used to create DNA nanostructures with potential applications such as targeted drug delivery systems. Nanostructures require DNA codes with high stability and robustness, which can be achieved by using efficient encoding procedures. Recently, DNA has been used in data hiding techniques to encrypt data more effectively; the DNA codes used there are designed as encryption or decryption keys. In the last few years, DNA-based data storage systems [42] have received attention from many researchers. DNA codes used for data storage must have properties that achieve dense storage capacity and good error-correction capability.

To use DNA codes in any application, the fundamental constraints on DNA codes are unavoidable. A DNA code design must satisfy the constraints for stability and robustness, but when DNA codes are designed for a specific application, further constraints must be added to make them functional and practical. For instance, correlated and uncorrelated constraints were added to DNA codes for the development of a random-access, rewritable DNA-based data storage system. Given the potential of DNA codes and the advances in biotechnology methods, DNA codes promise applications in emerging technologies.

VIII. DNA CODES TABLE

Many tables of lower bounds on DNA codes satisfying various sets of constraints have been obtained. Tables XII and XIII give lower bounds on DNA codes satisfying the GC and reverse-complement (RC) constraints. Table XIV gives lower bounds on DNA codes satisfying the Hamming distance and reverse-complement constraints. These bounds are compiled from [36], [52], [62], [68], [71], [86].

IX. FUTURE WORK AND CHALLENGES

The design of DNA codes has received a great deal of attention from researchers in the last decade. In spite of the different approaches proposed in the literature for the construction of DNA codes under various constraints, designing optimal DNA codes is still a challenge. Classifying DNA codes by application would make it easier to use them in practice. Though researchers have worked on improving the bounds on DNA codes satisfying given sets of constraints, designing DNA codes that satisfy the maximum number of constraints while achieving the bound remains a major challenge. Better codes with larger length and distance may be designed using other algebraic or computational methods, which would improve the known bounds and supply bounds for the missing values of n and d. Defining a DNA code as a mathematical structure with suitable operations is an interesting area to explore, in which different operations can be defined on DNA nucleotides that satisfy the properties desired for computation. There have been attempts to develop DNA codes satisfying thermodynamic constraints, though it remains a challenge to develop DNA codes that fit the practical application of DNA strands well. An important research problem is to work on optimality conditions for DNA codes. The design of DNA codes and the practical protocols involved in DNA computation could be automated by developing a platform on which DNA codes can be designed and simulated to check their performance and accuracy. With the emerging areas of algebraic coding and biological coding theory, these challenges can be investigated and resolved.

[Table XII: Lower bounds on DNA codes satisfying GC and reverse-complement constraints for $4 \le n \le 30$ and $2 \le d \le 9$.]

[Table XIII: Lower bounds on DNA codes satisfying GC and reverse-complement constraints for $4 \le n \le 36$ and $10 \le d \le 30$.]

[Table XIV: Lower bounds on DNA codes satisfying Hamming distance and reverse-complement constraints for $4 \le n \le 20$ and $2 \le d \le 20$.]

REFERENCES

[1] L. M. Adleman, “Molecular computation of solutions to combinatorial problems,” Science, vol. 266, no. 5187, pp. 1021–1024, 1994.
[2] C. C. Maley, “DNA computation: theory, practice, and prospects,” Evolutionary computation, vol. 6, no. 3, pp. 201–229, 1998.
[3] C. Calude and G. Paun, Computing with cells and atoms: an introduction to quantum, DNA and membrane computing. CRC Press, 2000.
[4] M. Garzon, P. Neathery, R. Deaton, R. C. Murphy, D. R. Franceschetti, and S. Stevens Jr, “A new metric for DNA computing,” in Proceedings of the 2nd Genetic Programming Conference, 1997, pp. 472–278.
[5] R. Deaton, M. Garzon, R. Murphy, J. Rose, D. Franceschetti, and S. E. Stevens Jr, “Reliability and efficiency of a DNA-based computation,” Physical Review Letters, vol. 80, no. 2, p. 417, 1998.
[6] G. Rozenberg and A. Salomaa, “DNA computing: new ideas and paradigms,” in Automata, Languages and Programming. Springer, 1999, pp. 106–118.
[7] E. Winfree, F. Liu, L. A. Wenzler, and N. C. Seeman, “Design and self-assembly of two-dimensional DNA crystals,” Nature, vol. 394, no. 6693, pp. 539–544, 1998.
[8] E. Winfree, “Algorithmic self-assembly of DNA,” Ph.D. dissertation, California Institute of Technology, 1998.
[9] N. C. Seeman, “DNA nanotechnology: novel DNA constructions,” Annual review of biophysics and biomolecular structure, vol. 27, no. 1, pp. 225–248, 1998.
[10] P. W. Rothemund, “Folding DNA to create nanoscale shapes and patterns,” Nature, vol. 440, no. 7082, pp. 297–302, 2006.
[11] M. Arita, “Writing information into DNA,” in Aspects of Molecular Computing. Springer, 2004, pp. 23–35.

[12] A. G. D’yachkov, P. L. Erdos, A. J. Macula, V. V. Rykov, D. C. Torney, C.-S. Tung, P. A. Vilenkin, and P. S. White, “Exordium for DNA codes,” Journal of Combinatorial Optimization, vol. 7, no. 4, pp. 369–379, 2003.
[13] D. Limbachiya and M. K. Gupta, “Natural data storage: A review on sending information from now to then via nature,” arXiv preprint arXiv:1505.04890, 2015.
[14] A. Marathe, A. E. Condon, and R. M. Corn, “On combinatorial DNA word design,” Journal of Computational Biology, vol. 8, no. 3, pp. 201–219, 2001.
[15] S. Hussini, L. Kari, and S. Konstantinidis, “Coding properties of DNA languages,” in DNA Computing. Springer, 2002, pp. 57–69.
[16] M. K. Gupta, “The quest for error correction in biology,” Engineering in Medicine and Biology Magazine, IEEE, vol. 25, no. 1, pp. 46–53, 2006.
[17] J. Watada et al., “DNA computing and its applications,” in Intelligent Systems Design and Applications, 2008. ISDA’08. Eighth International Conference on, vol. 2. IEEE, 2008, pp. 288–294.
[18] X. Wang, Z. Bao, J. Hu, S. Wang, and A. Zhan, “Solving the sat problem using a DNA computing algorithm based on ligase chain reaction,” Biosystems, vol. 91, no. 1, pp. 117–125, 2008.
[19] M. Darehmiraki, “A semi-general method to solve the combinatorial optimization problems based on nanocomputing,” International Journal of Nanoscience, vol. 9, no. 05, pp. 391–398, 2010.
[20] A. J. Hartemink, D. K. Gifford, and J. Khodor, “Automated constraint-based nucleotide sequence selection for DNA computation,” Biosystems, vol. 52, no. 1, pp. 227–235, 1999.
[21] J. Li, Q. Zhang, R. Li, and S. Zhou, “Optimization of DNA encoding based on combinatorial constraints,” ICIC Express Letters, vol. 2, no. 1, pp. 81–88, 2008.
[22] R. Penchovsky and J. Ackermann, “DNA library design for molecular computation,” Journal of Computational Biology, vol. 10, no. 2, pp. 215–229, 2003.
[23] E. B. Baum, “DNA sequences useful for computation,” in Proceedings of DNA-based Computers II, Princeton. In AMS DIMACS Series, vol. 44, 1999, pp. 235–241.
[24] A. G. D’yachkov, P. A. Vilenkin, I. K. Ismagilov, R. S. Sarbaev, A. Macula, D. Torney, and S. White, “On DNA codes,” Problems of Information Transmission, vol. 41, no. 4, pp. 349–367, 2005.
[25] O. Milenkovic and N. Kashyap, “On the design of codes for DNA computing,” in Coding and Cryptography. Springer, 2006, pp. 100–119.
[26] M. H. Garzon, V. Phan, S. Roy, and A. J. Neel, “In search of optimal codes for DNA computing,” in DNA Computing. Springer, 2006, pp. 143–156.
[27] L. P. Dinu and A. Sgarro, “A low-complexity distance for DNA strings,” Fundamenta Informaticae, vol. 73, no. 3, pp. 361–372, 2006.
[28] A. D’yachkov, A. Macula, V. Rykov, and V. Ufimtsev, “DNA codes based on stem similarities between DNA sequences,” in DNA Computing. Springer, 2007, pp. 146–151.
[29] J. Sager and D. Stefanovic, “Designing nucleotide sequences for computation: a survey of constraints,” in DNA Computing. Springer, 2006, pp. 275–289.
[30] J. Sun, “Bounds on edit metric codes with combinatorial DNA constraints,” Master’s thesis, Brock University, 2010.
[31] P. P. Debata, D. Mishra, K. Shaw, and S. Mishra, “A coding theoretic model for error-detecting in DNA sequences,” Procedia Engineering, vol. 38, pp. 1773–1777, 2012.
[32] D. Ashlock, S. K. Houghten, J. A. Brown, and J. Orth, “On the synthesis of DNA error correcting codes,” Biosystems, vol. 110, no. 1, pp. 1–8, 2012.
[33] L. C. Faria, A. S. Rocha, J. H. Kleinschmidt, M. C. Silva-Filho, E. Bim, R. H. Herai, M. E. Yamagishi, and R. Palazzo Jr, “Is a genome a codeword of an error-correcting code?” PloS one, vol. 7, no. 5, p. e36644, 2012.
[34] Y. M. Chee and S. Ling, “Improved lower bounds for constant GC-content DNA codes,” Information Theory, IEEE Transactions on, vol. 54, no. 1, pp. 391–394, 2008.
[35] D. H. Smith, N. Aboluion, R. Montemanni, and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d and constant GC-content,” Discrete Mathematics, vol. 311, no. 13, pp. 1207–1219, 2011.
[36] D. Tulpan, D. H. Smith, and R. Montemanni, “Thermodynamic post-processing versus GC-content pre-processing for DNA codes satisfying the Hamming distance and reverse-complement constraints,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 11, no. 2, pp. 441–452, 2014.
[37] M. A. Bishop, A. G. D’yachkov, A. J. Macula, T. E. Renz, and V. V. Rykov, “Free energy gap and statistical thermodynamic fidelity of DNA codes,” Journal of Computational Biology, vol. 14, no. 8, pp. 1088–1104, 2007.
[38] A. G. D’yachkov, A. J. Macula, W. K. Pogozelski, T. E. Renz, V. V. Rykov, and D. C. Torney, “A weighted insertion-deletion stacked pair thermodynamic metric for DNA codes,” in DNA Computing. Springer, 2005, pp. 90–103.
[39] Q. Zhang, B. Wang, and X. Wei, “Evaluating the different combinatorial constraints in DNA computing based on minimum free energy,” MATCH Commun. Math. Comput. Chem, vol. 65, pp. 291–308, 2011.
[40] D. Tulpan, M. Andronescu, S. B. Chang, M. R. Shortreed, A. Condon, H. H. Hoos, and L. M. Smith, “Thermodynamically based DNA strand design,” Nucleic acids research, vol. 33, no. 15, pp. 4951–4964, 2005.
[41] Q. Zhang, B. Wang, X. Wei, X. Fang, and C. Zhou, “DNA word set design based on minimum free energy,” NanoBioscience, IEEE Transactions on, vol. 9, no. 4, pp. 273–277, 2010.
[42] S. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, “A rewritable, random-access DNA-based storage system,” arXiv preprint arXiv:1505.02199, 2015.
[43] D. C. Tulpan, “Effective heuristic methods for DNA strand design,” Ph.D. dissertation, The University Of British Columbia, 2006.
[44] U. Feldkamp, H. Rauhe, and W. Banzhaf, “Software tools for DNA sequence design,” Genetic Programming and Evolvable Machines, vol. 4, no. 2, pp. 153–171, 2003.
[45] U. Feldkamp, W. Banzhaf, H. Rauhe et al., “A DNA sequence compiler,” in Poster presented at Sixth International Meeting on DNA Based Computers, Leiden, 2000.
[46] U. Feldkamp, S. Saghafi, W. Banzhaf, and H. Rauhe, “DNAsequencegenerator: A program for the construction of DNA sequences,” in DNA Computing. Springer, 2001, pp. 23–32.
[47] J. M. Gray, T. G. Frutos, A. M. Berman, A. E. Condon, M. G. Lagally, L. M. Smith, and R. M. Corn, “Reducing errors in DNA computing by appropriate word design,” University of Wisconsin, Department of Chemistry, 1996.
[48] P. Hansen, N. Mladenovic, and J. A. M. Perez, “Variable neighbourhood search: methods and applications,” Annals of Operations Research, vol. 175, no. 1, pp. 367–407, 2010.
[49] S. Kawashimo, H. Ono, K. Sadakane, and M. Yamashita, “DNA sequence design by dynamic neighborhood searches,” in DNA Computing. Springer, 2006, pp. 157–171.
[50] R. Montemanni and D. H. Smith, “Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm,” Journal of Mathematical Modelling and Algorithms, vol. 7, no. 3, pp. 311–326, 2008.
[51] S. Kawashimo, H. Ono, K. Sadakane, and M. Yamashita, “Dynamic neighborhood searches for thermodynamically designing DNA sequence,” in DNA Computing. Springer, 2008, pp. 130–139.
[52] R. Montemanni, D. Smith, and N. Koul, “Three metaheuristics for the construction of constant GC-content DNA codes,” Lecture Notes in Management Science, vol. 6, pp. 167–175, 2014.
[53] Q. Qiu, D. Burns, Q. Wu, and P. Mukre, “Hybrid architecture for accelerating DNA codeword library searching,” in Computational Intelligence and Bioinformatics and Computational Biology, 2007. CIBCB’07. IEEE Symposium on. IEEE, 2007, pp. 323–330.


[54] N. Bennenni, K. Guenda, and A. Gulliver, “Construction of codes for DNA computing by the greedy algorithm,” ACM Commun. Comput. Algebra, vol. 49, no. 1, pp. 14–14, Jun. 2015. [Online]. Available: http://doi.acm.org/10.1145/2768577.2768583
[55] O. D. King, “Bounds for DNA codes with constant GC-content,” Electron. J. Combin, vol. 10, no. 1, p. 33, 2003.
[56] D. C. Tulpan, H. H. Hoos, and A. E. Condon, “Stochastic local search algorithms for DNA word design,” in DNA Computing. Springer, 2003, pp. 229–241.
[57] R. Deaton, M. Garzon, R. Murphy, and J. R. Koza, “Genetic search of reliable encodings for DNA-based computation,” Late Breaking Papers at the Genetic Programming 1996, pp. 9–15, 1996.
[58] M. Arita and S. Kobayashi, “DNA sequence design using templates,” New Generation Computing, vol. 20, no. 3, pp. 263–277, 2002.
[59] W. Liu, S. Wang, L. Gao, F. Zhang, and J. Xu, “DNA sequence design based on template strategy,” Journal of chemical information and computer sciences, vol. 43, no. 6, pp. 2014–2018, 2003.
[60] S. Kobayashi, T. Kondo, and M. Arita, “On template method for DNA sequence design,” in DNA Computing. Springer, 2003, pp. 205–214.
[61] R. Selvakumar, “Unconventional construction of DNA codes: group homomorphism,” Journal of Discrete Mathematical Sciences and Cryptography, vol. 17, no. 3, pp. 227–237, 2014.
[62] P. Gaborit and O. D. King, “Linear constructions for DNA codes,” Theoretical Computer Science, vol. 334, no. 1, pp. 99–113, 2005.
[63] N. Aboluion, D. H. Smith, and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d, constant GC-content and a reverse-complement constraint,” Discrete Mathematics, vol. 312, no. 5, pp. 1062–1075, 2012.
[64] L. Faria, A. Rocha, J. Kleinschmidt, R. Palazzo, and M. Silva-Filho, “DNA sequences generated by BCH codes over GF(4),” Electronics letters, vol. 46, no. 3, pp. 202–203, 2010.
[65] T. Abualrub, A. Ghrayeb, and X. N. Zeng, “Construction of cyclic codes over GF(4) for DNA computing,” Journal of the Franklin Institute, vol. 343, no. 4, pp. 448–457, 2006.
[66] J. Cannon, A. Steel et al., “The MAGMA computational algebra system,” Software available online (magma.maths.usyd.edu.au), 2005.
[67] A. Heck and W. Koepf, Introduction to MAPLE. Springer-Verlag New York, 1993, vol. 1993.
[68] A. Niema, “The construction of DNA codes using a computer algebra system,” Ph.D. dissertation, University of Glamorgan, 2011.
[69] A. S. L. Rocha, L. C. B. Faria, J. H. Kleinschmidt, R. Palazzo Jr, and M. C. Silva-Filho, “DNA sequences generated by Z4 linear codes,” in Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1320–1324.
[70] B. Feng, S. Bai, B. Chen, and X. Zhou, “The constructions of DNA codes from linear self-dual codes over Z4,” International Conference on Computer Information Systems and Industrial Applications, 2015.
[71] Z. Varbanov, T. Todorov, and M. Hristova, “A method for constructing DNA codes from additive self-dual codes over GF(4),” in Proc. CAIM conference, Romania, vol. 40, 2014.
[72] S. E. Oztas and I. Siap, “Lifted polynomials over F16 and their applications to DNA codes,” Filomat, vol. 27, no. 3, pp. 459–466, 2013.
[73] B. Yildiz and I. Siap, “Cyclic codes over F2[u]/(u^4 − 1) and applications to DNA codes,” Computers & Mathematics with Applications, vol. 63, no. 7, pp. 1169–1176, 2012.
[74] K. Guenda, T. A. Gulliver, and P. Sole, “On cyclic DNA codes,” in Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on. IEEE, 2013, pp. 121–125.
[75] J. Liang and L. Wang, “On cyclic DNA codes over F2 + uF2,” Journal of Applied Mathematics and Computing, pp. 1–11, 2015.
[76] S. Pattanayak and A. K. Singh, “On cyclic DNA codes over the ring Z4 + uZ4,” arXiv preprint arXiv:1508.02015, 2015.
[77] F. Ma, Y. Cao, and J. Gao, “On cyclic DNA codes over F4[u]/(u^2 + 1),” International Journal of Research and Reviews in Applied Sciences, vol. 24, no. 3, p. 101, 2015.
[78] S. Zhu and X. Chen, “Cyclic DNA codes over F2 + uF2 + vF2 + uvF2,” arXiv preprint arXiv:1508.07113, 2015.
[79] A. Bayram, E. S. Oztas, and I. Siap, “Codes over F4 + vF4 and some DNA applications,” Designs, Codes and Cryptography, pp. 1–15, 2015.
[80] B. Srinivasulu and M. Bhaintwal, “Reversible cyclic codes over F4 + uF4 and their applications to DNA codes,” in 2015 7th International Conference on Information Technology and Electrical Engineering (ICITEE). IEEE, 2015, pp. 101–105.
[81] N. Bennenni, K. Guenda, and S. Mesnager, “New DNA cyclic codes over rings,” arXiv preprint arXiv:1505.06263, 2015.
[82] I. Siap, T. Abualrub, and A. Ghrayeb, “Cyclic DNA codes over the ring F2[u]/(u^2 − 1) based on the deletion distance,” Journal of the Franklin Institute, vol. 346, no. 8, pp. 731–740, 2009.
[83] H. Mostafanasab and A. Y. Darani, “On cyclic DNA codes over F2 + uF2 + u^2F2,” arXiv preprint arXiv:1603.05894, 2016.
[84] K. Guenda and T. A. Gulliver, “Construction of cyclic codes over F2 + uF2 for DNA computing,” arXiv preprint arXiv:1207.3385, 2012.
[85] A. Dertli and Y. Cengellenmis, “On cyclic DNA codes over the rings Z4 + wZ4 and Z4 + wZ4 + vZ4 + wvZ4,” arXiv preprint arXiv:1605.02968, 2016.
[86] H. Hong, L. Wang, H. Ahmad, J. Li, Y. Yang, and C. Wu, “Construction of DNA codes by using algebraic number theory,” Finite Fields and Their Applications, vol. 37, pp. 328–343, 2016.
[87] A. J. Ruben, S. J. Freeland, and L. F. Landweber, “Punch: An evolutionary algorithm for optimizing bit set selection,” in DNA Computing. Springer, 2001, pp. 150–160.
[88] Z. Ignatova, I. Martinez-Perez, and K.-H. Zimmermann, DNA computing models. Springer Science & Business Media, 2008.
[89] Q. Zhang and B. Wang, “On the bounds of DNA coding with h-distance,” MATCH Commun. Math. Comput. Chem, vol. 66, pp. 371–380, 2011.
[90] L. M. Smith, R. M. Corn, A. E. Condon, M. G. Lagally, A. G. Frutos, Q. Liu, and A. J. Thiel, “A surface-based approach to DNA computation,” Journal of computational biology, vol. 5, no. 2, pp. 255–267, 1998.
[91] A. G. Frutos, Q. Liu, A. J. Thiel, A. M. W. Sanner, A. E. Condon, L. M. Smith, and R. M. Corn, “Demonstration of a word design strategy for DNA computing on surfaces,” Nucleic Acids Research, vol. 25, no. 23, pp. 4748–4757, 1997.
[92] Q. Liu, L. Wang, A. G. Frutos, A. E. Condon, R. M. Corn, and L. M. Smith, “DNA computing on surfaces,” Nature, vol. 403, no. 6766, pp. 175–179, 2000.
[93] H. Wu, “An improved surface-based method for DNA computation,” Biosystems, vol. 59, no. 1, pp. 1–5, 2001.
[94] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science, vol. 270, no. 5235, pp. 467–470, 1995.
[95] S. Brenner and R. A. Lerner, “Encoded combinatorial chemistry,” Proceedings of the National Academy of Sciences, vol. 89, no. 12, pp. 5381–5383, 1992.
[96] J. Reif, H. Chandran, N. Gopalkrishnan, and T. LaBean, “Self-assembled DNA nanostructures and DNA devices,” Nanofabrication Handbook, pp. 299–328, 2012.
[97] Y. Yang, Y. Liu, and H. Yan, “DNA nanostructures as programmable biomolecular scaffolds,” Bioconjugate chemistry, 2015.
[98] G. Jacob and A. Murugan, “DNA based cryptography: An overview and analysis,” Int J Emerg Sci, vol. 3, no. 1, pp. 27–36, 2013.
[99] D. Tulpan, C. Regoui, G. Durand, L. Belliveau, and S. Leger, “Hyden: a hybrid steganocryptographic approach for data encryption using randomized error-correcting DNA codes,” BioMed research international, vol. 2013, 2013.
[100] A. Aich, A. Sen, S. R. Dash, and S. Dehuri, “A symmetric key cryptosystem using DNA sequence with OTP key,” in Information Systems Design and Intelligent Applications. Springer, 2015, pp. 207–215.
[101] X. Wei, L. Guo, Q. Zhang, J. Zhang, and S. Lian, “A novel color image encryption algorithm based on DNA sequence operation and hyper-chaotic system,” Journal of Systems and Software, vol. 85, no. 2, pp. 290–299, 2012.


[102] R. Enayatifar, A. H. Abdullah, and I. F. Isnin, “Chaos-based image encryption using a hybrid genetic algorithm and a DNA sequence,” Optics and Lasers in Engineering, vol. 56, pp. 83–93, 2014.
[103] M. Arita, “Method for designing DNA codes used as information carrier,” May 27 2004, US Patent App. 10/558,502.
[104] G. M. Church, Y. Gao, and S. Kosuri, “Next-generation digital information storage in DNA,” Science, vol. 337, no. 6102, pp. 1628–1628, 2012.
[105] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” Nature, vol. 494, no. 7435, pp. 77–80, 2013.
[106] S. Yazdi, H. M. Kiah, E. R. Garcia, J. Ma, H. Zhao, and O. Milenkovic, “DNA-based storage: Trends and methods,” arXiv preprint arXiv:1507.01611, 2015.
[107] S. Tsaftaris, A. K. Katsaggelos, T. N. Pappas, E. T. Papoutsakis et al., “DNA computing from a signal processing viewpoint,” Signal Processing Magazine, IEEE, vol. 21, no. 5, pp. 100–106, 2004.
[108] L. Qian and E. Winfree, “Scaling up digital circuit computation with DNA strand displacement cascades,” Science, vol. 332, no. 6034, pp. 1196–1201, 2011.
[109] M. H. Garzon and T.-Y. Wong, “DNA chips for species identification and biological phylogenies,” Natural Computing, vol. 10, no. 1, pp. 375–389, 2011.
[110] J. Dingel and O. Milenkovic, “Coding-theoretic methods for reverse engineering of gene regulatory networks,” in Information Theory Workshop, 2008. ITW’08. IEEE. IEEE, 2008, pp. 114–118.
[111] D. G. Arques and C. J. Michel, “A complementary circular code in the protein coding genes,” Journal of theoretical biology, vol. 182, no. 1, pp. 45–58, 1996.
[112] D. G. Arques and C. J. Michel, “A code in the protein coding genes,” BioSystems, vol. 44, no. 2, pp. 107–134, 1997.
[113] C. J. Michel, “A 2006 review of circular codes in genes,” Computers & Mathematics with Applications, vol. 55, no. 5, pp. 984–988, 2008.
[114] M. H. Garzon, “Theory and applications of DNA codeword design,” in Theory and Practice of Natural Computing. Springer, 2012, pp. 11–26.
[115] L. V. Bystrykh, “Generalized DNA barcode design based on Hamming codes,” PloS one, vol. 7, no. 5, p. e36852, 2012.

APPENDIX A
BOUNDS ON DNA CODES

1) Johnson-type bound - This bound is derived by shortening the code to length $n-1$: the codewords with a fixed character $b \in \mathbb{Z}_q$ at the $i$th position, where $i \in \{1, \dots, n\}$, are selected, and the $i$th position is then deleted from them. For $0 \le d \le n$ and $0 < w < n$,

a) $A_4^{GC}(n,d,w) \le \left\lfloor \frac{2n}{w}\, A_4^{GC}(n-1,d,w-1) \right\rfloor$ [55].

b) $A_4^{GC}(n,d,w) \le \left\lfloor \frac{2n}{n-w}\, A_4^{GC}(n-1,d,w) \right\rfloor$ [55].

c) $A_4^{R}(n,d) \le \left\lfloor \frac{1}{4}\, A_4^{R}(n-1,d) \right\rfloor$ [88].

2) Halving bounds -

a) This is motivated by the fact that the DNA code $C_{DNA}$ and the reverse DNA code $C^{R}_{DNA}$ are disjoint. Let $n \ge 1$ be an integer. For each integer $d$ with $0 < d \le n$, $A_4^{R}(n,d) \le \frac{1}{2} A_4(n,d)$ [88].

b) For $0 < d \le n$ and $0 \le w \le n$, $A_4^{GC,RC}(n,d,w) \le \frac{1}{2} A_4^{GC}(n,d,w)$ [88].

c) For $0 < d \le n$ and $0 \le w \le n$, $A_4^{GC,R}(n,d,w) \le \frac{1}{2} A_4^{GC}(n,d,w)$ [88].

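3) Gilbert-type bounds - A lower bound on $A_4^{GC}(n,d,w)$ is obtained by dividing the total number of DNA strings of length $n$ with GC-content $w$ by the number of such strings at distance at most $d-1$ from a fixed codeword.

a) For $0 \le d \le n$ and $0 \le w \le n$,
\[
A_4^{GC}(n,d,w) \ge \frac{\binom{n}{w}\, 2^{w}\, 2^{n-w}}{\sum_{r=0}^{d-1} \sum_{i=0}^{\min\{\lfloor r/2 \rfloor,\, w,\, n-w\}} \binom{w}{i} \binom{n-w}{i} \binom{n-2i}{r-2i}\, 2^{2i}} \quad [88].
\]

b) For $0 \le d \le n$ and $0 \le w \le n$,
\[
A_4^{GC}(n,d,w) \ge \frac{\binom{n}{w}\, 2^{n}}{\sum_{r=0}^{d-1} \sum_{i=0}^{\min\{\lfloor r/2 \rfloor,\, w,\, n-w\}} \binom{w}{i} \binom{n-w}{i} \binom{n-2i}{r-2i}\, 2^{2i}} \quad [55].
\]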
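As a rough illustration, the Gilbert-type bound can be evaluated numerically. The following is a minimal sketch rather than code from [55] or [88]: the function names `gc_ball_size` and `gilbert_gc_bound` are hypothetical, $d \ge 1$ is assumed so that the denominator is non-zero, and the quotient is rounded up because a code size is an integer.

```python
from math import comb


def gc_ball_size(n: int, w: int, radius: int) -> int:
    """Count the length-n words with GC-content w that lie within Hamming
    distance `radius` of a fixed word that also has GC-content w."""
    total = 0
    for r in range(radius + 1):
        for i in range(min(r // 2, w, n - w) + 1):
            # i positions switch from {G, C} to {A, T} and i switch back
            # (2 choices each, hence 2^(2i)); the other r - 2i changes stay
            # inside their own class, which leaves a single choice each.
            total += comb(w, i) * comb(n - w, i) * comb(n - 2 * i, r - 2 * i) * 4 ** i
    return total


def gilbert_gc_bound(n: int, d: int, w: int) -> int:
    """Gilbert-type lower bound on A_4^GC(n, d, w), rounded up (assumes d >= 1)."""
    if d < 1:
        raise ValueError("this sketch assumes d >= 1")
    words = comb(n, w) * 2 ** n      # all words of length n with GC-content w
    ball = gc_ball_size(n, w, d - 1)
    return -(-words // ball)         # ceiling division on integers
```

For instance, `gilbert_gc_bound(4, 2, 2)` divides the 96 words of length 4 with GC-content 2 by a ball of size 5 and rounds up to 20.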

4) These relations rest on the fact that a DNA codeword and its reverse have the same GC-content [55]. For $0 \le d \le n$ and $0 \le w \le n$,

a) $A_4^{GC,RC}(n,d,w) = A_4^{GC,R}(n,d,w)$ if $n$ is even.

b) $A_4^{GC,RC}(n,d,w) \le A_4^{GC,R}(n+1,d+1,w)$ if $n$ is odd.

c) $A_4^{GC,R}(n,d+1,w) \le A_4^{GC,R}(n,d,w) \le A_4^{GC,R}(n,d-1,w)$ if $n$ is odd.

5) Using the inequality $A_4^{GC}(n,d,w) \ge A_2(n,d,w) \cdot A_2(n,d)$, the following value is derived [55]. For $0 \le w \le n$,
\[
A_4^{GC}(n,2,w) = \binom{n}{w}\, 2^{n-1}.
\]


6) Using the halving bound $A_4^{GC,RC}(n,d,w) \le \frac{1}{2} A_4^{GC}(n,d,w)$ for $d = 2$, one can calculate the following value [55]. For $0 \le w \le n$ and $n$ even,
\[
A_4^{GC,RC}(n,2,w) = \binom{n}{w}\, 2^{n-2}.
\]

7) This bound on $A_4^{GC}(n,d,w)$ is computed by dividing the total number of words with GC-content $w$ that are at distance at least $d$ from their reverse-complements by the number of such words that are at distance at most $d-1$ from any fixed codeword [55]. For $0 \le d \le n$ and $0 \le w \le n$,
\[
A_4^{GC}(n,d,w) \ge \frac{\sum_{r=d}^{n} V(n,d,r)}{2 \sum_{r=0}^{d-1} \sum_{i=0}^{\min\{\lfloor r/2 \rfloor,\, w,\, n-w\}} \binom{w}{i} \binom{n-w}{i} \binom{n-2i}{r-2i}\, 2^{2i}}.
\]

8) Product bounds - These are based on constructing a DNA code of length $n$, minimum Hamming distance $d$ and GC-content $w$, whose maximum size is $A_4^{GC}(n,d,w)$, from binary constant-weight codes $A_2(n,d,w)$ and ternary constant-weight codes $A_3(n,d,w)$ with length $n$, Hamming weight $w$ and minimum Hamming distance $d$ [55] [88]. For $0 \le d \le n$ and $0 \le w \le n$,

a) $A_4^{GC}(n,d,w) \ge A_2(n,d,w) \cdot A_2(n,d)$

b) $A_4^{GC,R}(n,d,w) \ge A_2^{R}(n,d,w) \cdot A_2(n,d)$

c) $A_4^{GC,R}(n,d,w) \ge A_2(n,d,w) \cdot A_2^{R}(n,d)$

d) $A_4^{GC}(n,d,w) \ge A_3(n,d,w) \cdot A_2(n-w,d)$

e) $A_4^{GC,R}(n,d,w) \ge A_3^{R}(n,d,w) \cdot A_2(n-w,d)$

f) $A_4^{GC,R}(n,d,w) \ge A_3(n,d,w) \cdot A_2^{R}(n-w,d)$

g) $A_4^{R}(n,d) \ge A_2^{R}(n,d) \cdot A_2(n,d)$.
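The construction behind product bound (a) can be sketched as follows: a binary constant-weight codeword marks the G/C positions, and a second binary codeword selects one of the two symbols within each class. This is a minimal sketch under stated assumptions; the bit-to-nucleotide pairing in `SYMBOL` and the names `product_code`, `hamming` and `min_distance` are illustrative choices, not notation from [55] or [88].

```python
from itertools import combinations, product

# Assumed pairing: the first bit picks the class ({A, T} or {C, G}),
# the second bit picks the symbol inside that class.
SYMBOL = {(0, 0): "A", (0, 1): "T", (1, 0): "C", (1, 1): "G"}


def hamming(x, y):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))


def product_code(weight_code, binary_code):
    """Combine every constant-weight word with every binary codeword."""
    return ["".join(SYMBOL[pair] for pair in zip(a, b))
            for a in weight_code for b in binary_code]


def min_distance(code):
    """Smallest pairwise Hamming distance in a code with at least two words."""
    return min(hamming(x, y) for x, y in combinations(code, 2))


# Toy instance with n = 4, d = 2, w = 2:
# all weight-2 words (pairwise distance >= 2) and all even-weight words.
weight_code = [a for a in product((0, 1), repeat=4) if sum(a) == 2]
parity_code = [b for b in product((0, 1), repeat=4) if sum(b) % 2 == 0]
dna_code = product_code(weight_code, parity_code)
```

In this toy instance the 6 constant-weight words and 8 even-weight words give 48 DNA words with GC-content 2 and minimum distance 2, which matches the value $\binom{4}{2} 2^{3}$ from item 5.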

9) Reverse and reverse-complement code bounds - These can be observed directly from the reverse and reverse-complement properties of DNA codewords [88] [14]. Let $n \ge 1$ be an integer.

a) If $n$ is even, then $A_4^{RC}(n,d) = A_4^{R}(n,d)$, and

b) if $n$ is odd, then $A_4^{RC}(n,d) \le A_4^{R}(n+1,d+1)$.
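One standard way to see the equality for even $n$ is to complement the first $n/2$ symbols of every codeword, which turns reverse distances into reverse-complement distances while leaving ordinary distances unchanged. The sketch below checks this exhaustively for small $n$; it is an illustration of that argument rather than code from [14] or [88], and the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def reverse(x: str) -> str:
    return x[::-1]


def reverse_complement(x: str) -> str:
    return "".join(COMPLEMENT[c] for c in reversed(x))


def complement_first_half(x: str) -> str:
    """Complement the first floor(n/2) symbols and keep the rest."""
    half = len(x) // 2
    return "".join(COMPLEMENT[c] for c in x[:half]) + x[half:]


def map_preserves_distances(n: int) -> bool:
    """Check over all word pairs of length n that the map keeps d(x, y) and
    sends d(x, y^R) to d(f(x), f(y)^RC)."""
    words = ["".join(p) for p in product("ACGT", repeat=n)]
    for x in words:
        fx = complement_first_half(x)
        for y in words:
            fy = complement_first_half(y)
            if hamming(x, y) != hamming(fx, fy):
                return False
            if hamming(x, reverse(y)) != hamming(fx, reverse_complement(fy)):
                return False
    return True
```

The check succeeds for $n = 2$ and $n = 4$ and fails for $n = 3$, where the middle symbol is paired with itself, which is in line with the odd case being only an inequality.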

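10) Special cases - These values are obtained by considering particular combinations of the length $n$ and the GC-content $w$ [14]. For $n > 0$, with $0 \le d \le n$ and $0 \le w \le n$,

a) $A_4^{GC}(n,d,0) = A_2(n,d)$

b) $A_4^{GC}(n,d,w) = A_4^{GC}(n,d,n-w)$

c) \[
A_4^{GC}(n,n,w) = \begin{cases} 4 & \text{if } w = n/2, \\ 3 & \text{if } n/3 \le w < n/2 \text{ or } n/2 < w \le 2n/3, \\ 2 & \text{if } w < n/3 \text{ or } w > 2n/3 \end{cases}
\]

d) \[
A_4^{GC,RC}(n,n,w) = \begin{cases} 2 & \text{if } w = n/2, \\ 1 & \text{if } w \ne n/2 \end{cases}
\]

e) $A_4^{GC}(n,1,w) = \binom{n}{w}\, 2^{n}$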
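For very small parameters the special cases can be checked by exhaustive search. The following is a minimal sketch with hypothetical function names (`gc_words`, `max_gc_code`); the backtracking search is a generic maximum-clique enumeration chosen for illustration, and it is only practical for tiny $n$.

```python
from itertools import product
from math import comb


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def gc_words(n, w):
    """All DNA words of length n with exactly w symbols from {G, C}."""
    return ["".join(p) for p in product("ACGT", repeat=n)
            if sum(c in "GC" for c in p) == w]


def max_gc_code(n, d, w):
    """Exhaustive search for A_4^GC(n, d, w); feasible only for tiny n."""
    best = 0

    def extend(size, candidates):
        nonlocal best
        best = max(best, size)
        for k, word in enumerate(candidates):
            # Stop when even taking every remaining candidate cannot win.
            if size + len(candidates) - k <= best:
                break
            # Keep only later candidates far enough from the chosen word.
            rest = [v for v in candidates[k + 1:] if hamming(word, v) >= d]
            extend(size + 1, rest)

    extend(0, gc_words(n, w))
    return best
```

With $d = 1$ every word is admissible, so the search returns $\binom{n}{w} 2^{n}$ as in case (e); for $n = 2$ and $n = 3$ with $d = n$ it reproduces the values 4, 3 and 2 of case (c).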


11) Bounds on reverse codes for $d = 3$ - This bound is computed from the sphere-packing bound for codes [14].
\[
A_q^{R}(n,3) \le \frac{q^{\lceil n/2 \rceil} \sum_{i=2}^{\lfloor n/2 \rfloor} \binom{\lfloor n/2 \rfloor}{i} (q-1)^{i}}{2\,\bigl(1 + 4(q-2) + (n-4)(q-1)\bigr)}.
\]

12) Bounds on reverse codes of size $S$ - These bounds are derived by using a greedy approach to calculate the size of the code [14]. Let $V(s,d)$ be the number of words of $S$ that have distance $d$ from $s$, where $s \in S$.

a) $A_q^{R}(n,d) \le \dfrac{|S|}{2\,V^{-}(\lfloor (d-1)/2 \rfloor)}$, where $V^{-}(d) = \min\{V(s,d) \mid s \in S\}$.

b) $A_q^{R}(n,d) \ge \dfrac{|S|}{2\,V^{+}(d-1)}$, where $V^{+}(d) = \max\{V(s,d) \mid s \in S\}$.

13) Bounds on reverse codes for $d = 2$ - The values for $d = 2$ follow from the claims:
• Any two words from the same subset differ in at least two positions, i.e. $d_H(x_{DNA}, y_{DNA}) \ge 2$ for all $x_{DNA}, y_{DNA} \in C_{DNA}$.
• If a word belongs to a subset, its reversal is also in the same subset, i.e. if $x_{DNA} \in C_{DNA}$ then $x^{R}_{DNA} \in C_{DNA}$.
• All the $q^{n/2}$ palindromes are in the same subset.

a) $A_q^{R}(n,2) = \dfrac{q^{n-1}}{2}$, for even $n$ and $q \in \{2, 4\}$, and

b) $A_q^{R}(n,2) = \dfrac{q^{n-1} - q^{\lfloor n/2 \rfloor}}{2}$, for odd $n$ and $q \in \{2, 4\}$.
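The closed forms for $d = 2$ can be compared with an exhaustive search on very small instances. The sketch below assumes the usual definition of a reverse code, namely $d_H(x, y) \ge d$ for distinct codewords and $d_H(x, y^{R}) \ge d$ for all codewords including $x = y$; the function names are hypothetical, and the search is a plain backtracking enumeration that is practical only for tiny $q$ and $n$.

```python
from itertools import product


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def max_reverse_code(q, n, d):
    """Exhaustive search for A_q^R(n, d) over the alphabet {0, ..., q-1}."""
    # A codeword must itself be at distance >= d from its own reverse.
    words = [w for w in product(range(q), repeat=n)
             if hamming(w, w[::-1]) >= d]
    best = 0

    def extend(size, candidates):
        nonlocal best
        best = max(best, size)
        for k, x in enumerate(candidates):
            if size + len(candidates) - k <= best:
                break  # the remaining candidates cannot beat the best code
            # d(x, y^R) equals d(x^R, y), so one reverse check is enough.
            rest = [y for y in candidates[k + 1:]
                    if hamming(x, y) >= d and hamming(x, y[::-1]) >= d]
            extend(size + 1, rest)

    extend(0, words)
    return best


def reverse_d2_formula(q, n):
    """Closed form for A_q^R(n, 2) as stated in item 13."""
    if n % 2 == 0:
        return q ** (n - 1) // 2
    return (q ** (n - 1) - q ** (n // 2)) // 2
```

For $q = 2$ with $n = 2, 3, 4$ and for $q = 4$ with $n = 2$, the search and the closed form agree.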

14) Doubling construction - This is motivated by the minimum Hamming distance between a DNA codeword, its reverse and its reverse complement. It follows from the properties of a DNA code with $d_H(x_{DNA}, y_{DNA}) \ge d$, $H_{DNA}(x^{R}_{DNA}, y_{DNA}) \ge d$, $H_{DNA}(x_{DNA}, y^{C}_{DNA}) \ge d$ and $H_{DNA}(x_{DNA}, y^{RC}_{DNA}) \ge d$.

For $n \ge 2$, $A_2^{R}(2^{n}, 2^{n-1}) = 2^{n}$ [14].

15) Bounds on even- and odd-length reverse codes - These bounds relate the maximum size of a reverse code of length $n$ to that of a reverse code of length $n-1$ [14].

a) $A_q^{R}(n-1,d) \le A_q^{R}(n,d) \le A_q^{R}(n,d-1)$, and

b) $A_q^{R}(n-1,d) \ge A_q^{R}(n,d)/q$, for odd $n$.

16) Bounds on Hamming distance constraint - V. Phan provided lower and upper bounds on the size of DNA codeword sets $S$ that satisfy the h-distance constraint [89]:
\[
\frac{4^{\,n-d+1}}{d \binom{n}{d-1}} \le |S| \le 4^{n}.
\]
