DNA Discontinuity Analysis: An Algorithmic System

Md. Sarwar Kamal1*, Sonia Farhana Nimmy2, Mohammad Ibrahim Khan1, Mohammad Shibli Kaysar1, Shuxiang Xu3

Affiliations: 1Department of Computer Science & Engineering, Chittagong University of Engineering and Technology (CUET), Chittagong 4349, Bangladesh. 2BGC Trust University Bangladesh. 3University of Tasmania, Australia.

* Corresponding Author: Md. Sarwar Kamal

[email protected]

Department of Computer Science & Engineering, Chittagong University of Engineering and Technology (CUET), Chittagong 4349, Bangladesh

Cell:8801553315278

Keywords: DNA damage, STACUMSUM, Maximum Likelihood, Geometric Distance.

Running Title: Automated DNA-break detection

Abstract

Damage or breaks in DNA may change the characteristics of a genome and cause various diseases. In this work we construct a system that uses a Maximum Likelihood (ML) based probabilistic formula to assess the number of damage events in a DNA sequence. The approach is benchmarked on simulated data sets so that its output can be compared with ground truth or reference values. First, the order of the sequence data set is checked through the Statistical Cumulative Sum (STACUMSUM). Prior and posterior probabilities are then applied to the verified sequences to estimate the percentages of breaks and mutations. Maximum Likelihood Estimation (MLE) then determines the numbers and positions of the breaks and damage. In database manipulation, one factor that decides the orientation and order of a sequence is the geometric distance between consecutive sequences; this distance is measured to obtain a smooth representation of the genome or DNA sequences. Finally, we compare the performance of our system with DAMBE5 (A Comprehensive Software Package for Data Analysis in Molecular Biology and Evolution). In terms of time and space complexity, our Automated System for Structural Break Detection (ASSBD) is faster and consumes less space, which we attribute to its algorithmic approach.

INTRODUCTION

Recent developments in experimental biotechnology and the advent of next generation sequencing (NGS) technology have produced large volumes of sequence data. Genome sequences encode the regular features and normal condition of plants, animals and other organisms. However, external and internal factors may cause breaks or damage in DNA sequences, and it is crucial to locate the break points within long sequences. Structural break detection is generally phrased as a hypothesis test, and several recent works [1-3] have addressed it. Structural breaks can have various causes, one of the most important being cancer therapy: although chemotherapy is a popular treatment against cancer cells, it may damage the structure of regular DNA sequences [4-6].

Genome integrity is constantly challenged by DNA lesions, millions of which occur in each human cell within a year [7]. Most of these lesions arise from cell metabolism, DNA replication, radiation, exposure to toxic environmental chemicals, and programmed DNA lesions in lymphocytes and germ cells [8-12]. The changes that damage causes in DNA can be destructive and can ultimately result in mutations and chromosomal aberrations. A faulty response to breaks or damage contributes to degeneration and to disorders such as developmental defects, neurodegenerative disease and cancer [9], which underlines the need for a proper DNA Damage Response (DDR) in every living cell and organism. DNA damage can occur in a single strand or in both strands. Here we mainly designed a simulated data set with DNA single-strand breaks (SSBs) and double-strand breaks (DSBs) to study structural breaks.

Double-strand breaks are the most important class of DNA damage because they cut through both strands of the DNA sequence. DSBs are generated by ionizing radiation (IR) or radiomimetic drugs and also appear in cells treated with topoisomerase II inhibitors. The frequency of DNA double-strand breaks sometimes increases with changes in single-strand breaks [8, 9].

Various tools have been designed to support dry-lab and wet-lab research on DNA damage [13-33]. One important factor for an efficient computing environment is adequate memory size and good utilization of memory space. In this regard we have estimated the geometric distance between two consecutive nucleotide bases. One standard way to reduce the complexity of binding nucleotide bases in memory is to use shared modules [34]. The processors of the system search for structural breaks in round-robin fashion [35]. High-quality DNA sequencing is a critical issue not only in computer science but also for biological and medical problems. Current DNA sequencing technologies [36] such as Illumina [http://www.illumina.com] and SoLiD [http://www.appliedbiosystems.com] have reduced costs significantly, and hence sequence data are generated in large quantities every day.

The ultimate goal of every living organism is to pass its genetic material, intact and unaltered, to the next generation. This must be accomplished despite continual assault on the DNA by internal and environmental agents. To counter this threat, it is essential to detect DNA damage and structural breaks in sequences, signal their presence and mediate their repair. These responses affect a large number of cellular processes and are biologically significant because they prevent diverse human diseases. It has been estimated that each of the approximately 10^13 cells in the human body receives tens of thousands of DNA lesions per day [37]. These lesions can stall genome replication and copying, which results in mutations in the sequences. It would be highly valuable if such damage, breaks and other alterations could be identified by automated tools. At present, breaks can only be detected in the laboratory, which is expensive and time consuming. To reduce the time and cost of detecting breaks, damage or other changes in sequences, we have designed this automated tool.

In this study we take an in silico approach and design an automated tool to detect such DNA damage or breaks. First, we collect random DNA sequences in FASTA format. We use the Statistical Cumulative Sum (STACUMSUM) to investigate the order of the collected data set, i.e. to check that it has the proper structure and format. If a data set has lost its order, our system corrects it and then computes the percentages of damage or breaks using prior and posterior probabilities. The final analysis applies Maximum Likelihood Estimation to all estimated data sets to find the exact breaks and damage. To use memory efficiently, the geometric distance between two consecutive base pairs is measured and used to lay out memory in both the database and the hardware. We have also compared the performance of this system with that of DAMBE5 in terms of time and space complexity. Owing to its reduced memory use, our system outperforms DAMBE5 on both measures.
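The first step of the pipeline, reading sequences in FASTA format, can be illustrated with a minimal sketch. It is not the authors' code: the class name `FastaSketch` and the method `parse` are hypothetical, and it assumes a current Java release (8 or later) and plain multi-record FASTA text in which each record starts with a `>` header line.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: turns FASTA text into (header -> sequence) pairs. */
final class FastaSketch {

    static Map<String, String> parse(String fastaText) {
        Map<String, String> records = new LinkedHashMap<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        for (String rawLine : fastaText.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (line.charAt(0) == '>') {
                // A new header closes the previous record, if any.
                if (header != null) {
                    records.put(header, sequence.toString());
                }
                header = line.substring(1).trim();
                sequence.setLength(0);
            } else if (header != null) {
                // Sequence lines of one record are concatenated, in upper case.
                sequence.append(line.toUpperCase(Locale.ROOT));
            }
        }
        if (header != null) {
            records.put(header, sequence.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String demo = ">seq1 demo\nAAATCC\ntcgtta\n>seq2\nGGGAC\n";
        parse(demo).forEach((name, bases) -> System.out.println(name + " -> " + bases));
    }
}
```

Keeping the records in a `LinkedHashMap` preserves the order in which the sequences appear in the file, which matters when the order of the data set is checked afterwards.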

METHODS

Statistical Cumulative Sum:

Here we apply STACUMSUM, an integer representation of complete DNA sequences. It is a probabilistic process that expresses the complete sequence structure in numerical form. In our previous work [38, 39] we established DNA damage identification using ontological analysis. Once damage has been detected in a sequence, the next step is to identify the DNA sequences with structural breaks. The damage in a sequence can be expressed with the signal-plus-noise model used in signal processing. If Z is the set of integers, we relate the damage to the Signal to Noise Ratio (SNR) as:

$$P_t = q_t + \varepsilon_t, \qquad t \in \mathbb{Z}, \qquad (1)$$

where q_t is the signal or data set and ε_t is the damage or noise of the sequences.

By analogy with equation 1, the complete sequence can be described in a way that allows breaks to be measured. The structure becomes more complicated if successive nucleotides are not chosen independently and their probabilities depend on the preceding data. In the general case of this type, the choice depends only on the preceding nucleotide. The statistical structure can then be described by a set of transition probabilities Pr_I(J), the probability that nucleotide I is followed by nucleotide J, where I and J range over all possible nucleotides. An equivalent description uses the digram probabilities Pr(I,J) (Figure 1). The nucleotide probabilities Pr(I), the transition probabilities Pr_I(J) and the digram probabilities Pr(I,J) are related by equation 2:

$$
\begin{aligned}
\Pr(I) &= \sum_{J} \Pr(I,J) = \sum_{J} \Pr(J,I) = \sum_{J} \Pr(J)\,\mathrm{Pr}_{J}(I),\\
\Pr(I,J) &= \Pr(I)\,\mathrm{Pr}_{I}(J),\\
\sum_{J} \mathrm{Pr}_{I}(J) &= \sum_{I} \Pr(I) = \sum_{I,J} \Pr(I,J) = 1. \qquad (2)
\end{aligned}
$$

The full probability sets (Tables 1 and 2) show the probable formation of the data set. A sequence built according to the probabilistic values from Tables 1 and 2 is:

AAA TCC TCG TTA TTT TTG TAG TAC TCG GCT GGG GAC GGA AGC AGT TGG AGC AGT TCC CCT CGC CTA TAC TTG.

These tables are used to select random nucleotide base pairs and thus to generate sequences automatically for our tool. Table 1 is based on iteration over the J values only, i.e. the I values remain constant; in terms of the DNA double strand, the nucleotides of the second strand remain unchanged for the complete genome. Table 2 reflects the case in which both the I and J values change for the double-strand data set. This consideration makes it possible to assess the data set more precisely and faster. If the random probabilistic values on A, G, T, C are 0.1, 0.1, 0.1, 0.3, 0.4, 0.6, we obtain the above sequence. In the case of a structural break, the above sequence is interrupted and a different signal is generated. We later check for the structural break using the prior and posterior probabilistic systems.
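The way a transition table drives sequence generation can be sketched as follows. This is an illustrative sketch only, not the authors' generator: the class `TransitionSampler` and its methods are hypothetical names, the matrix copies the Table 1 values with rows as the current base i and columns as the next base j (order A, G, C, T), and the start base and random seed are arbitrary assumptions. It does not attempt to reproduce the example sequence printed above.

```java
import java.util.Random;

/** Hypothetical sketch: samples a nucleotide chain from a transition table. */
final class TransitionSampler {

    static final char[] BASES = {'A', 'G', 'C', 'T'};

    // Rows: current base i; columns: next base j (values copied from Table 1).
    static final double[][] TABLE_1 = {
        {0.0, 0.3, 0.4, 0.3},
        {0.2, 0.5, 0.0, 0.3},
        {0.0, 0.6, 0.1, 0.3},
        {0.1, 0.4, 0.2, 0.3},
    };

    /** Draws the index of the next base by inverting the cumulative row distribution. */
    static int nextIndex(int current, double[][] p, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int j = 0; j < p[current].length; j++) {
            cumulative += p[current][j];
            if (u < cumulative) {
                return j;
            }
        }
        // Rounding guard: fall back to the last base with non-zero probability.
        for (int j = p[current].length - 1; j >= 0; j--) {
            if (p[current][j] > 0.0) {
                return j;
            }
        }
        throw new IllegalArgumentException("row has no positive probability");
    }

    /** Generates `length` bases, starting from the base at index `start`. */
    static String generate(int length, int start, double[][] p, Random rng) {
        StringBuilder out = new StringBuilder(length);
        int state = start;
        for (int k = 0; k < length; k++) {
            out.append(BASES[state]);
            state = nextIndex(state, p, rng);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate(60, 0, TABLE_1, new Random(42)));
    }
}
```

Because a zero entry never widens the cumulative interval, transitions with probability 0 in the table (for example A followed by A) cannot occur in the generated chain.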

Structural Break Prior Mean:

For a countably infinite set of nucleotides in DNA sequences, consider n observations P_1, P_2, ..., P_n of the real-valued stochastic process in equation 1, with the null hypothesis of a fixed prior mean, H_0: µ_1 = ... = µ_n. The hypothesis does not always hold, so we define the discrete Statistical Cumulative Sum on [0,1] as

$$Z_n(X) = \frac{1}{\sqrt{n}}\left(\sum_{t=1}^{\lfloor nX \rfloor} P_t - \frac{\lfloor nX \rfloor}{n}\sum_{t=1}^{n} P_t\right), \qquad X \in [0,1]. \qquad (3)$$

The Functional Central Limit Theorem (FCLT) cannot be applied to equation 3 directly; it is applied to the standardized partial sum process

$$S_n(X) = \frac{1}{\omega\sqrt{n}}\sum_{t=1}^{\lfloor nX \rfloor} \varepsilon_t, \qquad X \in [0,1]. \qquad (4)$$

The hypothesis H_0 is evaluated at the test arguments X = k/n, where the STACUMSUM process Z_n(X) compares the mean of the first k observations with the global mean of all observations. Since the timing of a structural break in a DNA sequence is unknown, we check all possible choices k ∈ {1, 2, ..., n} and take the maximum over them:

$$M_n = \frac{1}{\omega}\max_{1 \le k \le n}\left|Z_n\!\left(\frac{k}{n}\right)\right|, \qquad (5)$$

where ω > 0 is the scaling parameter of the nucleotide sequences in any DNA segment.
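A cumulative-sum scan of this kind can be sketched in a few lines. The sketch below assumes the usual CUSUM form Z_n(k/n) = (S_k - (k/n) S_n) / sqrt(n), where S_k is the sum of the first k observations, and takes the maximum absolute value divided by ω as the test statistic. The class `CusumSketch`, its methods, and the numeric encoding of bases (1 for G or C, 0 otherwise) are hypothetical choices made for illustration; the source does not state how nucleotides are mapped to numbers.

```java
/** Hypothetical sketch of a CUSUM scan for a single shift in the mean. */
final class CusumSketch {

    /** Illustrative encoding (an assumption): 1.0 for G or C, 0.0 for any other base. */
    static double[] encodeGc(String bases) {
        double[] p = new double[bases.length()];
        for (int i = 0; i < p.length; i++) {
            char b = bases.charAt(i);
            p[i] = (b == 'G' || b == 'C') ? 1.0 : 0.0;
        }
        return p;
    }

    /** Returns Z_n(k/n) for k = 1..n, stored at index k - 1. */
    static double[] cusum(double[] p) {
        int n = p.length;
        double total = 0.0;
        for (double v : p) {
            total += v;
        }
        double[] z = new double[n];
        double partial = 0.0;
        for (int k = 1; k <= n; k++) {
            partial += p[k - 1];
            z[k - 1] = (partial - (double) k / n * total) / Math.sqrt(n);
        }
        return z;
    }

    /** 1-based position k that maximises |Z_n(k/n)|: the candidate break point. */
    static int argMaxAbs(double[] z) {
        int best = 0;
        for (int i = 1; i < z.length; i++) {
            if (Math.abs(z[i]) > Math.abs(z[best])) {
                best = i;
            }
        }
        return best + 1;
    }

    /** Test statistic: the maximum of |Z_n(k/n)| divided by the scaling parameter omega. */
    static double statistic(double[] z, double omega) {
        return Math.abs(z[argMaxAbs(z) - 1]) / omega;
    }

    public static void main(String[] args) {
        double[] z = cusum(encodeGc("AAAAGGGG"));
        System.out.println("candidate break after position " + argMaxAbs(z));
    }
}
```

For the toy input `AAAAGGGG` the encoded mean shifts from 0 to 1 after the fourth base, and the scan places the candidate break at k = 4. In practice ω would have to be estimated from the data, and the statistic compared with a critical value before a break is declared.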

Structural Break Posterior Mean:

The structural break in the unconditional mean described above for the prior mean can also be measured in a conditional setting, so that a linear regression model is imposed on the nucleotide data set. To do so, the break is measured on multidimensional covariate data, here the DNA segments. If A, G, C, T, A, G, C, T, ... is a countably infinite sequence of nucleotide data partitioned into mutually exclusive sets G_1, G_2, ..., then for an event A

$$\Pr(A) = \sum_{i=1}^{\infty}\Pr(A \cap G_i) = \sum_{i=1}^{\infty}\Pr(A \mid G_i)\,\Pr(G_i).$$

Let A, C and T be nucleotides, where A and C are conditionally independent given G. The conditional independence for the structural break is then

$$\Pr(A \cap C \mid G) = \Pr(A \mid G)\,\Pr(C \mid G).$$

Let T be a random nucleotide and Q a DNA data segment with Pr(Q) > 0. The posterior distribution of T given Q is

$$F_{T\mid Q}(t \mid Q) = \Pr(T \le t \mid Q) = \frac{\Pr(\{T \le t\} \cap Q)}{\Pr(Q)}.$$

When T is discrete, the posterior probability mass function of T given the DNA segment Q and the conditional expectation of T given Q are

$$p_{T\mid Q}(t \mid Q) = \Pr\{T = t \mid Q\} = \frac{\Pr(\{T = t\} \cap Q)}{\Pr(Q)}, \qquad E[T \mid Q] = \sum_{t} t\,p_{T\mid Q}(t \mid Q).$$

If T is continuous, the expectation becomes

$$E[T \mid Q] = \int_{t} t\,dF_{T\mid Q}(t \mid Q).$$

For two nucleotides T and C under a discrete structural break, the conditional probability of T given C = c and the conditional expectation of T given C = c are

$$p_{T\mid C}(t \mid c) = \frac{\Pr\{T = t,\,C = c\}}{\Pr\{C = c\}} = \frac{p_{T,C}(t,c)}{p_{C}(c)}, \qquad E[T \mid C = c] = \sum_{t} t\,p_{T\mid C}(t \mid c).$$

In the continuous case the same quantities are

$$f_{T\mid C}(t \mid c) = \frac{f_{T,C}(t,c)}{f_{C}(c)}, \qquad E[T \mid C = c] = \int_{t} t\,f_{T\mid C}(t \mid c)\,dt.$$

To obtain the expectations over the whole DNA sequence data set, iterated expectations are needed. For these two nucleotides the iterated expectation is

$$
\begin{aligned}
E\big[E[T \mid C]\big] &= \int_{c} E[T \mid C = c]\,f_{C}(c)\,dc\\
&= \int_{c}\left(\int_{t} t\,f_{T\mid C}(t \mid c)\,dt\right) f_{C}(c)\,dc\\
&= \int_{t} t \int_{c} f_{T,C}(t,c)\,dc\,dt\\
&= \int_{t} t\,f_{T}(t)\,dt\\
&= E[T].
\end{aligned}
$$

For random nucleotides the overall expectation is obtained by summing over the data set. Suppose N is a positive integer-valued count of DNA segments and T_1, T_2, ... are the corresponding nucleotide values, identically distributed and independent of N. Then the total expectation given N = n is

$$E\Big[\sum_{i=1}^{N} T_i \,\Big|\, N = n\Big] = E\Big[\sum_{i=1}^{n} T_i \,\Big|\, N = n\Big] = E\Big[\sum_{i=1}^{n} T_i\Big] = \sum_{i=1}^{n} E[T_i] = n\,E[T].$$

The posterior probability derived here for Thymine (T) has the same form for the remaining nucleotides Adenine (A), Guanine (G) and Cytosine (C).
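The discrete versions of these identities can be checked numerically. The following sketch is only an illustration under stated assumptions: the class `IteratedExpectationSketch`, the numeric values assigned to T, and the small joint probability table used in `main` are all made up for the example and are not taken from the paper's data.

```java
/** Hypothetical sketch: conditional and iterated expectations for a discrete joint pmf. */
final class IteratedExpectationSketch {

    /** Marginal p_C(c): sum over t of the joint probabilities joint[t][c]. */
    static double marginalC(double[][] joint, int c) {
        double sum = 0.0;
        for (double[] row : joint) {
            sum += row[c];
        }
        return sum;
    }

    /** E[T | C = c] = sum over t of t * p_{T,C}(t, c) / p_C(c); requires p_C(c) > 0. */
    static double conditionalExpectation(double[] tValues, double[][] joint, int c) {
        double pc = marginalC(joint, c);
        double sum = 0.0;
        for (int t = 0; t < tValues.length; t++) {
            sum += tValues[t] * joint[t][c] / pc;
        }
        return sum;
    }

    /** E[E[T | C]] = sum over c of E[T | C = c] * p_C(c), skipping values of c with zero mass. */
    static double iteratedExpectation(double[] tValues, double[][] joint) {
        double sum = 0.0;
        for (int c = 0; c < joint[0].length; c++) {
            double pc = marginalC(joint, c);
            if (pc > 0.0) {
                sum += conditionalExpectation(tValues, joint, c) * pc;
            }
        }
        return sum;
    }

    /** E[T] computed directly from the joint table. */
    static double expectation(double[] tValues, double[][] joint) {
        double sum = 0.0;
        for (int t = 0; t < tValues.length; t++) {
            for (double p : joint[t]) {
                sum += tValues[t] * p;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] tValues = {0.0, 1.0};                  // made-up numeric values of T
        double[][] joint = {{0.1, 0.3}, {0.2, 0.4}};    // made-up joint pmf, rows t, columns c
        System.out.println(iteratedExpectation(tValues, joint));
        System.out.println(expectation(tValues, joint));
    }
}
```

Both printed values agree up to floating-point rounding, which is the discrete counterpart of the chain of equalities ending in E[T].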

Estimation of Maximum Likelihood:

To select a model for handling an unknown nucleotide set or unknown parameters, we define the probability of a set of observations for any DNA segment under given conditions. The outcomes are measured on real data, and the measured observations are used to choose the parameter values that are most likely to have generated the observed results. This procedure, called Maximum Likelihood Estimation (MLE), has provided consistent and efficient results in structural break detection [40, 41]. For a nucleotide data set with only a few structural breaks the estimate is immediate: if 5 of 10 DNA segments are defective, 50% is defective. For large and uncertain data sets, however, an established formula is needed. For a nucleotide data set of length n with M structural breaks and break probability θ,

$$P_r = \frac{n!}{M!\,(n-M)!}\,\theta^{M}(1-\theta)^{n-M}$$

is the probability of M structural breaks in the total data set. In short, the likelihood is defined as LH(parameter | data) = P(data | parameter). Taking logarithms gives the log-likelihood:

$$\log LH(\theta) = \log\binom{n}{M} + M\log\theta + (n-M)\log(1-\theta). \qquad (5)$$

For many parameters we use a Taylor series expansion of the log-likelihood around the maximum likelihood estimate $\hat\theta$:

$$\log LH(\theta) = \log LH(\hat\theta) + (\theta-\hat\theta)\,\frac{\partial \log LH(\hat\theta)}{\partial\theta} + 0.5\,(\theta-\hat\theta)^{2}\,\frac{\partial^{2} \log LH(\hat\theta)}{\partial\theta^{2}} + O(1),$$

with score and information

$$\frac{\partial \log LH(\hat\theta)}{\partial\theta} = SH(\hat\theta) = 0, \qquad -\frac{\partial^{2} \log LH(\hat\theta)}{\partial\theta^{2}} = IH(\hat\theta).$$

So,

$$\log LH(\theta) \approx \log LH(\hat\theta) - 0.5\,(\theta-\hat\theta)\,IH(\hat\theta)\,(\theta-\hat\theta),$$

$$\log LH(\hat\theta) - \log LH(\theta_r) \approx 0.5\,(\theta_r-\hat\theta)\,IH(\hat\theta)\,(\theta_r-\hat\theta),$$

where θ_r is the parameter value at which the likelihood is compared. From the above, equation (5) yields the maximum value for any DNA sequence data set or genome segment, and the subsequent equations complete the Maximum Likelihood Estimation procedure.
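The worked case in this section (5 defective segments out of 10) can be reproduced with a short sketch of the binomial log-likelihood. The sketch assumes the standard binomial model, in which the log-likelihood is log C(n, m) + m log θ + (n - m) log(1 - θ) and is maximised at θ = m/n; the class `BinomialMleSketch` and its method names are hypothetical.

```java
/** Hypothetical sketch: binomial log-likelihood of m breaks in n positions. */
final class BinomialMleSketch {

    /** log of the binomial coefficient C(n, m), accumulated as a sum of logs. */
    static double logChoose(int n, int m) {
        double sum = 0.0;
        for (int i = 1; i <= m; i++) {
            sum += Math.log((double) (n - m + i) / i);
        }
        return sum;
    }

    /** log LH(theta) for 0 < theta < 1. */
    static double logLikelihood(int n, int m, double theta) {
        return logChoose(n, m) + m * Math.log(theta) + (n - m) * Math.log(1.0 - theta);
    }

    /** Closed-form maximiser of the binomial likelihood: the observed break fraction. */
    static double mle(int n, int m) {
        return (double) m / n;
    }

    public static void main(String[] args) {
        int n = 10;
        int m = 5;
        double thetaHat = mle(n, m);
        System.out.println("theta-hat = " + thetaHat);
        System.out.println("log LH at theta-hat = " + logLikelihood(n, m, thetaHat));
        System.out.println("log LH at 0.4       = " + logLikelihood(n, m, 0.4));
    }
}
```

With n = 10 and m = 5 the estimate is 0.5, matching the 50% quoted in the text, and the log-likelihood at any other θ (0.4 in the example) is lower than at the estimate.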

Estimation of Minimum Geometric Distance:

The human genome has about 3.1 billion base pairs. To maintain hardware flexibility and memory efficiency, it is important to know the minimum distance between any two consecutive base pairs or structural breaks. For hardware support, we have to determine a suitable geometrical arrangement of consecutive DNA base pairs. In the molecular data set of DNA bases A, T, G and C, non-overlapping data (data without breaks) must be used. Each base has to move half of the bond in a double-bonded DNA sequence. In addition, all base pairs should be synchronized so that the comparisons for breaks start at the same time. Consider a synchronous shared-memory machine in which a set of g processors can access a set of h memory modules in parallel. All DNA sequence data are connected to the memory modules through a switching network to maintain smooth memory access and retrieval. Such a machine is called a Distributed Memory Machine (DMM). The geometrical orientation of the nucleotide bases for parallel and regular memory is illustrated in Figure 2. Let S_th be the threshold range between two consecutive nucleotides, π the angular distance of the memory location, D_r the radial distance of the memory modules, C = 2πr the circumference of a circle of radius r, and θ the angle between arc length and radius. For an angle in radians (Figure 3), θ = C/r, with C the distance covered by the arc. The circumference of the memory module for the remaining part of the angle (π-θ) will be

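$$C = 2(\pi - \theta)\,S_{th}$$

and

$$\cos\theta = \frac{\text{Base}}{\text{Hypotenuse}} = \frac{D_r}{2S_{th}}, \qquad \theta = \cos^{-1}\!\left(\frac{D_r}{2S_{th}}\right) \in \left(\frac{\pi}{3}, \frac{\pi}{2}\right).$$

If θ = π/3 (60°), the arc length for this angle will be

$$C = S_{th}\,\theta = \frac{\pi S_{th}}{3}.$$

So the minimum distance for two consecutive nucleotides will be as follows:

$$\text{Minimum Distance} = \frac{\text{Circumference}}{\text{Arc length}} = \frac{C}{\pi S_{th}/3} = \frac{2(\pi-\theta)\,S_{th}}{\pi S_{th}/3} = \frac{6(\pi-\theta)\,S_{th}}{\pi S_{th}}.$$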

Implementation:

We implemented and tested the system in Java using the NetBeans Integrated Development Environment (IDE). The object-oriented implementation allowed us to treat each nucleotide (A, C, T and G) as a distinct object. In our previous work [38, 39] we improved on the performance of [42] and observed that our RSAM algorithm is significantly better in terms of speed, complexity, space, sensitivity, accuracy and risk. Here we have integrated these methods into a single system. The system and its analysis distinguish it from other systems. We have designed it as an automated system (Figure 4).
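One way to treat each nucleotide as a distinct object in Java is an enum, sketched below. This is an assumption about how such a design could look, not the authors' class layout: the type `Nucleotide` and its methods are hypothetical, and enums require Java 5 or later. The complement mapping uses the standard Watson-Crick pairs (A with T, C with G).

```java
/** Hypothetical sketch: the four DNA bases modelled as distinct objects. */
enum Nucleotide {
    A("Adenine"), C("Cytosine"), G("Guanine"), T("Thymine");

    private final String fullName;

    Nucleotide(String fullName) {
        this.fullName = fullName;
    }

    String fullName() {
        return fullName;
    }

    /** Watson-Crick partner on the opposite strand. */
    Nucleotide complement() {
        switch (this) {
            case A: return T;
            case T: return A;
            case C: return G;
            default: return C; // remaining case: G pairs with C
        }
    }

    /** Parses a one-letter code; throws IllegalArgumentException for any other symbol. */
    static Nucleotide fromChar(char symbol) {
        return valueOf(String.valueOf(Character.toUpperCase(symbol)));
    }
}
```

A call such as `Nucleotide.fromChar('g').complement()` yields `Nucleotide.C`, and an unexpected symbol in the input fails fast instead of silently entering the analysis.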

RESULTS AND DISCUSSION

The output of the system is the time it consumes for data sets of various lengths, from 160MB to 1650MB. Owing to its algorithmic design and analysis, the system takes less time to process nucleotide base pairs (Figure 5). As an existing tool for comparison, we ran the same data sets on DAMBE5: A Comprehensive Software Package for Data Analysis in Molecular Biology and Evolution [43]. The results for DAMBE5 (Figure 6) show that it takes more time for every data set. The comparison between the two systems (Figure 7) makes the difference in efficiency clear.

We have compared our system with DAMBE5 [43]. Although DAMBE5 is useful for sequence retrieval, motif characterization, codon adaptation index, molecular phylogenetics, sequencing, etc., it provides no algorithmic comparison of time and space complexity. Our system focuses on time and space complexity. To handle large data sets it is imperative that a system be fast and take little space. For the same data set of DNA nucleotide bases, our system performs significantly better than DAMBE5 because it uses less formatting and fewer structures (Figures 6 and 7). Although our system supports fewer formats and structures, it occupies much less space and takes much less time. Low time consumption is essential for many systems and environments, and algorithmic development is key to improving the performance of computing devices.

DNA Sequence Breaks Comparisons:

We developed the system on the basis of Maximum Likelihood, posterior mean and prior mean. It uses primary and secondary data sets of DNA segments collected from the NCBI database (http://www.ncbi.nlm.nih.gov), on which it works effectively. DAMBE, on the other hand, takes input for the Vienna RNA Secondary Structure library [44] to determine the secondary structure of RNA and compare minimum folding energies.

In both cases, DAMBE requires the data set to be in specific formats. Because of its algorithmic foundation, our automated system works with platform-independent structures, lengths and formats. This independence comes from the Java Development Kit (JDK) environment. The system is written in an Object Oriented Programming (OOP) style, in contrast to fixed-format languages. The results of the proposed system (Figure 5) show the outcome of our algorithmic process. The results of DAMBE5 (Figure 6) show that it takes more time than our system for the same data set.
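The relative improvement quoted in the Figure 6 legend can be recomputed with a one-line formula. The sketch below is illustrative: the class `SpeedupSketch` and the method `percentFaster` are hypothetical names, and the inputs are the values that appear in the legend's own formula (14.88 ns for DAMBE5 and 11.2 ns for the proposed system).

```java
/** Hypothetical sketch: relative time saving of a candidate over a baseline, in percent. */
final class SpeedupSketch {

    static double percentFaster(double baselineTime, double candidateTime) {
        return (baselineTime - candidateTime) / baselineTime * 100.0;
    }

    public static void main(String[] args) {
        // Values taken from the formula in the Figure 6 legend.
        System.out.printf("%.2f%%%n", percentFaster(14.88, 11.2));
    }
}
```

Note that the subtraction has to be carried out before the division, i.e. (14.88 - 11.2) / 14.88 × 100, which gives about 24.73%.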

DAMBE5 supports multiple formats for DNA segment structures [45]. For the same DNA segments and lengths, DAMBE5 produces different results from our system (Figures 5 and 6). We investigated the reasons for the longer running time of DAMBE5 and found that the compiler takes considerable time to compile the structures of the DNA segments.

The main differences between our system and DAMBE5 (Figure 7) are in compilation

time, size of the generated objects, dynamic memory usage during compilation and template

instantiation time.

Our proposed system takes less compilation time. This metric is determined from the link time, which affects the productivity of the system. DAMBE5 uses shell timing, which distinguishes between user time and system time, but the difference is not meaningful to users. In Java, incremental compilation is not supported below the granularity of a whole module. Our system consumes only CPU-intensive time, which is why it performs better than the other tools.

Because of the excessive data set in DAMBE5, it generates excessively large pre-compiled objects for the compiler, so the system time becomes higher. Duplication of type and naming information in the assembly and symbolic debugging data is another limitation of that system.

Our system is written in Java under the Java Development Kit (JDK 1.4) and outperforms DAMBE5 for several reasons. DAMBE5 and most current bioinformatics tools are based on scripting languages. Scripting languages such as Perl, Python, Rexx and Tcl are useful for various reasons: they are dynamic, well suited to rapid development and highly portable. However, they are often less efficient on large-scale data sets. Not all scripting languages are object oriented, and such languages do not provide a strong environment for variables and functions. This drawback makes them less suitable for huge, modular applications that handle billions of DNA data items. Last but not least, being fully interpreted, these languages are quite slow.

By contrast, Java, in which our system is written, offers automatic garbage collection and memory allocation. In addition, the Java Virtual Machine (JVM) protects data automatically. Java source code is first compiled into Java bytecode. Java also provides more design facilities, such as interfaces, abstract classes and several levels of access control. Consequently, our system offers benefits as a cross-platform desktop application.

CONCLUSION

Our analysis provides a probabilistic environment that automatically determines the positions of damage or break points in a given input. The system can handle data sets of up to 1650MB of base pairs. The pivotal finding is that it can measure the geometrical distances among nucleotide data, which ensures the memory efficiency of the computing system. In addition, Maximum Likelihood Estimation makes it possible to obtain similar results from irregular data sets or sequences, and the measurement of minimum geometrical distances reduces the space required. The approach can detect affected DNA sequences in which breaks or damage have occurred due to disease or other causes. In our comparison, the performance of the system is significantly better than that of an existing tool, DAMBE5. In future, our approach can be applied to test very long DNA and protein data sets.

AVAILABILITY

The method is implemented in Java. The tool is free for academic users, and a version is available from the author upon e-mail request.

ACKNOWLEDGEMENT

We thank CUET and the University of Tasmania, Australia.

CONFLICT OF INTEREST

We disclose no conflict of interest.

REFERENCES:

1. Q. Lu, R. Lund, T. Lee, An MDL Approach to the Climate Segmentation Problem, The Annals of Applied Statistics 4, 299–319, 2010.

2. M. Robbins, C. Gallagher, R. Lund, A. Aue, Mean shift testing in correlated data. Journal of Time Series Analysis 32, 498–511, 2011.

3. M.W. Robbins, R.B. Lund, C.M. Gallagher, Q. Lu, Changepoints in the North Atlantic tropical cyclone record. Journal of the American Statistical Association 106, 89–99, 2011.

4. S. Ahmad, S. Duke, R. Jena, M. Williams, N.G. Burnet, Advances in radiotherapy. BMJ 345, 33–38, 2012.

5. S. Delaney, S. Jacob, D. Zerbino, E. Birney, Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5):821–829, 2008.

6. G. Schwarz, Estimating the dimension of a model. The Annals of Statistics 6, 461–64, 1978.

7. T. Lindahl, D.E. Barnes, Repair of endogenous DNA damage. Cold Spring Harb Symp Quant Biol 65: 127–133, 2000.

8. E.C. Friedberg, G.C. Walker, W. Siede, R.D. Wood, R.A. Schultz, T. Ellenberger, DNA repair and mutagenesis, 2nd ed. ASM Press, New York, 2006.

9. S.P. Jackson, J. Bartek, The DNA-damage response in human biology and disease, Nature 461, 1071–1078, 2009.

10. A. Ciccia, S.J. Elledge, The DNA damage response: making it safe to play with knives, Mol Cell 40: 179–204, 2010.

11. M.P. Longhese, D. Bonetti, I. Guerini, N. Manfrini, M. Clerici, DNA double-strand breaks in meiosis: checking their formation, processing and repair, DNA Repair (Amst) 8: 1127–1138, 2009.

12. A.G. Tsai, M.R. Lieber, Mechanisms of chromosomal rearrangement in the human genome. BMC Genomics 11: S1. doi: 10.1186/1471-2164-11-S1-S1, 2010.

13. J.W. Harper, S.J. Elledge, The DNA damage response: ten years after. Mol Cell 28: 739–745, 2007.

14. J. Rouse, S.P. Jackson, Interfaces between the detection, signaling, and repair of DNA damage. Science 297: 547–551, 2002.

15. J.C. Harrison, J.E. Haber, Surviving the Breakup: The DNA Damage Checkpoint, Annu Rev Genet 40: 209–235, 2006.

16. V. Altmannova, N. Eckert-Boulet, M. Arneric, P. Kolesar, R. Chaloupkova, J. Damborsky, P. Sung, X. Zhao, M. Lisby, L. Krejci, Rad52 SUMOylation affects the efficiency of the DNA repair. Nucleic Acids Res 38: 4708–4721, 2010.

17. M.R. Lieber, The mechanism of human nonhomologous DNA end joining. J Biol Chem 283: 1–5, 2008.

18. K.A. Cimprich, D. Cortez, ATR: an essential regulator of genome integrity, Nat Rev Mol Cell Biol, 2008.

19. M.B. Kastan, J. Bartek, Cell-cycle checkpoints and cancer. Nature 432: 316–323, 2004.

20. J. Bartek, J. Lukas, DNA damage checkpoints: from initiation to recovery or adaptation. Curr Opin Cell Biol 19: 238–245, 2007.

21. S. Munoz-Galvan, A. Lopez-Saavedra, S.P. Jackson, P. Huertas, F. Cortes-Ledesma, et al., Competing roles of DNA end resection and non-homologous end joining functions in the repair of replication-born double-strand breaks by sister-chromatid recombination. Nucleic Acids Res 41: 1669–1683, 2013.

22. A. Xiao, et al., WSTF regulates the H2A.X DNA damage response via a novel tyrosine kinase activity. Nature 457: 57–62, 2009.

23. P. Huertas, DNA resection in eukaryotes: deciding how to fix the break, Nat Struct Mol Biol 17: 11–16, 2010.

24. C. Richardson, N. Horikoshi, T.K. Pandita, The role of the DNA double-strand break response network in meiosis, DNA Repair 3: 1149–1164, 2004.

25. M. O'Driscoll, P.A. Jeggo, The role of double-strand break repair – insights from human genetics. Nat Rev Genet 7: 45–54, 2006.

26. M. McVey, S.E. Lee, MMEJ repair of double-strand breaks (director's cut): deleted sequences and alternative endings. Trends Genet 24: 529–538, 2008.

27. J.R. Chapman, P. Barral, J.B. Vannier, V. Borel, M. Steger, et al., RIF1 is essential for 53BP1-dependent nonhomologous end joining and suppression of DNA double-strand break resection. Mol Cell 49: 858–871, 2013.

28. J. Fishman-Lobell, N. Rudin, J.E. Haber, Two alternative pathways of double-strand break repair that are kinetically separable and independently modulated, Mol Cell Biol 12: 1292–1303, 1992.

29. A. Ciccia, S.J. Elledge, The DNA damage response: making it safe to play with knives. Mol Cell 40: 179–204, 2010.

30. K.A. Bernstein, S. Gangloff, R. Rothstein, The RecQ DNA helicases in DNA repair. Annu Rev Genet 44: 393–417, 2010.

31. P. Bork, K. Hofmann, P. Bucher, A.F. Neuwald, S.F. Altschul, E.V. Koonin, A superfamily of conserved domains in DNA damage-responsive cell cycle checkpoint proteins. FASEB J 11: 68–76, 1997.

32. K.W. Caldecott, Single-strand break repair and genetic disease, Nat Rev Genet 9: 619–631, 2008.

33. D.M. Chou, B. Adamson, N.E. Dephoure, X. Tan, A.C. Nottke, K.E. Hurov, S.P. Gygi, M.P. Colaiacovo, S.J. Elledge, A chromatin localization screen reveals poly (ADP ribose)-regulated recruitment of the repressive polycomb and NuRD complexes to sites of DNA damage. Proc Natl Acad Sci 107: 18475–18480, 2010.

34. V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing. Benjamin/Cummings Publ. Company, 1995.

35. V. Kundeti, S. R. S, H. Dinh, M. Vaughn, V. Thapar, Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs. BMC Bioinformatics 11: 560, 2010.

36. E. Mardis, Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9: 387–402, 2008.

37. T. Lindahl, D.E. Barnes, Repair of endogenous DNA damage, Cold Spring Harb Symp Quant Biol 65: 127–133, 2000.

38. M.I. Khan, M.S. Kamal, Sequencing Ontology Alignment for DNA Annotation and Damage Identification, European Journal of Scientific Research, Volume 103, Issue 3, pp. 441-450, 2013.

39. M.I. Khan, M.S. Kamal, RSAM: An Integrated Algorithm for Local Sequence Alignment. Archives Des Sciences, Vol 66, No. 5, ISSN 1661-464X, pp. 395-412, 2013.

40. J. Antoch, M. Huskova, Z. Praskova, Effect of dependence on statistics for determination of change. Journal of Statistical Planning and Inference 60, 291-310, 1997.

41. J. Bai, Least squares estimation of a shift in linear processes. Journal of Time Series Analysis 15, 453-472, 1994.

42. H. Waqaar, A. Alex, R. Bharath, An Efficient Algorithm for Local Sequence Alignment, 20-24, 2008.

43. X. Xia, DAMBE5: A comprehensive software package for data analysis in molecular biology and evolution. Molecular Biology and Evolution 30: 1720-1728, 2013.

44. I.L. Hofacker, Vienna RNA secondary structure server. Nucleic Acids Res. 31: 3429–3431, 2003.

45. R.A. Vos, J.P. Balhoff, J.A. Caravas, et al., NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Syst Biol. 61: 675–689, 2012.

Tables:

Table 1: Probabilistic values that determine the formation of the random data set in whole DNA sequences. Rows (i) give the values for the first strand and columns (j) give the values for the remaining strand. The table is based on iteration over the J values only, i.e. the I values remain constant; in terms of the DNA double strand, the nucleotides of the second strand remain unchanged for the complete genome. Entries are Pr_I(J).

i \ j    A      G      C      T
A        0      0.3    0.4    0.3
G        0.2    0.5    0      0.3
C        0      0.6    0.1    0.3
T        0.1    0.4    0.2    0.3

Table 2: Probabilistic table for randomness. It differs from Table 1 in only one respect: here both the I and J values change for the double-strand data set. This consideration makes it possible to assess the data set more precisely and faster. Entries are Pr(I,J).

i \ j    A      G      C      T
A        0      0.2    0.3    0.5
G        0.2    0.2    0.3    0.3
C        0.2    0.6    0.1    0
T        0.1    0.2    0.2    0.5

Figure Legends:

Figure 1: Scatter Probabilistic Diagram. Propagation of the random data set in genome sequences, based on the probabilistic values in Table 1. The sequence is formed according to these probabilities, as shown in the figure. The initial probability is 0.5 for each nucleotide data set, and the probabilities may change in subsequent propagation.

Figure 2: Nucleotide Interaction. The geometric orientation of the nucleotides in the database and in physical memory. The orientations of A (Adenine), G (Guanine), C (Cytosine) and T (Thymine) are drawn in circular fashion. The distance between two consecutive nucleotide base pairs is πSth/3, where S is the circumference of the considered circle.

Figure 3: Geometric Formation. The radian angle that determines how much angular movement or distance is covered by the nucleotide data set. Here r is the radius, C is the distance covered by the arc, and θ is the angle between arc and radius. By the definition of the radian, θ = C/r.

Figure 4: An interface of our proposed system. The Browse button selects the file that holds the data set. The Show Result button generates the results. In this example we found seven damages within a double-strand DNA sequence; the exact locations of the damages or breaks are also reported. The Change File button allows the data set to be changed.

Figure 5: Platform Independent Automated System for Structural Break Detection (ASSBD) in DNA sequence alignment. Base pair lengths are on the X axis and the time required by the system is on the Y axis. The DNA data sets range from 150MB bp to 1650MB bp. The smooth line shows linearity through the origin. The Automated System [Platform Independent] is almost linear. For data sets over 75MB bp, the system lies slightly below the linear line because of the uncertain prior mean.

Figure 6: DAMBE5 experimental result for structural break detection. The same data set is used for both the Platform Independent Automated System and DAMBE5. At the starting point, for a data size of 150MB, the proposed system takes 1.2 ns and DAMBE5 takes 2 ns. In the second comparison, for the 250MB data set, our system takes 1.9 ns and DAMBE5 takes 3.1 ns. For all data sizes our system significantly outperforms DAMBE5. The last timing for the given data set is 11.1 ns for our system and 14.88 ns for DAMBE5. With (14.88 - 11.2)/14.88 × 100 = 24.73%, our system is about 24.73% faster than DAMBE5. As the sequence length increases, the performance difference also increases.

Figure 7: Comparison between our proposed system and DAMBE5. The dark blue line shows the timings for DAMBE5 and the dark red line shows the timings for the proposed system. The X axis shows the base pair length and the Y axis shows the corresponding time for that sequence length. The initial length of the data set is 150MB, for which DAMBE5 takes 2 ns and our system takes 1.2 ns. The timing difference is 2 - 1.2 = 0.80 ns for the first data set, 3.1 - 1.9 = 1.20 ns for the second, and 3.5 - 2.5 = 1 ns for the third (350MB). At the last stage the difference is 10.3 - 8.5 = 1.8 ns for 1650MB. The performance of DAMBE5 decreases as the data size increases because of its lack of robustness for large data sets.

