DNA Discontinuity Analysis: An Algorithmic System
Md. Sarwar Kamal1*, Sonia Farhana Nimmy2, Mohammad Ibrahim Khan1, Mohammad Shibli Kaysar1, Shuxiang Xu3
Affiliations: 1Department of Computer Science & Engineering, Chittagong University of Engineering and Technology (CUET), Chittagong, 4349, Bangladesh. 2BGC Trust University Bangladesh. 3University of Tasmania, Australia
* Corresponding Author: Md. Sarwar Kamal
Department of Computer Science & Engineering, Chittagong University of Engineering and
Technology (CUET), Chittagong, 4349, Bangladesh
Cell: 8801553315278
Keywords: DNA damage, STACUMSUM, Maximum Likelihood, Geometric Distance.
Running Title: Automated DNA-break detection
Abstract
DNA damage or breaks may change the characteristics of genomes and cause various diseases. In this work we construct a system that uses a Maximum Likelihood (ML) based probabilistic formula to assess the number of damages that have occurred in a DNA sequence. The approach has been benchmarked on simulated data sets so that the outcomes can be compared with ground truth or reference values. First, the order of the sequence data set is checked with the Statistical Cumulative Sum (STACUMSUM). The verified sequences are then evaluated with prior and posterior probabilities to estimate the percentages of breaks and mutations. Maximum Likelihood Estimation (MLE) then determines the exact numbers and positions of the breaks. In database manipulation, one factor that decides the orientation and order of the sequences is the geometric distance between consecutive sequences. The geometric distance is measured to obtain a smooth representation of the genome or DNA sequences. Finally, we compared the performance of our system with DAMBE5 (A Comprehensive Software Package for Data Analysis in Molecular Biology and Evolution). With respect to time and space complexity, our Automated System for Structural Break Detection (ASSBD) is much faster and consumes much less space, which we attribute to our algorithmic approach.
INTRODUCTION
Recent developments in experimental biotechnology and the advent of next generation sequencing (NGS) technology have produced large volumes of sequence data. Genome sequences determine the regular features and normal condition of plants, animals and other organisms. However, external and internal factors may cause breaks or damage in DNA sequences. It is crucial to determine the break points in long sequences. Generally, structural break detection is phrased as a hypothesis test, and several recent works [1-3] have addressed it in this way. Structural breaks may arise for various reasons. One of the pivotal reasons is cancer therapy. Although chemotherapy is a popular treatment against cancer cells, it may damage the structure of regular DNA sequences [4-6].
Genome integrity is constantly challenged by DNA lesions, millions of which occur in each human cell within a year [7]. Most of these lesions arise from cell metabolism, DNA replication, radiation, exposure to toxic environmental chemicals and programmed DNA breakage in lymphocytes and germ cells [8-12]. Changes in DNA due to damage can have destructive results and can finally lead to mutations and chromosomal aberrations. A faulty response to breaks or damage contributes to various irregularities such as developmental defects, neurodegenerative disease and cancer [9], which underlines the need for a proper DNA Damage Response (DDR) in all living cells and for organism growth. DNA damage can occur in a single strand or in both strands. Here we designed a simulated data set with DNA single-strand breaks (SSBs) and double-strand breaks (DSBs) to study structural breaks.
Double-strand breaks are the most important form of DNA damage because they occupy the full length of the DNA sequences. DSBs are induced by ionizing radiation (IR) or radiomimetic drugs and also appear in cells treated with topoisomerase II inhibitors. The frequency of DNA double-strand breaks sometimes increases with changes in single-strand breaks [8, 9].
To address DNA damage, researchers in dry and wet labs have designed various tools [13-33]. One important factor for an efficient computing environment is adequate memory size and good utilization of memory space. In this regard we have estimated the geometric distance between two consecutive nucleotide bases. One standard way to reduce the complexity of binding nucleotide bases in memory is to use shared modules [34]. The processor of the system finds the structural breaks in round-robin fashion [35]. In the age of the information superhighway, high-quality DNA sequencing is a critical issue not only in computer science but also for biological and medical problems. Current DNA sequencing technologies [36] such as Illumina [http://www.illumina.com] and SOLiD [http://www.appliedbiosystems.com] have reduced the cost significantly, and hence sequence data are generated in large quantities every day.
The ultimate goal of every living organism is to pass its genetic material, flawless, unbroken and unscathed, to the next generation. This must be accomplished in spite of continual attack on the DNA by internal and environmental agents. To counter this danger, it is essential to detect DNA damage and structural breaks in sequences, signal their presence and mediate the repair of breaks and damage. These responses affect a large number of cellular processes and are biologically significant because they prevent diverse human diseases. Besides, it has been estimated that each of the approximately 10^13 cells in the human body receives tens of thousands of DNA lesions per day [37]. These lesions can block genome replication, which results in mutations in sequences. It would be very efficient and significant if this damage, these breaks and any kind of alteration could be identified with automated tools. At present, however, breaks can only be detected in the laboratory, which is expensive and time consuming. To reduce the time and cost of detecting breaks, damage or other changes in sequences, we have designed this automated tool.
In this study we have taken an in silico approach to design an automated tool that detects such DNA damage or breaks. First, we collected random DNA sequences in FASTA format. We apply the Statistical Cumulative Sum (STACUMSUM) to investigate the order of the collected data set, i.e. to check that it has the proper structure and format. If a data set has lost its order, our system corrects it and then computes the percentages of damage or breaks using the prior and posterior probabilities. The final analysis applies Maximum Likelihood Estimation to all estimated data sets to find the exact breaks and damage. To ensure proper use of memory, the geometric distance between two consecutive base pairs is measured, which guides the memory layout in both the database and the hardware. We have also compared the performance of this system and DAMBE5 in terms of time and space complexity. Owing to its reduced memory use, our system outperforms DAMBE5 on these measures.
METHODS
Statistical Cumulative Sum:
Here we apply STACUMSUM, an integer representation of complete DNA sequences. It is a probabilistic process that measures the complete sequence structure in numerical form. In our previous work [38, 39] we established DNA damage identification using ontological analysis. Once damage has been detected in a sequence, it is efficient to identify the DNA sequences with structural breaks. The damage in a sequence can be expressed with the signal-plus-noise model from signal processing. If Z is the set of integers, we define the damage in relation to the Signal to Noise Ratio (SNR) as:
$$P_t = q_t + \varepsilon_t, \qquad t \in \mathbb{Z} \tag{1}$$
where $q_t$ = signal or data set and $\varepsilon_t$ = damage or noise of the sequences.
The complete sequence can be defined analogously, using the concept of equation 1, to measure breaks in the sequences. The structure becomes more complicated if successive nucleotides are not chosen independently but with probabilities that depend on the preceding data. In the general case of this type, the selection depends only on the preceding nucleotide. The statistical structure can then be described by a set of transition probabilities $\Pr_I(J)$, the probability that nucleotide I is followed by nucleotide J, where I and J range over all possible nucleotides. An equivalent description uses the digram probabilities $\Pr(I,J)$, i.e. the relative frequency of the pair I J, illustrated in the diagram of Figure 1. The nucleotide frequencies $\Pr(I)$, the transition probabilities $\Pr_I(J)$ and the digram probabilities $\Pr(I,J)$ are related by equation 2:
$$\Pr(I) = \sum_{J} \Pr(I,J) = \sum_{J} \Pr(J,I) = \sum_{J} \Pr(J)\,{\Pr}_J(I)$$
$$\Pr(I,J) = \Pr(I)\,{\Pr}_I(J)$$
$$\sum_{J} {\Pr}_I(J) = \sum_{I} \Pr(I) = \sum_{I,J} \Pr(I,J) = 1 \tag{2}$$
The full probability sets (Tables 1 and 2) show the probable formation of the data set.
A sequence is built according to the probabilistic values from Tables 1 and 2:
AAA TCC TCG TTA TTT TTG TAG TAC TCG GCT GGG GAC GGA AGC AGT TGG AGC
AGT TCC CCT CGC CTA TAC TTG.
These tables allow random nucleotide bases to be selected so that sequences for our tool are generated automatically. Table 1 is based on iterating only over the J values, i.e. the I values remain constant. In terms of double-stranded DNA, the nucleotides of the second strand remain unchanged for the complete genome. In Table 2 both the I and J values change for the double-strand data set. This consideration allows the data set to be assessed more precisely and faster. If the random probabilistic values on A, G, T, C are 0.1, 0.1, 0.1, 0.3, 0.4, 0.6, then we obtain the above sequence. In the case of a structural break, the above sequence is interrupted and a different signal is generated. Later we check the structural break with the prior and posterior probabilistic systems.
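The generation step described above can be illustrated with a minimal Java sketch, assuming that each row of Table 1 is read as the transition probabilities Pr_I(J) from the current base I to the next base J (column order A, G, C, T). The class name TransitionSampler, its method names and the fixed seed are hypothetical and are not taken from the authors' tool; the sketch does not attempt to reproduce the exact sequence listed above.

```java
import java.util.Random;

// Hypothetical helper, not the authors' code: samples a nucleotide chain
// from first-order transition probabilities Pr_I(J).
class TransitionSampler {
    static final char[] BASES = {'A', 'G', 'C', 'T'};

    // Rows are the current base I, columns the next base J (order A, G, C, T).
    // Values copied from Table 1; every row sums to 1.
    static final double[][] TABLE_1 = {
        {0.0, 0.3, 0.4, 0.3},
        {0.2, 0.5, 0.0, 0.3},
        {0.0, 0.6, 0.1, 0.3},
        {0.1, 0.4, 0.2, 0.3}
    };

    // Draws the index of the next base from one row by inverse-CDF sampling.
    static int nextIndex(double[] row, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int j = 0; j < row.length; j++) {
            cumulative += row[j];
            if (u < cumulative) {
                return j;
            }
        }
        // Guard against rounding: fall back to the last base with non-zero probability.
        for (int j = row.length - 1; j >= 0; j--) {
            if (row[j] > 0.0) {
                return j;
            }
        }
        throw new IllegalArgumentException("row has no positive probability");
    }

    // Generates a sequence of the given length, starting from BASES[startIndex].
    static String generate(int length, int startIndex, double[][] table, long seed) {
        Random rng = new Random(seed);
        StringBuilder sb = new StringBuilder(length);
        int current = startIndex;
        for (int k = 0; k < length; k++) {
            sb.append(BASES[current]);
            current = nextIndex(table[current], rng);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 72 bases, i.e. 24 triplets, starting from A.
        System.out.println(generate(72, 0, TABLE_1, 42L));
    }
}
```

Because a transition with probability 0 can never be drawn, a sequence generated this way never contains the pairs AA, GC or CA under Table 1, which gives a simple sanity check on the output.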
Structural Break Prior Mean:
For countably infinite nucleotides in DNA sequences, consider n observations $P_1, P_2, \ldots, P_n$ of a real-valued stochastic process as in equation 1, with the null hypothesis of constant prior means $H_0: \mu_1 = \cdots = \mu_n$. The hypothesis is not always true. The discrete Statistical Cumulative Sum is therefore defined for $x \in [0,1]$ as
$$Z_n(x) = \frac{1}{\sqrt{n}} \left( \sum_{t=1}^{\lfloor nx \rfloor} P_t - \frac{\lfloor nx \rfloor}{n} \sum_{t=1}^{n} P_t \right), \qquad x \in [0,1] \tag{3}$$
According to the Functional Central Limit Theorem (FCLT), equation 2 cannot be imposed directly. In that case the standardized partial sum process is applied:
$$S_n(x) = \frac{1}{\sqrt{n}} \sum_{t=1}^{\lfloor nx \rfloor} \varepsilon_t, \qquad x \in [0,1] \tag{4}$$
The hypothesis $H_0$ is evaluated at the test argument $x = k/n$. The STACUMSUM process $Z_n(x)$ thus compares the sample mean of the first k observations with the global mean of all observations. Since the time of a structural break in a DNA sequence is unknown, we check all possible choices $k \in \{1, 2, \ldots, n\}$ and take the maximum over them:
$$M_n = \frac{1}{\omega} \max_{1 \le k \le n} \left| Z_n\!\left(\frac{k}{n}\right) \right| \tag{5}$$
where $\omega > 0$ is a scaling parameter of the nucleotide sequences in any DNA segment.
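As an illustration of the cumulative sum statistic, the following Java sketch computes Z_n(k/n) for every k and reports the k at which its absolute value is largest. It assumes that the observations P_t have already been mapped to numbers; the class name CusumBreak, the numeric encoding and the choice of the scaling parameter omega are assumptions made for this example, not details taken from the authors' system.

```java
// Hypothetical sketch: cumulative-sum detection of a single mean shift
// in a numeric series P_1..P_n.
class CusumBreak {
    // z[k-1] = Z_n(k/n) = (1/sqrt(n)) * (sum_{t<=k} P_t - (k/n) * sum_{t<=n} P_t)
    static double[] cusum(double[] p) {
        int n = p.length;
        double total = 0.0;
        for (double v : p) {
            total += v;
        }
        double[] z = new double[n];
        double partial = 0.0;
        for (int k = 1; k <= n; k++) {
            partial += p[k - 1];
            z[k - 1] = (partial - (double) k / n * total) / Math.sqrt(n);
        }
        return z;
    }

    // Returns the k in {1..n} that maximizes |Z_n(k/n)|;
    // observations 1..k form the regime before the estimated break.
    static int breakPoint(double[] p) {
        double[] z = cusum(p);
        int best = 1;
        for (int k = 2; k <= z.length; k++) {
            if (Math.abs(z[k - 1]) > Math.abs(z[best - 1])) {
                best = k;
            }
        }
        return best;
    }

    // M_n = (1/omega) * max_k |Z_n(k/n)|
    static double statistic(double[] p, double omega) {
        double max = 0.0;
        for (double v : cusum(p)) {
            max = Math.max(max, Math.abs(v));
        }
        return max / omega;
    }

    public static void main(String[] args) {
        double[] series = {0, 0, 0, 0, 1, 1, 1, 1};
        System.out.println("break after observation " + breakPoint(series));
        System.out.println("M_n = " + statistic(series, 1.0));
    }
}
```

For the toy series in main, the mean shifts after the fourth observation, and that is where the absolute cumulative sum peaks. In practice omega would be estimated from the data; it is passed in explicitly here to keep the sketch small.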
Structural Break Posterior Mean:
The structural break in the unconditional mean, described above for the prior mean, can also be measured in a conditional setting in which a linear regression model is imposed on the nucleotide data set. To do so, the break is measured on multidimensional covariate data, here the DNA segments. If A, G, C, T, A, G, C, T, … is a countably infinite sequence of mutually exclusive nucleotide data sets, with $G_i$ denoting the i-th of them, then
$$\Pr(A) = \sum_{i=1}^{\infty} \Pr(A \cap G_i) = \sum_{i=1}^{\infty} \Pr(A \mid G_i)\,P(G_i).$$
Let A, C and G be nucleotides, where A and C are conditionally independent given G. The conditional independence for the structural break is then
$$\Pr(A \cap C \mid G) = \Pr(A \mid G)\,\Pr(C \mid G).$$
Again, let T be a random nucleotide and Q a DNA data segment such that Pr(Q) > 0. The posterior distribution of T given Q is
$$F_{T \mid Q}(t \mid Q) = \Pr(T \le t \mid Q) = \frac{\Pr(\{T \le t\} \cap Q)}{\Pr(Q)}.$$
When T is discrete, the posterior probability mass function of T given the DNA segment Q and the conditional expectation of T given Q are
$$p_{T \mid Q}(t \mid Q) = \Pr\{T = t \mid Q\} = \frac{\Pr(\{T = t\} \cap Q)}{P(Q)}, \qquad E[T \mid Q] = \sum_{t} t\, p_{T \mid Q}(t \mid Q).$$
If T is continuous, the expectation takes a different form:
$$E[T \mid Q] = \int_{t} t \, dF_{T \mid Q}(t \mid Q).$$
For two nucleotides T and C under a discrete structural break, the conditional probability of T given C = c and the conditional expectation of T given C = c are
$$p_{T \mid C}(t \mid c) = \Pr\{T = t \mid C = c\} = \frac{\Pr\{T = t,\, C = c\}}{\Pr\{C = c\}} = \frac{p_{T,C}(t, c)}{p_C(c)}, \qquad E[T \mid C = c] = \sum_{t} t\, p_{T \mid C}(t \mid c).$$
In the continuous case, for the same data set, the corresponding expressions are
$$f_{T \mid C}(t \mid c) = \frac{f_{T,C}(t, c)}{f_C(c)}, \qquad E[T \mid C = c] = \int_{t} t\, f_{T \mid C}(t \mid c)\, dt.$$
To find the expectations for all data sets of DNA sequences, iterated expectations are needed. For these two nucleotide sets the iterated expectation is:
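$$
\begin{aligned}
E\big[E[T \mid C]\big] &= \int_{c} E[T \mid C = c]\, f_C(c)\, dc \\
&= \int_{c} \int_{t} t\, f_{T \mid C}(t \mid c)\, dt\, f_C(c)\, dc \\
&= \int_{t} t \int_{c} f_{T,C}(t, c)\, dc\, dt \\
&= \int_{t} t\, f_T(t)\, dt \\
&= E[T].
\end{aligned}
$$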
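For all random nucleotides the overall expectation is obtained by summing over the total data set. Suppose N is a positive integer for all DNA segments, assumed independent of the identically distributed values $T_i$; then the total expectation is
$$E\!\left[\sum_{i=1}^{N} T_i \,\Big|\, N = n\right] = E\!\left[\sum_{i=1}^{n} T_i \,\Big|\, N = n\right] = E\!\left[\sum_{i=1}^{n} T_i\right] = \sum_{i=1}^{n} E[T_i] = n\,E[T].$$
The posterior probability is obtained in the same way for the remaining nucleotides adenine (A), guanine (G) and cytosine (C) as for thymine (T).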
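The conditional and iterated expectations can be checked numerically on a small discrete example. The Java sketch below uses sums in place of the integrals of the continuous case and works on a joint probability table whose values are invented purely for illustration; the class name ConditionalExpectation and the numeric coding of the outcomes are likewise assumptions of this example.

```java
// Hypothetical sketch: discrete conditional expectation E[T | C = c] and the
// iterated expectation E[E[T | C]] computed from a joint probability table.
class ConditionalExpectation {
    // joint[t][c] holds p_{T,C}(t, c); returns the marginal p_C(c).
    static double marginalC(double[][] joint, int c) {
        double sum = 0.0;
        for (double[] row : joint) {
            sum += row[c];
        }
        return sum;
    }

    // E[T | C = c] = sum_t tValues[t] * p_{T,C}(t, c) / p_C(c)
    static double expectationGivenC(double[][] joint, double[] tValues, int c) {
        double pc = marginalC(joint, c);
        if (pc <= 0.0) {
            throw new IllegalArgumentException("Pr{C = c} must be positive");
        }
        double e = 0.0;
        for (int t = 0; t < joint.length; t++) {
            e += tValues[t] * joint[t][c] / pc;
        }
        return e;
    }

    // E[E[T | C]] = sum_c E[T | C = c] * p_C(c)
    static double iteratedExpectation(double[][] joint, double[] tValues) {
        double e = 0.0;
        for (int c = 0; c < joint[0].length; c++) {
            double pc = marginalC(joint, c);
            if (pc > 0.0) {
                e += expectationGivenC(joint, tValues, c) * pc;
            }
        }
        return e;
    }

    // E[T] computed directly from the joint table.
    static double expectation(double[][] joint, double[] tValues) {
        double e = 0.0;
        for (int t = 0; t < joint.length; t++) {
            for (int c = 0; c < joint[t].length; c++) {
                e += tValues[t] * joint[t][c];
            }
        }
        return e;
    }

    public static void main(String[] args) {
        // Illustrative joint table only; entries sum to 1.
        double[][] joint = {{0.10, 0.20}, {0.30, 0.40}};
        double[] tValues = {0.0, 1.0};
        System.out.println(iteratedExpectation(joint, tValues));
        System.out.println(expectation(joint, tValues));
    }
}
```

Both printed values agree up to floating-point rounding, which is the discrete counterpart of the identity E[E[T|C]] = E[T].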
Estimation of Maximum Likelihood:
To select a model for handling an unknown nucleotide set or unknown parameters, we define the probability of a set of observations for any DNA segment under certain conditions. The outcomes are measured in real-world problems, and these measured observations help to choose the set of parameters that is most likely to have generated the observed results. Estimating the pivotal parameters from the observed outcomes in this way is called Maximum Likelihood Estimation (MLE), and it has provided consistent and efficient results in structural break finding [40, 41]. For a nucleotide data set with a few structural breaks, for example 10 DNA segments of which 5 are defective, it is easy to estimate that 50% are defective; for large and uncertain data sets, however, an established formula is necessary. For a nucleotide data set of length n with M structural breaks,
$$P_r = \frac{n!}{M!\,(n-M)!}\, \theta^{M} (1-\theta)^{n-M}$$
where $\theta$, the probability of a structural break, is the ratio of structural breaks in the total data set. In a word, we define the likelihood as LH(parameter | data) = P(data | parameter). According to this equation we measure the values using the log likelihood, up to an additive constant that does not depend on $\theta$, as follows:
$$\log LH(\theta) = M \log \theta + (n - M) \log(1 - \theta) \tag{5}$$
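For many parameters we apply a Taylor series expansion of the log likelihood around the maximum likelihood estimate $\hat\theta$ of the break probability:
$$\log LH(\theta) = \log LH(\hat\theta) + (\theta-\hat\theta)\,\frac{\partial \log LH(\hat\theta)}{\partial\theta} + 0.5\,(\theta-\hat\theta)\,\frac{\partial^2 \log LH(\hat\theta)}{\partial\theta^2}\,(\theta-\hat\theta) + O(1),$$
with
$$\frac{\partial \log LH(\hat\theta)}{\partial\theta} = SH(\hat\theta) = 0, \qquad \frac{\partial^2 \log LH(\hat\theta)}{\partial\theta^2} = IH(\hat\theta).$$
So,
$$\log LH(\theta) \approx \log LH(\hat\theta) + 0.5\,(\theta-\hat\theta)\,IH(\hat\theta)\,(\theta-\hat\theta),$$
and, evaluated at a value $\theta_r$,
$$\log LH(\hat\theta) - \log LH(\theta_r) \approx -0.5\,(\hat\theta-\theta_r)\,IH(\hat\theta)\,(\hat\theta-\theta_r).$$
From the above, equation (5) is used to calculate the maximum value for any DNA sequence data set or genome segment. The later equations complete the process of Maximum Likelihood Estimation.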
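The 10-segment example can be made concrete with a short Java sketch. It assumes the binomial model, in which the log likelihood is M log(theta) + (n - M) log(1 - theta) up to a constant and its maximizer has the closed form M/n; the class name BinomialMle and the grid search used as a cross-check are illustrative choices, not part of the authors' implementation.

```java
// Hypothetical sketch: maximum likelihood estimate of the break probability
// under a binomial model with M breaks in n segments.
class BinomialMle {
    // Log likelihood with the constant binomial coefficient omitted.
    static double logLikelihood(int breaks, int n, double theta) {
        return breaks * Math.log(theta) + (n - breaks) * Math.log(1.0 - theta);
    }

    // Closed-form maximizer of the binomial log likelihood: M / n.
    static double estimate(int breaks, int n) {
        if (n <= 0 || breaks < 0 || breaks > n) {
            throw new IllegalArgumentException("need 0 <= breaks <= n and n > 0");
        }
        return (double) breaks / n;
    }

    // Grid search over (0, 1), used only to cross-check the closed form.
    static double gridArgMax(int breaks, int n, int steps) {
        double best = Double.NaN;
        double bestLogLh = Double.NEGATIVE_INFINITY;
        for (int i = 1; i < steps; i++) {
            double theta = (double) i / steps;
            double logLh = logLikelihood(breaks, n, theta);
            if (logLh > bestLogLh) {
                bestLogLh = logLh;
                best = theta;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 10 segments, 5 of them defective.
        System.out.println(estimate(5, 10));
        System.out.println(gridArgMax(5, 10, 1000));
    }
}
```

For 5 defective segments out of 10, both the closed form and the grid search give 0.5, matching the 50% figure quoted in the text.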
Estimation of Minimum Geometric Distance:
The human genome has about 3.1 billion base pairs. To maintain hardware flexibility and memory efficiency it is important to know the minimum distance between any two consecutive base pairs or structural breaks. For hardware support, we have to determine a suitable geometrical architecture for consecutive DNA base pairs. In the molecular data set of the DNA bases A, T, G and C, non-overlapping data sets (data sets without breaks) must be used. Each base has to move half of the bond in a double-bonded DNA sequence. Besides, all base pairs should be synchronized so that the comparisons for breaks start at the same time. Consider a synchronous shared-memory machine in which a set of g processors can access a set of h memory modules in parallel. All data sets of DNA sequences are connected to the memory modules through a switching network to maintain smooth memory access and retrieval. Such an access scheme is called a Distributed Memory Machine (DMM). We illustrate the geometrical orientation of the nucleotide bases for parallel and regular memory in Figure 2. Suppose the threshold range between two consecutive nucleotides is $S_{th}$, the angular distance of the memory location is $\pi$ and the radial distance of the memory modules is $D_r$. For a circle the circumference is $C = 2\pi r$. Let $\theta$ be the angle between the arc length and the radius; in radians (Figure 3), $\theta = C/r$. The circumference of the memory module for the remaining part of the angle, $(\pi - \theta)$, is
$$C = 2(\pi - \theta)\,S_{th}$$
and
$$\cos\theta = \frac{\text{Base}}{\text{Hypotenuse}} = \frac{D_r/2}{S_{th}}, \qquad \theta = \cos^{-1}\!\left(\frac{D_r}{2 S_{th}}\right).$$
If $\theta = \pi/3 = 60^{\circ}$, the arc length for this angle is
$$C = \theta\,S_{th} = \frac{\pi S_{th}}{3}.$$
So the minimum distance between two consecutive nucleotides is
$$\text{Minimum Distance} = \frac{\text{Circumference}}{\text{Arc length}} = \frac{2(\pi-\theta)\,S_{th}}{\pi S_{th}/3} = \frac{6(\pi-\theta)\,S_{th}}{\pi S_{th}}.$$
Implementation:
We implemented and ran the experiments in Java using the NetBeans Integrated Development Environment (IDE). The object-oriented implementation allowed us to treat each nucleotide (A, C, T and G) as a distinct object. In our previous work [38, 39] we improved on the performance of [42] and observed that our RSAM algorithm is significantly better in terms of speed, complexity, space, sensitivity, accuracy and risk. Here, we have integrated the methods into a single system. The system and its analysis distinguish it from other systems. We have designed it as an automated system (Figure 4).
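One way to treat each nucleotide as a distinct object in Java is an enum, sketched below. The type name NucleotideBase and its two methods are hypothetical and serve only to illustrate the object-oriented modelling mentioned above; they are not taken from the authors' source code.

```java
// Hypothetical model of a nucleotide as a distinct object.
enum NucleotideBase {
    A, C, G, T;

    // Watson-Crick complement: A pairs with T, C pairs with G.
    NucleotideBase complement() {
        switch (this) {
            case A: return T;
            case T: return A;
            case C: return G;
            default: return C; // remaining case is G
        }
    }

    // Parses one sequence character; anything other than A, C, G or T is rejected.
    static NucleotideBase fromChar(char ch) {
        switch (Character.toUpperCase(ch)) {
            case 'A': return A;
            case 'C': return C;
            case 'G': return G;
            case 'T': return T;
            default:
                throw new IllegalArgumentException("not a nucleotide: " + ch);
        }
    }
}
```

Using an enum keeps the alphabet closed, so malformed characters in an input file are caught when the sequence is parsed rather than during the later probability calculations.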
RESULTS AND DISCUSSION
The outcomes of this system are recorded as the time the system consumes for data sets of various lengths, starting from 160MB and ending with 1650MB. Owing to the algorithmic design and analysis, this system takes less time to process nucleotide base pairs (Figure 5). As an existing tool, we ran the same data sets on DAMBE5: A Comprehensive Software Package for Data Analysis in Molecular Biology and Evolution [43]. The results from that system (Figure 6) show that it takes more time for every data set. The comparison between the two systems (Figure 7) clearly shows the difference in efficiency between the two tools.
We have compared our system with DAMBE5 [43]. Although DAMBE5 is useful for sequence retrieval, motif characterization, codon adaptation index, molecular phylogenetics sequencing, etc., it offers no algorithmic comparison of time and space complexity. In our system we have addressed time and space complexity above all. To handle large data sets it is imperative that a system runs fast and takes little space. For the same data set of DNA nucleotide bases our system performs significantly better than DAMBE5 because it uses less formatting and fewer structures (Figures 6 and 7). Although our system supports less formatting and fewer structures, it occupies much less space and takes much less time. Low time consumption is essential for many current systems and environments. Besides, only algorithmic development can improve the performance of computing devices.
DNA Sequence Breaks Comparisons:
We have developed the system based on Maximum Likelihood, the posterior mean and the prior mean. The system used primary and secondary data sets of DNA segments collected from the NCBI database (http://www.ncbi.nlm.nih.gov). On these data sets the system works very effectively. DAMBE, on the other hand, takes input through the Vienna RNA Secondary Structure library [44] to determine the secondary structure of RNA and compare minimum folding energies.
In both cases, DAMBE requires specific data formats. Owing to its algorithmic foundation, our automated system works independently of platform, structure, length and format. The platform independence comes from the Java Development Kit (JDK) environment. The system we have designed is based on Object Oriented Programming (OOP), in contrast to fixed-format languages. The result of the proposed system (Figure 5) depicts the outcome of our algorithmic process. The results of DAMBE5 (Figure 6) show that it takes more time than our system for the same data set.
DAMBE5 supports multiple formats of DNA segment structures [45]. For the same data set of DNA segments and lengths, DAMBE5 generates different results from our system (Figures 5 and 6). We examined the reasons for the additional time taken by DAMBE5 and found that the compiler takes much time to compile the structures of the DNA segments.
The main differences between our system and DAMBE5 (Figure 7) are in compilation time, size of the generated objects, dynamic memory usage during compilation and template instantiation time.
Our proposed system takes less compilation time. This metric is determined from the link time, which affects the productivity of the system. DAMBE5 uses shell timing, which distinguishes between user time and system time, but the difference is not meaningful to users. For Java, the compiler does not support incremental compilation below the granularity of a whole module. Our system consumes only CPU-intensive time, and that is why it performs better than other tools.
Because of the large data sets handled by DAMBE5, it generates excessively large pre-compiled objects for the compiler. As a result the system time becomes higher. Besides, duplication of type and naming information in the assembly and symbolic debugging data is another limitation of that system.
Our system is written in Java under the Java Development Kit (JDK 1.4) and outperforms DAMBE5 for several reasons. DAMBE5 and most current bioinformatics tools are based on scripting languages. Scripting languages such as Perl, Python, Rexx and Tcl are useful for various reasons: they are dynamic, powerful for rapid development and highly portable. However, they are unable to handle large-scale data sets. Most scripting languages are not object oriented. Consequently, they do not provide a strong environment for variables and functions. This drawback makes them unsuitable for huge, modular applications that process billions of DNA data items. Last but not least, because they are fully interpreted, these languages are quite slow.
In contrast, Java (our system) offers automatic garbage collection and memory allocation. Besides, the Java Virtual Machine (JVM) protects data automatically. Java source code is first compiled into Java bytecode. Java provides more design facilities such as interfaces, abstract classes and more levels of access control. Consequently, our system offers the benefits of a cross-platform desktop application.
CONCLUSION
Our analysis provides a probabilistic environment that automatically determines the positions of damage or break points in a given input. Our system can handle data sets of up to 1650MB of base pairs. The pivotal finding is that it can measure the geometrical distances among nucleotide data, which ensures the memory efficiency of the computing system. Besides, Maximum Likelihood Estimation makes it possible to obtain similar results from irregular data sets or sequences, and measuring the minimum geometrical distances reduces the space required. This approach can detect any affected DNA sequence in which breaks or damage have occurred due to disease or other causes. The performance of this system is significantly better than that of an existing tool, DAMBE5. In future our approach can be applied to test very long data sets of DNA and protein.
AVAILABILITY
The method is implemented in Java. The tool is free for academic users, and a version is available from the author upon e-mail request.
ACKNOWLEDGEMENT
We thank CUET and the University of Tasmania, Australia.
CONFLICT OF INTEREST
We disclose no conflict of interest.
REFERENCES:
1. Q.Lu, R.Lund, T.Lee, An MDL Approach to the Climate Segmentation Problem, The Annals of Applied Statistics 4, 299–319, 2010.
2. M.Robbins, C.Gallagher, R.Lund, A.Aue, Mean shift testing in correlated data. Journal of
Time Series Analysis 32, 498–511,2011.
3. M.W.Robbins, R.B.Lund, C.M.Gallagher, Q.Lu, Changepoints in the North Atlantic tropical cyclone record. Journal of the American Statistical Association 106, 89–99,2011.
4. S.Ahmad, S.Duke, R.Jena, M.Williams, N.G.Burnet, Advances in radiotherapy. BMJ 345, 33–38, 2012.
5. S.Delaney, S.Jacob, D. Zerbino,E. Birney, Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome research, 18(5):821–829, 2008.
6. G.Schwarz, Estimating the dimension of a model. The Annals of Statistics 6, 461–64,1978.
7. T.Lindahl, D.E.Barnes, Repair of endogenous DNA damage. Cold Spring Harb Symp
Quant Biol 65: 127–133, 2000.
8. E.C.Friedberg , G.C.Walker, W.Siede, R.D.Wood , R.A.Schultz, T.Ellenberger, DNA repair and mutagenesis, 2nd ed. ASM Press, New York,2006.
9. S.P.Jackson and J.Bartek ,The DNA-damage response in human biology and disease,
Nature 461, 1071–1078,2009.
10. A.Ciccia and S.J.Elledge, The DNA damage response: making it safe to play with knives, Mol Cell 40: 179–204, 2010.
11. M.P.Longhese, D.Bonetti, I.Guerini, N.Manfrini, M.Clerici, DNA double-strand breaks
in meiosis: checking their formation, processing and repair,DNA Repair (Amst) 8: 1127– 1138,2009.
12. A.G.Tsai, M.R.Lieber, Mechanisms of chromosomal rearrangement in the human
genome. BMC Genomics 11: S1. doi: 10.1186/1471-2164-11-S1-S1,2010.
13. J.W.Harper, S.J.Elledge,The DNA damage response: ten years after. Mol Cell, 28:739–745,2007
14. J.Rouse, S.P.Jackson, Interfaces between the detection, signaling, and repair of DNA
damage Science,297:547–551,2002.
15. J.C.Harrison, J.E.Haber, Surviving the Breakup: The DNA Damage Checkpoint,Annu Rev Genet. 40:209–235,2006.
16. V.Altmannova,N. Eckert-Boulet,M. Arneric, P.Kolesar, R.Chaloupkova, J.Damborsky,
P.Sung P, X.Zhao, M.Lisby, L.Krejci,Rad52 SUMOylation affects the efficiency of the DNA repair. Nucleic Acids Res 38: 4708–4721,2010.
17. M.R.Lieber, The mechanism of human nonhomologous DNA end joining. J Biol Chem. 283:1–5, 2008.
18. K.A.Cimprich, D.Cortez, ATR: an essential regulator of genome integrity,Nat Rev Mol Cell Biol. 2008;
19. M.B.Kastan and J.Bartek ,Cell-cycle checkpoints and cancer. Nature, 432:316–323,2004.
20. J.Bartek, J.Lukas, DNA damage checkpoints: from initiation to recovery or adaptation.
Curr Opin Cell Biol 19: 238– 245.2007.
21. S.Munoz-Galvan, A.Lopez-Saavedra, S.P.Jackson, P.Huertas, F.Cortes-Ledesma, et al, Competing roles of DNA end resection and non-homologous end joining functions in the repair of replication-born double-strand breaks by sister-chromatid recombination. Nucleic Acids Res 41: 1669–1683,2013.
22. A.Xiao, et al., WSTF regulates the H2A.X DNA damage response via a novel tyrosine
kinase activity. Nature, 457:57–62,2009.
23. P.Huertas, DNA resection in eukaryotes: deciding how to fix the break, Nat Struct Mol Biol 17: 11–16,2010.
24. C.Richardson, N.Horikoshi, T.K.Pandita, The role of the DNA double-strand break response network in meiosis, DNA Repair, 3:1149–1164, 2004.
25. M.O’Driscoll, P.A.Jeggo,The role of double-strand break repair – insights from human genetics. Nat Rev Genet 7: 45–54.2006.
26. M.McVey, S.E.Lee SE,MMEJ repair of double-strand breaks (director’s cut),deleted
sequences and alternative endings. Trends Genet 24: 529–538,2008.
27. J.R.Chapman, P.Barral, J.B.Vannier, V.Borel, M.Steger, et al., RIF1 is essential for 53BP1-dependent nonhomologous end joining and suppression of DNA double-strand break resection. Mol Cell 49: 858–871,2013.
28. J.Fishman-Lobell, N.Rudin, J.E.Haber,Two alternative pathways of double-strand break
repair that are kinetically separable and independently modulated, Mol Cell Biol, 12: 1292–1303,1992.
29. A.Ciccia, S.J.Elledge, The DNA damage response: making it safe to play with knives.
Mol Cell 40: 179–204,2010.
30. K.A.Bernstein, S.Gangloff, R.Rothstein, The RecQ DNA helicases in DNA repair. Annu Rev Genet 44: 393–417,2010.
31. P.Bork, K.Hofmann, P.Bucher, A.F.Neuwald, S.F.Altschul,E.V.Koonin,A superfamily of
conserved domains in DNA damage-responsive cell cycle checkpoint proteins. FASEB J 11: 68–76,1997.
32. K.W.Caldecott, Single-strand break repair and genetic disease, Nat Rev Genet 9: 619–
631,2008.
33. D.M.Chou, B.Adamson,N.E. Dephoure, X.Tan, A.C.Nottke, K.E.Hurov, S.P.Gygi, M.P.Colaiacovo, S.J.Elledge,A chromatin localization screen reveals poly (ADP ribose)- regulated recruitment of the repressive polycomb and NuRD complexes to sites of DNA damage. Proc Natl Acad Sci 107: 18475–18480, 2010.
34. V.Kumar, A.Grama, A.Gupta , G.Karypis, Introduction to Parallel Computing. Benjamin/Cummings Publ. Company, 1995.
35. V. Kundeti, S. R. S, H. Dinh, M. Vaughn, V. Thapar. Efficient parallel and out of core algorithms for constructing large bi-directed de bruijn graphs. BMC Bioinformatics, 11:560, 2010.
36. E. Mardis. Next-generation dna sequencing methods. Annu. Rev. Genomics Hum. Genet.,
9:387–402, 2008.
37. T.Lindahl, D.E.Barnes, Repair of endogenous DNA damage, Cold Spring Harb Symp Quant Biol,65:127–133,2000.
38. M.I.Khan., M.S.Kamal, Sequencing Ontology Alignment for DNA Annotation and Damage Identification, European Journal of Scientific Research, Volume 103 Issue 3,pp 441-450,2013.
39. M.I.Khan., M.S.Kamal. RSAM: An Integrated Algorithm for Local Sequence Alignment.
Archives Des Sciences, Vol 66, No. 5, ISSN 1661-464X,pp,395-412, 2013.
40. J.Antoch, M.Huskova, Z.Praskova, Effect of dependence on statistics for determination of change. Journal of Statistical Planning and Inference 60, 291-310,1997.
41. J.Bai, Least squares estimation of a shift in linear processes. Journal of Time Series Analysis 15, 453-472,1994.
42. H.Waqaar, A. Alex, R. Bharath, An Efficient Algorithm for Local Sequence Alignment,
20-24, 2008.
43. X.Xia, DAMBE5: A comprehensive software package for data analysis in molecular biology and evolution. Molecular Biology and Evolution 30:1720-1728, 2013.
44. I.L.Hofacker, Vienna RNA secondary structure server. Nucleic Acids Res. 31:3429–3431, 2003.
45. R.A.Vos, J.P.Balhoff , J.A.Caravas JA, et al., NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Syst Biol. 61:675–689, 2012.
Tables:
Table 1: Probabilistic values that determine the formation of the random data set in whole DNA sequences. The rows give the values for the first strand and the columns give the values for the remaining strand. The table is based on iterating only over the J values, i.e. the I values remain constant. In terms of double-stranded DNA, the nucleotides of the second strand remain unchanged for the complete genome. Entries are Pi(J), with row index i and column index j.
i \ j    A      G      C      T
A        0      0.3    0.4    0.3
G        0.2    0.5    0      0.3
C        0      0.6    0.1    0.3
T        0.1    0.4    0.2    0.3
Table 2: Probabilistic table for randomness. This table differs from Table 1 in only one respect: here both the I and J values change for the double-strand data set. This consideration allows the data set to be assessed more precisely and faster. Entries are Pi(I,J), with row index i and column index j.
i \ j    A      G      C      T
A        0      0.2    0.3    0.5
G        0.2    0.2    0.3    0.3
C        0.2    0.6    0.1    0
T        0.1    0.2    0.2    0.5
Figure Legends:
Figure 1: Scatter Probabilistic Diagram. Propagation of the random data set in genome sequences, based on the probabilistic values in Table 1. The sequence is formed according to the probability values as shown in this figure. The initial probabilities are 0.5 for each nucleotide data set, and these probabilities may change in subsequent propagation.
Figure 2: Nucleotide Interaction. The geometric orientation of the nucleotides in the database and in physical memory. The orientations of A (adenine), G (guanine), C (cytosine) and T (thymine) are illustrated in circular fashion. The distance between two consecutive nucleotide base pairs is πSth/3, where S is the circumference of the considered circle.
Figure 3: Geometric Formation. The radian angle that determines how much angular movement or distance is covered by the nucleotide data set. Here r is the radius, C is the distance covered by the arc and θ is the angle between arc and radius. By the definition of the radian, θ=C/r.
Figure 4: An interface of our proposed system. The Browse button selects the file that stores the data set. The Show result button generates the results. We found seven damages within a double-strand DNA sequence. The exact locations of the damages or breaks are also measured. The Changes file button allows the data set to be changed.
Figure 5: Platform Independent Automated System for Structural Break Detection (ASSBD) in DNA Sequence alignment. The base pair lengths are on the X-axis and the time required by the system is on the Y-axis. The DNA base pair lengths range from 150MB bp to 1650MB bp. The smooth line shows linearity through the origin. The Automated System [Platform Independent] is almost linear. For data sets with base pair length over 75MB bp, the system lies slightly below the linear line due to the uncertain prior mean.
Figure 6: DAMBE5 experimental result for structural break detection. The same data set is used in both the Platform Independent Automated System and DAMBE5. At the starting point, for 150MB data size, the proposed system takes 1.2 ns and DAMBE5 takes 2 ns. In the second comparison, for the 250MB data set, our system takes 1.9 ns and DAMBE5 takes 3.1 ns. For all data sizes our system significantly outperforms DAMBE5. The last timing for the given data set is 11.1 ns for our system and 14.88 ns for DAMBE5. That is, (14.88 - 11.2)/14.88 * 100 = 24.73% faster than DAMBE5. As the sequence length increases, the performance difference also increases.
Figure 7: Comparison between our proposed system and DAMBE5. The dark blue line shows the timings for DAMBE5 and the dark red line shows the timings for the proposed system. The X axis shows the length in base pairs and the Y axis shows the corresponding time for each sequence length. The initial length of the data set is 150MB. For this data set DAMBE5 takes 2 ns, while our system processes the same data set in 1.2 ns. The initial timing difference is 2 - 1.2 = 0.80 ns; for the second data set the difference is 3.1 - 1.9 = 1.20 ns; the third timing difference is 3.5 - 2.5 = 1 ns for the 350MB data volume. At the last stage the difference is 10.3 - 8.5 = 1.8 ns for the 1650MB data volume. The performance of DAMBE5 decreases as the data size increases, due to its lack of robustness for large data sets.