DNA Compression (Encoded Using the Huffman Coding Method)
Marwa K. Al-Rikaby
University of Babylon / College of Information Technology
DNA
One of the building blocks of living organisms. It consists of four chemical bases:
• Adenine (A)
• Thymine (T)
• Cytosine (C)
• Guanine (G)
DNA bases pair up with each other, A with T and C with G, to form units called base pairs.
Human DNA contains around 3 billion bases, and about 99% of them are the same in any two people.
DNA Compression Basics
Goal: make sequences easier to analyze while saving space and time.
DNA sequences are built from the alphabet {A, T, C, G} and contain various repeats, most of them approximate rather than exact.
Only lossless algorithms are valid.
A DNA compression model should preferably:
• Be based on biological knowledge.
• Give compression.
• Be simple, with few parameters.
• Give the per-symbol information content.
• Be an efficient algorithm.
How to compress DNA?
Since DNA sequences contain only the four bases {a, c, g, t}, they can be stored using two bits per input symbol.
Standard compression tools such as gzip and bzip2 usually fail to achieve any compression on this data, since they end up using more than two bits per symbol.
When compressing 229354 bases (57338 bytes at two bits per base), we get:
• HEHCMVCG: 57338 bytes (without compression).
• gzip: 66741 bytes (negative compression).
• bzip2: 62169 bytes (negative compression).
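The two-bits-per-base baseline can be sketched as follows. This is a minimal illustration, not the presenter's implementation; the particular bit assignment in `CODE` and the function names `pack` and `unpack` are assumptions made for the example.

```python
# Hypothetical 2-bit mapping; any fixed assignment of the four bases works.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = {value: base for base, value in CODE.items()}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpack() reads every byte the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])


seq = "TGATAGGTGATAGATATGA"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
assert unpack(packed, len(seq)) == seq
```

The number of bases has to be stored alongside the packed bytes, because the last byte may be only partly used.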
How to compress DNA?
For multiple genomes from the same species, as produced by ‘resequencing’ technologies, the flat text file approach is clearly wasteful, since for the most part the sequences are identical.
A simple approach is to store a reference sequence and then, for each other sequence, encode only the differences (or ‘deltas’) with respect to the reference.
Consider the sequences AACGACTAGTAATTTG and CACGTCTAGTAATGTG, which are identical except for substitutions at positions 1 (A→C), 5 (A→T) and 14 (T→G). Each SNP (single-nucleotide polymorphism) can be encoded by a pair (i, X), where i is an integer encoding the position and X is the substituted base relative to the reference.
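The (i, X) pairs for this example can be computed with a short sketch. It assumes 1-based positions, as in the example, and handles substitutions only; the function names `encode_deltas` and `apply_deltas` are made up for illustration.

```python
def encode_deltas(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where sample differs from reference.

    Positions are 1-based. Only substitutions are handled, so both
    sequences must have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    return [
        (i, s)
        for i, (r, s) in enumerate(zip(reference, sample), start=1)
        if r != s
    ]


def apply_deltas(reference: str, deltas: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference and its substitution list."""
    bases = list(reference)
    for pos, base in deltas:
        bases[pos - 1] = base
    return "".join(bases)


reference = "AACGACTAGTAATTTG"
sample = "CACGTCTAGTAATGTG"
deltas = encode_deltas(reference, sample)
print(deltas)  # [(1, 'C'), (5, 'T'), (14, 'G')]
assert apply_deltas(reference, deltas) == sample
```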
How to compress DNA?
Although the basic idea is easy to understand, and not new, a precise implementation requires addressing a number of important technical issues:
• Local relative addresses, i.e. intervals, can be used instead of absolute addresses. Using intervals, with the first one measured from position 1, the above example ‘1C5T14G’ becomes ‘0C4T9G’. With intervals, the dynamic range of the integers to be encoded may be considerably smaller than with absolute addresses. The relatively modest price to pay is that the intervals must be summed to recover absolute coordinates.
• If the positions at which variations occur in the population are fixed and form a relatively small subset of all possible positions, additional savings may result from focusing only on those positions.
• The choice of the reference sequence.
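The conversion between absolute addresses and intervals can be sketched as below. The sketch assumes sorted 1-based positions and measures the first interval from position 1, which is what turns 1, 5, 14 into 0, 4, 9 in the example; the function names are illustrative.

```python
from itertools import accumulate


def to_intervals(positions: list[int]) -> list[int]:
    """Convert sorted 1-based positions into gaps between consecutive positions."""
    gaps = []
    previous = 1  # the first gap is measured from position 1
    for pos in positions:
        gaps.append(pos - previous)
        previous = pos
    return gaps


def to_positions(gaps: list[int]) -> list[int]:
    """Sum the gaps to recover the absolute 1-based positions."""
    return [1 + total for total in accumulate(gaps)]


gaps = to_intervals([1, 5, 14])
print(gaps)                # [0, 4, 9]
print(to_positions(gaps))  # [1, 5, 14]
```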
How to compress DNA?
All applications of the basic ideas hinge on a fundamental technical problem: how to encode integers, representing for instance absolute or relative genomic addresses or read lengths, as binary strings?
We are interested in binary encoding schemes for sequences of integers that can be parsed automatically and that, consistently with information theory, are entropy efficient, in the sense that fewer bits are used to encode more frequent events.
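One well-known scheme with these properties is the Elias gamma code, shown here only as an illustration; the slides do not say which integer code is used. It spends fewer bits on small integers and needs no separators between codewords. Because gamma codes start at 1, the sketch encodes each interval g as g + 1, which is an assumption of this example.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma encodes integers >= 1")
    binary = bin(n)[2:]
    # Prefix with (length - 1) zeros so the decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits: str) -> list[int]:
    """Parse a concatenation of gamma codewords back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


gaps = [0, 4, 9]
stream = "".join(gamma_encode(g + 1) for g in gaps)
print(stream)  # 1001010001010
print([v - 1 for v in gamma_decode_all(stream)])  # [0, 4, 9]
```

The sketch assumes a well-formed bit stream; a production decoder would also have to reject truncated input.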
How to compress DNA?
Common components of most DNA compression algorithms:
1. Finding the candidate repeat segments.
2. Considering approximate repeats.
3. Selecting the best subset of compatible repeats.
4. Encoding of the repeat segments.
5. Encoding of the non-repeat segments.
How to compress DNA?
Suppose we have the following DNA sequence:
TGATAGGTGATAGATATGATTGATAGATGATAGAAGATTGATAGATGATAGGGATATCACGTAGTCCCTAGCTCTTGGCGCTGGATGGGGCGGACGGTAAGGGAAATCGACCGTTGATAGTCCAAATTCGGTCGTATGATAGAAATTTCGAATGGAAATTCTGATACATAGGTGATAGTAGATGTAAGATGATAGATGATAGATAGATAGATGATAGACAGATTGATAGATGATAGAGAGA
How to compress DNA?
1. Finding the candidate repeat segments:
Let “TGATAG” be a candidate segment; we scan the example sequence and mark each of its repetitions.
How to compress DNA?
The total number of “TGATAG” occurrences is 14.
The repetitions of every candidate segment should be located in this way.
The resulting counts are kept for use in the encoding.
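The count can be reproduced with a simple left-to-right scan. This is a sketch that assumes non-overlapping matches; the function name `find_occurrences` is made up for the example, and the sequence is the one from the slides, split across lines only for readability.

```python
SEQUENCE = (
    "TGATAGGTGATAGATATGATTGATAGATGATAGAAGATTGATAGATGATAGGGATATCACG"
    "TAGTCCCTAGCTCTTGGCGCTGGATGGGGCGGACGGTAAGGGAAATCGACCGTTGATAGTCC"
    "AAATTCGGTCGTATGATAGAAATTTCGAATGGAAATTCTGATACATAGGTGATAGTAGATG"
    "TAAGATGATAGATGATAGATAGATAGATGATAGACAGATTGATAGATGATAGAGAGA"
)


def find_occurrences(sequence: str, segment: str) -> list[int]:
    """Return the 0-based start positions of non-overlapping occurrences."""
    positions = []
    start = sequence.find(segment)
    while start != -1:
        positions.append(start)
        # Resume the search after the end of the match just found.
        start = sequence.find(segment, start + len(segment))
    return positions


hits = find_occurrences(SEQUENCE, "TGATAG")
print(len(hits))  # 14
```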
How to compress DNA?
2. Considering approximate repeats:
Scan the sequence for similar segments, i.e. segments that become identical after applying one of four basic operations (positions are counted from 0):
• Insertion: “AAATTCG” becomes “AAATTCTG” after Ins(T,6).
• Deletion: “AAATTCG” becomes “AAATTG” after Del(5,1).
• Replacement: “AAATTCG” becomes “AATTTCG” after Rep(2,T).
• Reverse: “AAATTCG” becomes “GCTTAAA” after Rev().
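The four operations can be written as small string functions. The sketch assumes 0-based positions, which is what the slide's examples imply, and that Del takes a position and a length; the Python names are illustrative only.

```python
def ins(segment: str, base: str, pos: int) -> str:
    """Ins(base, pos): insert base so that it ends up at index pos."""
    return segment[:pos] + base + segment[pos:]


def delete(segment: str, pos: int, length: int) -> str:
    """Del(pos, length): remove length bases starting at index pos."""
    return segment[:pos] + segment[pos + length:]


def rep(segment: str, pos: int, base: str) -> str:
    """Rep(pos, base): replace the base at index pos."""
    return segment[:pos] + base + segment[pos + 1:]


def rev(segment: str) -> str:
    """Rev(): reverse the segment."""
    return segment[::-1]


print(ins("AAATTCG", "T", 6))   # AAATTCTG
print(delete("AAATTCG", 5, 1))  # AAATTG
print(rep("AAATTCG", 2, "T"))   # AATTTCG
print(rev("AAATTCG"))           # GCTTAAA
```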
How to compress DNA?
2. Considering approximate repeats (examples from the sequence, positions counted from 0):
• Let “ATATGA” be a reference segment; then “ATATCA” is identical to it if we replace “G” by “C”.
• “ATATGA” is identical to “ATAGA” after deleting “T” at position 3.
• “GGCGC” is identical to “GGCGG” after replacing “C” by “G” at position 4.
• “AATGG” is identical to “GGTAA” after reversing it.
How to compress DNA?
3. Selecting the best subset of compatible repeats:
Choosing the reference segments is a major and very sensitive step, since the design of the reference impacts not only the variants to be recorded but also the intervals; it must therefore take into account any constraints a particular implementation places on the intervals and their encodings.
In our example, each segment that we have detected is given an integer index into the reference table.
Segment   Index
A         0
T         1
C         2
G         3
TGATAG    4
ATATGA    5
AAATTCG   6
GGTAA     7
GGCGC     8
RepC      9
Del       10
InsT      11
Rev       12
RepG      13
RepT      14
The reference table contains:
• The four basic symbols {A, T, C, G}.
• The candidate segments.
• The basic operations, each with the parameters that are applied to the sequence.
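As a small illustration, the reference table can be held as a list whose positions are the indexes. The variable names are assumptions of this sketch; the entries and their order follow the table in the slides.

```python
BASES = ["A", "T", "C", "G"]
SEGMENTS = ["TGATAG", "ATATGA", "AAATTCG", "GGTAA", "GGCGC"]
OPERATIONS = ["RepC", "Del", "InsT", "Rev", "RepG", "RepT"]

# The position in this list is the integer index used in the reference table.
entries = BASES + SEGMENTS + OPERATIONS
index_of = {entry: i for i, entry in enumerate(entries)}

for entry in entries:
    print(f"{entry:<8} {index_of[entry]}")
```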
4. Encoding of the repeat segments:
First, the repetitions of each candidate segment are counted in the same way shown in step 1.
Segment   Index   Repetitions
A         0
T         1
C         2
G         3
TGATAG    4       14
ATATGA    5       1
AAATTCG   6       1
GGTAA     7       1
GGCGC     8       1
RepC      9
Del       10
InsT      11
Rev       12
RepG      13
RepT      14
5. Encoding of the non-repeat segments:
The approximate segments are processed in the same way shown in step 2: each approximate repeat adds one to the count of its reference segment and one to the count of the operation involved. The remaining bases, which are not covered by any segment, are counted as single symbols.
Segment   Index   Repetitions
A         0       28
T         1       19
C         2       14
G         3       25
TGATAG    4       14
ATATGA    5       3
AAATTCG   6       4
GGTAA     7       2
GGCGC     8       2
RepC      9       1
Del       10      2
InsT      11      2
Rev       12      1
RepG      13      1
RepT      14      1
Encoding by Huffman method
First, find the probability of each segment:
Segment   Index   Repetitions   Probability
A         0       28            28/119
T         1       19            19/119
C         2       14            14/119
G         3       25            25/119
TGATAG    4       14            14/119
ATATGA    5       3             3/119
AAATTCG   6       4             4/119
GGTAA     7       2             2/119
GGCGC     8       2             2/119
RepC      9       1             1/119
Del       10      2             2/119
InsT      11      2             2/119
Rev       12      1             1/119
RepG      13      1             1/119
RepT      14      1             1/119
Total number of segments = 119
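The probabilities are simply the repetition counts divided by their total, as the following sketch shows. The dictionary name `COUNTS` is an assumption; the values are copied from the table. Note that `Fraction` reduces fractions automatically (28/119 is stored as 4/17), so the sketch prints the unreduced form by hand.

```python
from fractions import Fraction

COUNTS = {
    "A": 28, "T": 19, "C": 14, "G": 25,
    "TGATAG": 14, "ATATGA": 3, "AAATTCG": 4, "GGTAA": 2, "GGCGC": 2,
    "RepC": 1, "Del": 2, "InsT": 2, "Rev": 1, "RepG": 1, "RepT": 1,
}

total = sum(COUNTS.values())
probabilities = {name: Fraction(count, total) for name, count in COUNTS.items()}

print("total =", total)  # total = 119
for name, count in COUNTS.items():
    print(f"{name:<8} {count}/{total}")
```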
Encoding by Huffman method
Arrange the segments in non-decreasing order of probability:
Index:       14, 13, 12, 9, 11, 10, 8, 7, 5, 6, 4, 2, 1, 3, 0
Probability: 1/119, 1/119, 1/119, 1/119, 2/119, 2/119, 2/119, 2/119, 3/119, 4/119, 14/119, 14/119, 19/119, 25/119, 28/119
Encoding by Huffman method
Build the Huffman coding tree: repeatedly merge the two nodes with the smallest probabilities into a new parent node whose probability is their sum, until a single root remains. The parent nodes are created in this order:
2/119, 2/119, 4/119, 4/119, 4/119, 7/119, 8/119, 11/119, 19/119, 28/119, 38/119, 53/119, 66/119, 119/119 (root)
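The order in which the parent nodes appear can be reproduced with a priority queue. This sketch works on the integer repetition counts (the numerators over 119) and uses Python's `heapq`; the function name `merge_order` is illustrative.

```python
import heapq

# Repetition counts of the 15 reference-table entries (numerators over 119).
WEIGHTS = [28, 19, 14, 25, 14, 3, 4, 2, 2, 1, 2, 2, 1, 1, 1]


def merge_order(weights: list[int]) -> list[int]:
    """Return the weights of the parent nodes in the order Huffman creates them."""
    heap = list(weights)
    heapq.heapify(heap)
    parents = []
    while len(heap) > 1:
        smallest = heapq.heappop(heap)
        second = heapq.heappop(heap)
        parents.append(smallest + second)
        heapq.heappush(heap, smallest + second)
    return parents


print(merge_order(WEIGHTS))
# [2, 2, 4, 4, 4, 7, 8, 11, 19, 28, 38, 53, 66, 119]
```

The sum of these parent weights, 365, is also the total number of bits the resulting code spends on the 119 segments.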
Encoding by Huffman method
Next, starting from the root and working down to the leaves, label the two branches below each node: the branch to the child with the smaller value gets 0 and the branch to the child with the larger value gets 1.
Encoding by Huffman method
Finally, encode each segment by reading the branch labels along the path from the root to its leaf. For example:
Code(0) = Code(A) = 10
Code(3) = Code(G) = 00
Code(9) = Code(RepC) = 1100111
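The whole procedure (merge, label, read codes from root to leaf) can be sketched in a few lines. This is a generic Huffman construction, not the presenter's code; where several nodes have equal weight it may break ties differently from the slides, so individual codes can differ while the total encoded length stays the same.

```python
import heapq
from itertools import count

COUNTS = {
    "A": 28, "T": 19, "C": 14, "G": 25,
    "TGATAG": 14, "ATATGA": 3, "AAATTCG": 4, "GGTAA": 2, "GGCGC": 2,
    "RepC": 1, "Del": 2, "InsT": 2, "Rev": 1, "RepG": 1, "RepT": 1,
}


def huffman_codes(counts: dict[str, int]) -> dict[str, str]:
    """Build a Huffman tree and return the code of each leaf."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(weight, next(tiebreak), name) for name, weight in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w_small, _, small = heapq.heappop(heap)
        w_large, _, large = heapq.heappop(heap)
        # Internal nodes are (smaller child, larger child) tuples.
        heapq.heappush(heap, (w_small + w_large, next(tiebreak), (small, large)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")  # smaller child gets 0
            walk(node[1], prefix + "1")  # larger child gets 1
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes(COUNTS)
total_bits = sum(COUNTS[name] * len(code) for name, code in codes.items())
print(total_bits)  # 365
```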
Encoding by Huffman method
The final reference table is:
Segment   Index   Repetitions   Probability   Code
A         0       28            28/119        10
T         1       19            19/119        111
C         2       14            14/119        011
G         3       25            25/119        00
TGATAG    4       14            14/119        010
ATATGA    5       3             3/119         110110
AAATTCG   6       4             4/119         110111
GGTAA     7       2             2/119         110101
GGCGC     8       2             2/119         110100
RepC      9       1             1/119         1100111
Del       10      2             2/119         110001
InsT      11      2             2/119         110000
Rev       12      1             1/119         1100110
RepG      13      1             1/119         1100101
RepT      14      1             1/119         1100100
Keep in mind that only the segments and their codes are needed by the decoder.
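With this table, encoding is a lookup per segment and decoding reads bits until they match a code. The sketch below uses the codes from the table; the function names and the sample message (the first five symbols of the example sequence, TGATAG G TGATAG ATATGA T) are chosen for illustration.

```python
CODES = {
    "A": "10", "T": "111", "C": "011", "G": "00",
    "TGATAG": "010", "ATATGA": "110110", "AAATTCG": "110111",
    "GGTAA": "110101", "GGCGC": "110100", "RepC": "1100111",
    "Del": "110001", "InsT": "110000", "Rev": "1100110",
    "RepG": "1100101", "RepT": "1100100",
}


def encode(segments: list[str]) -> str:
    """Concatenate the codes of the given segments."""
    return "".join(CODES[segment] for segment in segments)


def decode(bits: str) -> list[str]:
    """Read bits left to right, emitting a segment whenever a code is matched."""
    lookup = {code: name for name, code in CODES.items()}
    segments = []
    current = ""
    for bit in bits:
        current += bit
        if current in lookup:
            segments.append(lookup[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return segments


message = ["TGATAG", "G", "TGATAG", "ATATGA", "T"]
bits = encode(message)
print(bits)       # 01000010110110111
print(len(bits))  # 17 bits for 19 bases
assert decode(bits) == message
```

Greedy decoding works here only because no code is a prefix of another.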
Encoding by Huffman method
The resulting code satisfies both the prefix property and information theory, in that:
• No segment's code is a prefix of another segment's code.
• The shortest codes are given to the most frequent segments, while the longest codes are assigned to the least frequent ones.
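Both properties can be checked mechanically against the final table. The sketch below also compares the average code length with the entropy of the segment distribution; the helper names are made up, and the entropy comparison is an addition for illustration rather than something stated in the slides.

```python
from itertools import permutations
from math import log2

# (repetitions, code) per segment, copied from the final reference table.
TABLE = {
    "A": (28, "10"), "T": (19, "111"), "C": (14, "011"), "G": (25, "00"),
    "TGATAG": (14, "010"), "ATATGA": (3, "110110"), "AAATTCG": (4, "110111"),
    "GGTAA": (2, "110101"), "GGCGC": (2, "110100"), "RepC": (1, "1100111"),
    "Del": (2, "110001"), "InsT": (2, "110000"), "Rev": (1, "1100110"),
    "RepG": (1, "1100101"), "RepT": (1, "1100100"),
}


def is_prefix_free(codes: list[str]) -> bool:
    """True if no code is a prefix of a different code."""
    return not any(b.startswith(a) for a, b in permutations(codes, 2))


def lengths_follow_frequency(table: dict) -> bool:
    """True if more frequent segments never get longer codes."""
    rows = sorted(table.values(), key=lambda row: -row[0])
    lengths = [len(code) for _, code in rows]
    return lengths == sorted(lengths)


total = sum(n for n, _ in TABLE.values())
total_bits = sum(n * len(code) for n, code in TABLE.values())
average = total_bits / total
entropy = -sum((n / total) * log2(n / total) for n, _ in TABLE.values())

print(is_prefix_free([code for _, code in TABLE.values()]))  # True
print(lengths_follow_frequency(TABLE))                       # True
print(total_bits, "bits in total,", round(average, 2), "bits per segment")
print("entropy:", round(entropy, 2), "bits per segment")
```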
Thank you