algorithm of ngs data

Speaker: Eric C.Y., LEE Advisor: I-Fang Chung 2011.Mar.21 1 Monday, March 21, 2011

Upload: eric-lee

Post on 12-Nov-2014


Page 1: Algorithm of NGS Data

Speaker: Eric C.Y., LEE

Advisor: I-Fang Chung

2011.Mar.21

Page 2: Algorithm of NGS Data

Outline

• Motivation

• Workflow

• Result

• Conclusion

• My Comment


Page 3: Algorithm of NGS Data

Motivation

• High-throughput sequencing (HTS) technology now plays an important role in the life sciences.

• Different high-throughput sequencing technologies are competing to sequence an individual human genome for less than $1,000 within a few years.

Science, Vol. 311, 17 March 2006

Page 4: Algorithm of NGS Data

Motivation

• The amount of data produced by HTS technologies creates a significant bioinformatics challenge: understanding, storing, and sharing the data.

Page 5: Algorithm of NGS Data

Workflow

• Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman, ...

• Analyze datasets: Dataset1, Dataset2, Dataset3, ...

• Preliminary result: for location, for mismatch, ...

Page 6: Algorithm of NGS Data

Coding Strategy

Claude Shannon (1916-2001)

From a compression standpoint, the optimal encoding of these integers depends on their distribution: shorter binary codes are assigned to more probable symbols.

~ Shannon’s Entropy Coding Theory
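
As a rough illustration of the entropy bound behind this statement, the sketch below computes the Shannon entropy of a list of integers. The function name `shannon_entropy` and the sample values are assumptions made for illustration; they do not come from the talk.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the Shannon entropy (bits per symbol) of a sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    # H = -sum(p * log2(p)) over the observed symbol probabilities.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical samples: a skewed distribution needs fewer bits per symbol
# than a uniform one over the same number of values.
skewed = [1, 1, 1, 1, 1, 1, 2, 3]
uniform = [1, 2, 3, 4, 1, 2, 3, 4]
print(shannon_entropy(skewed))
print(shannon_entropy(uniform))  # 2.0 bits: four equally likely symbols
```

The lower the entropy, the more a code that favours the frequent values can save over a fixed-width representation.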


Page 7: Algorithm of NGS Data

Encoding Strategies

• Fixed Codes

  • Golomb-Rice Codes

  • Elias Gamma Codes

  • Monotone Value Codes

• Variable Codes

  • Huffman Code

Page 8: Algorithm of NGS Data

Golomb-Rice Codes

Set m = 10 and encode n = 42.

Encoding of the quotient part:

q     output bits
0     0
1     10
2     110
3     1110
4     11110
5     111110
6     1111110
..    ..
N     <N repetitions of 1>0

Encoding of the remainder part (for r ≥ 6 the binary column holds r + 6):

r     binary    output bits
0     0000      000
1     0001      001
2     0010      010
3     0011      011
4     0100      100
5     0101      101
6     1100      1100
7     1101      1101
8     1110      1110
9     1111      1111

n = 42, n / m: q = 4, r = 2

Output is 11110010.
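
The two tables can be reproduced with a short sketch: a unary quotient followed by a truncated-binary remainder. The function names are illustrative assumptions, and the code expects m ≥ 2.

```python
def unary(q):
    """Quotient part: q ones followed by a terminating zero."""
    return "1" * q + "0"


def truncated_binary(r, m):
    """Remainder part for a divisor m that need not be a power of two."""
    if m < 2:
        raise ValueError("this sketch expects m >= 2")
    b = (m - 1).bit_length()   # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m      # the first `cutoff` remainders use b - 1 bits
    if r < cutoff:
        return format(r, "0{}b".format(b - 1))
    # Larger remainders are shifted by `cutoff` and written with b bits.
    return format(r + cutoff, "0{}b".format(b))


def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with divisor m."""
    q, r = divmod(n, m)
    return unary(q) + truncated_binary(r, m)


print(golomb_encode(42, 10))  # 11110010
```

With m = 10 the cutoff is 6, which is why remainders 0 to 5 take three bits and remainders 6 to 9 take four.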

Page 9: Algorithm of NGS Data

Elias Gamma Codes

number   2^N + r    output
1        2^0 + 0    1
2        2^1 + 0    010
3        2^1 + 1    011
4        2^2 + 0    00100
5        2^2 + 1    00101
6        2^2 + 2    00110
7        2^2 + 3    00111
8        2^3 + 0    0001000
9        2^3 + 1    0001001
10       2^3 + 2    0001010
11       2^3 + 3    0001011
12       2^3 + 4    0001100
13       2^3 + 5    0001101
14       2^3 + 6    0001110
15       2^3 + 7    0001111
16       2^4 + 0    000010000
17       2^4 + 1    000010001

Example: 42 = 2^5 + 10

00000101010
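
A minimal sketch of the Elias Gamma rule in the table: write N zeros, then the (N + 1)-bit binary form of the number. The function names are illustrative assumptions.

```python
def elias_gamma_encode(n):
    """Encode a positive integer: N zeros, then the (N + 1)-bit binary of n."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode one Elias gamma codeword from the start of a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    # The codeword body is the leading 1 plus the next `zeros` bits.
    return int(bits[zeros:2 * zeros + 1], 2)


print(elias_gamma_encode(42))             # 00000101010
print(elias_gamma_decode("00000101010"))  # 42
```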


Page 10: Algorithm of NGS Data

MOV Coding

number   2^N + r    output
1        2^0 + 0    1
2        2^1 + 0    10
3        2^1 + 1    11
4        2^2 + 0    100
5        2^2 + 1    101
6        2^2 + 2    110
7        2^2 + 3    111
8        2^3 + 0    1000
9        2^3 + 1    1001
10       2^3 + 2    1010
11       2^3 + 3    1011
12       2^3 + 4    1100
13       2^3 + 5    1101
14       2^3 + 6    1110
15       2^3 + 7    1111
16       2^4 + 0    10000
17       2^4 + 1    10001

Each code begins at the significant 1-bit of the Elias Gamma code.

Decode: 10001

2^4 + (0001)_2 = 17, where 0001 is the 4-bit remainder.
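
The decoding step can be sketched as follows, assuming the decoder already knows where each codeword starts and ends (the slide does not show how the length is signalled). The function names are illustrative assumptions.

```python
def mov_encode(n):
    """Plain binary of n, i.e. the Elias gamma code without its zero prefix."""
    if n < 1:
        raise ValueError("expected a positive integer")
    return format(n, "b")


def mov_decode(bits):
    """Decode as 2^N plus the N-bit remainder that follows the leading 1."""
    n_bits = len(bits) - 1  # N bits after the leading 1
    remainder = int(bits[1:], 2) if n_bits else 0
    return (1 << n_bits) + remainder


print(mov_encode(17))       # 10001
print(mov_decode("10001"))  # 17
```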

Page 11: Algorithm of NGS Data

Huffman Codes

Example string: “this is an example of a huffman tree”
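
A minimal sketch of building a Huffman code for this string with a priority queue. The function name `huffman_codes` and the tie-breaking order are assumptions, so individual codewords may differ from the tree drawn on the slide, although the total encoded length does not.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a prefix code table {symbol: bit string} from symbol counts."""
    counts = Counter(text)
    # Heap items: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "this is an example of a huffman tree"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(len(text) * 8, "bits as 8-bit characters")  # 288
print(len(encoded), "bits with the Huffman code")  # 135
```

Frequent symbols such as the space and the letters a and e end up with the shortest codewords.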


Page 12: Algorithm of NGS Data

Workflow

• Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman, ...

• Analyze datasets: Dataset1, Dataset2, Dataset3, ...

• Preliminary result: for location, for mismatch, ...

Page 13: Algorithm of NGS Data

Dataset1

• Retrotransposon Ty3 insertion sites in the yeast genome.

• 6,439,584 reads of 19 bp.

• Highly clustered.

• High degree of repetition.

• At most two substitutions per read.

Reads by number of substitutions: 0: 54%, 1: 14%, 2: 32%.

Page 14: Algorithm of NGS Data

Dataset2

• In vivo binding site locations of the neuron-restrictive silencer factor (NRSF) in humans.

• Mapped to hg18.

• 1,697,990 reads of 25 bp.

• At most two substitutions per read.

Reads by number of substitutions: 0: 76%, 1: 18%, 2: 6%.

Page 15: Algorithm of NGS Data

Dataset2 Nucleotide Substitutions


Page 16: Algorithm of NGS Data

Dataset3

• Corresponds to a full diploid human genome sequencing experiment for an Asian individual.

• Large dataset; only the reads mapped to chr. 22 are used.

• 31,118,531 reads of 30 to 40 bp.

Reads by number of substitutions: 0: 61%, 1: 20%, 2: 19%.

Page 17: Algorithm of NGS Data

Workflow

• Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman, ...

• Analyze datasets: Dataset1, Dataset2, Dataset3, ...

• Preliminary result: for location, for mismatch, ...

Page 18: Algorithm of NGS Data

Alignment Result Example

Bowtie

Name of read that aligned

Strand

Name of the reference sequence where the alignment occurs

0-based offset into the forward reference strand

Read sequence

Read quality

Value of ceiling

Mismatch descriptors
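
To make the column layout concrete, the sketch below parses one such record. It assumes the eight fields above are tab-separated in this order and that each mismatch descriptor has the form offset:reference-base>read-base, which is my understanding of Bowtie's default output rather than something shown on the slide. The `Alignment` type, the function name, and the sample record are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    read_name: str
    strand: str
    reference: str
    offset: int        # 0-based, on the forward reference strand
    sequence: str
    quality: str
    ceiling: int
    mismatches: List[Tuple[int, str, str]]  # (offset in read, ref base, read base)


def parse_alignment(line):
    """Split one tab-separated alignment record into typed fields."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (8 - len(fields))  # the mismatch column may be empty
    mismatches = []
    for item in filter(None, fields[7].split(",")):
        pos, change = item.split(":")
        ref_base, read_base = change.split(">")
        mismatches.append((int(pos), ref_base, read_base))
    return Alignment(fields[0], fields[1], fields[2], int(fields[3]),
                     fields[4], fields[5], int(fields[6]), mismatches)


# Hypothetical record, made up for illustration only.
sample = "\t".join(["read_1", "+", "chr22", "14430",
                    "ACGTACGTACGTACGTACGTACGTA", "I" * 25, "0", "22:G>A"])
record = parse_alignment(sample)
print(record.reference, record.strand, record.offset, record.mismatches)
```

Only the reference name, strand, offset, and mismatches are needed for the location and mismatch encodings discussed next.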


Page 19: Algorithm of NGS Data

Encoding Location Information

• Standalone: encode each column independently.

• Combined: combine the chromosome, strand, and mismatch columns, then compress them together.

Page 20: Algorithm of NGS Data

Apply the Algorithms

• Elias Gamma (EG) Absolute

• The sequences cannot be sorted.

• Can be applied to Dataset3.

Page 21: Algorithm of NGS Data

Apply the Algorithms

• Relative Elias Gamma (REG)

• The sequences can be sorted, so compression performance is much better.

• Sort by location address and encode relative addresses instead of absolute ones.
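
A minimal sketch of why relative addresses help: after sorting, the gaps between neighbouring positions are small, so their Elias Gamma codes are short. The +1 shift (Elias Gamma cannot encode 0), the function names, and the sample positions are assumptions for illustration.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer as a bit string."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def encode_absolute(positions):
    """Encode every position on its own (shifted by 1 to avoid 0)."""
    return "".join(elias_gamma(p + 1) for p in positions)


def encode_relative(positions):
    """Sort, then encode each gap to the previous position."""
    bits, previous = [], 0
    for p in sorted(positions):
        bits.append(elias_gamma(p - previous + 1))  # +1 so a gap of 0 is encodable
        previous = p
    return "".join(bits)


# Hypothetical clustered start positions, including one repeated location.
positions = [100000, 100004, 100004, 100019, 100250, 100251]
print(len(encode_absolute(positions)), "bits absolute")
print(len(encode_relative(positions)), "bits relative")
```

Only the first position pays the full cost; every later one is stored as a small difference.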


Page 22: Algorithm of NGS Data

Apply the Algorithms

• Relative Elias Gamma Indexed (REG Indexed)

• Sort the sequences and create an index file.

• Combine chromosome, strand, and mismatches, and compress them by relative location.

• Cannot be applied to Dataset3.

Page 23: Algorithm of NGS Data

Apply the Algorithms

• Monotone Value (MOV)

• Sort the sequences by chromosome and location.

• Encode the absolute address.

Page 24: Algorithm of NGS Data

Apply the Algorithms

• Huffman codes

• Focuses on the “relative” start position.

• The Huffman tree has to be stored for decompression.

Page 25: Algorithm of NGS Data

Comments for encoding location

• REG is suitable for all three datasets.

• For Dataset1, use the unique chromosome locations and count their frequencies for coding. REG is an ideal solution for highly repetitive datasets.

• Huffman coding is not good for Dataset1.

Page 26: Algorithm of NGS Data

Encoding Mismatch Information

• Each read may contain 1 or 2 mismatches, each with its nucleotide value.

• One line records the mismatch information of each read. If there is no mismatch, the line is left blank.

Page 27: Algorithm of NGS Data

Mismatches of Dataset2

Calculate the position from the end of the read.

If the mismatch is at position 23:

From the start it is 22: 10110

From the end it is 2: 10
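
The saving can be sketched in a few lines, assuming the position 23 is 1-based (so its offset from the start is 22) and using the 25 bp read length of Dataset2. The function name is an illustrative assumption.

```python
def position_bits(read_length, mismatch_position):
    """Return the binary offsets from the start and from the end of the read.

    mismatch_position is treated as 1-based.
    """
    from_start = mismatch_position - 1          # 0-based offset from the start
    from_end = read_length - mismatch_position  # bases left after the mismatch
    return format(from_start, "b"), format(from_end, "b")


# Dataset2 reads are 25 bp long; the example mismatch is at position 23.
print(position_bits(25, 23))  # ('10110', '10')
```

Counting from the end pays off when mismatches concentrate near the end of the reads, because those offsets are small numbers.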


Page 28: Algorithm of NGS Data

Nucleotide Substitution

• Using numbers instead of characters.

ASCII codes (7 bits):

A: 65 = 1000001
C: 67 = 1000011
G: 71 = 1000111
T: 84 = 1010100

2-bit codes:

A: 00  C: 01  G: 10  T: 11
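
A minimal sketch of the 2-bit mapping, with illustrative function names that are not from the talk:

```python
CODES = {"A": "00", "C": "01", "G": "10", "T": "11"}
BASES = {bits: base for base, bits in CODES.items()}


def encode_bases(sequence):
    """Replace every nucleotide with its 2-bit code."""
    return "".join(CODES[base] for base in sequence)


def decode_bases(bits):
    """Read the bit string two bits at a time and map back to nucleotides."""
    return "".join(BASES[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(encode_bases("GATTACA"))  # 10001111000100
print(format(ord("G"), "b"))    # 1000111, the 7-bit ASCII form
```

Each base costs 2 bits instead of the 7 bits of its ASCII code.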

Page 29: Algorithm of NGS Data

Combining Location and Mismatch

19G

30A

34T

Count the frequencies and code the location and mismatch together.

19G: 00001010110 (11 bits)

19G: 10110 (5 bits)

Page 30: Algorithm of NGS Data

Final Encoding

• Dataset1: Mismatches take up most of the space, because the dataset is already sorted.

• Dataset2: Locations are sparse, so they take up much of the storage.

• Dataset3: This dataset is balanced, because it has full coverage of the genome.

Page 31: Algorithm of NGS Data

Implementation

• Based on REG Indexed for the location information and combined encoding for the mismatch information.

• Pass 1: Count the mismatches.

• Pass 2: Perform the actual encoding.
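
The two passes can be sketched as below. This is only an illustration: the combined symbols (offset plus base, such as `19G`), the rank-based Elias Gamma code table, and all function names are assumptions, since the slides do not say which code the actual implementation assigns in pass 2.

```python
from collections import Counter


def elias_gamma(n):
    """Elias gamma code of a positive integer as a bit string."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def build_code_table(mismatches):
    """Pass 1: count each combined symbol and give frequent ones short codes."""
    counts = Counter(mismatches)
    # Most frequent first; ties are broken alphabetically to stay deterministic.
    ranked = sorted(counts, key=lambda sym: (-counts[sym], sym))
    return {sym: elias_gamma(rank) for rank, sym in enumerate(ranked, start=1)}


def encode(mismatches, table):
    """Pass 2: replace every symbol with its code."""
    return "".join(table[sym] for sym in mismatches)


# Hypothetical mismatch column: position and substituted base combined.
mismatches = ["19G", "30A", "19G", "34T", "19G", "30A"]
table = build_code_table(mismatches)
print(table)
print(encode(mismatches, table))
```

The code table has to be stored next to the encoded stream so that the decoder can invert pass 2.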


Page 32: Algorithm of NGS Data

Result

[Bar chart, Dataset1: file size in bytes for Original, Best Compression, GenCompress, gzip, bzip2, and 7zip. Data labels in extraction order: 30,651,664; 42,233,336; 41,378,624; 56,166,419; 56,078,940; 1,030,333,440.]

Page 33: Algorithm of NGS Data

Result

[Bar chart, Dataset2: file size in bytes for Original, Best Compression, GenCompress, gzip, bzip2, and 7zip. Data labels in extraction order: 83,319,584; 94,030,320; 95,688,992; 36,099,480; 35,983,322; 353,181,920.]

Page 34: Algorithm of NGS Data

Result

[Bar chart, Dataset3: file size in bytes for Original, Best Compression, GenCompress, gzip, bzip2, and 7zip. Data labels in extraction order: 411,811,520; 955,061,616; 618,818,824; 390,541,330; 390,541,330; 8,869,613,392.]

Page 35: Algorithm of NGS Data

Conclusion

• Any genome sequence can be used for mapping the reads.

• In terms of time consumption, GenCompress is worth using.

Page 36: Algorithm of NGS Data

Compression Time

[Bar chart: compression time in seconds on Dataset1, Dataset2, and Dataset3 for GenCompress, gzip, bzip2, and 7zip. Data labels in extraction order: 447, 77, 107, 422, 20, 78, 70, 13, 10, 111, 5, 20.]

Page 37: Algorithm of NGS Data

Decompression Time

[Bar chart: decompression time in seconds on Dataset1, Dataset2, and Dataset3 for GenCompress, gzip, bzip2, and 7zip. Data labels in extraction order: 21, 2, 4, 53, 4, 7, 13, 1, 2, 15, 1, 2.]

Page 38: Algorithm of NGS Data

Conclusion

• Hard drives are not expensive; the cost is the bandwidth.

• Quality scores are not considered.

• Read identifiers are also important.

• Mismatches may be contaminants or de novo, or the reference sequence may be unfinished.

• Only the best match is considered.

Page 39: Algorithm of NGS Data

Conclusion

• Huffman tree in Dataset1 and Dataset2.

Page 40: Algorithm of NGS Data

My Comments

• They should release the source code.

• Hardware configuration: why RAID1?

Page 41: Algorithm of NGS Data

Thanks for your attention!
