A Proposed Solution to the Short Read Reassembly Problem Carl Ebeling and Corey Olson

Upload: logan-farrell

Post on 26-Mar-2015


Page 1: A Proposed Solution to the Short Read Reassembly Problem Carl Ebeling and Corey Olson

A Proposed Solution to the Short Read Reassembly Problem

Carl Ebeling and Corey Olson

Page 2

Outline

Background
Indexing Solution
Architecture

Page 3

Motivation

Solexa/Illumina and SOLiD: ~billions of base pairs in hours
100s of millions of short reads (30-70 bp) read in parallel
Computational cost rising
Needed: hardware solution to improve speed and usability

Page 4

Background

Goal: quickly align millions of reads to the reference genome
Read errors and SNPs prevent simple indexing
Solutions:
  Brute force comparison of all reads to reference
  Index-based using seeds
  Burrows-Wheeler Transform

Page 5

Index Based Solution

Reference Index Table (RIT)
  Maps all seeds to positions in the reference
Read Position Table (RPT)
  Maps reads to regions in the reference for comparison
Smith Waterman Comparison
  Stream reference genome into SW units for scoring of reads

Page 6

RIT Creation

[Figure: RIT creation. The mask 11101101011 is applied to the reference text (shown as CAT_GC_TGAT) to produce the seed CATGCTAT, which addresses a row of the RIT. Rows are shown for the seeds CATGCTAA, CATGCTAC, CATGCTAG, CATGCTAT and CATGCCGG; the individual table values are not recoverable from the transcript.]

Note: first column is number of entries
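The masking step can be sketched in a few lines of Python. This is only an illustration: the mask and the seed CATGCTAT come from the slides, the 11-base window is taken from the read string shown on the RPT Creation slide, and the 2-bit base encoding used to turn a seed into a table address is an assumption.

```python
# Illustrative sketch; the 2-bit base encoding is an assumption, not from the slides.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def apply_mask(window, mask):
    """Keep the bases of `window` that line up with a '1' in `mask`."""
    if len(window) != len(mask):
        raise ValueError("window and mask must have the same length")
    return "".join(base for base, bit in zip(window, mask) if bit == "1")

def seed_address(seed):
    """Pack a seed into an integer address, two bits per base."""
    address = 0
    for base in seed:
        address = address * 4 + BASE_CODE[base]
    return address

mask = "11101101011"      # mask from the slide: 11 positions, 8 ones
window = "CATTGCGTAAT"    # 11-base window from the slide's example read
seed = apply_mask(window, mask)
print(seed, seed_address(seed))
```

With k bases in the seed, the address always falls below 4^k, which matches the 2^2k table locations quoted later in the deck.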

Page 7

RPT Creation

[Figure: RPT creation. Read 23 (ATACATTGCGTAATCG, base positions 0-15) is masked with 11101101011, giving CAT_GC_T_AT and the seed CATGCTAT. The seed is looked up in the RIT (rows shown for CATGCTAA, CATGCTAC, CATGCTAG, CATGCTAT and CATGCCGG), and read ID 23 is written into the RPT buckets for the matching reference regions. RPT buckets shown: 0:31, 32:63, 64:95, 96:127, 128:159. The individual table values are not recoverable from the transcript.]

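Page 8

Read Scoring

[Figure: read scoring. Reads listed in an RPT bucket (buckets shown: 0:31, 32:63, 64:95, 96:127, 128:159) are loaded into an SW unit and scored against the reference. The example shows read #6 and the sequence TAGTGTGATCGAA; the individual table values are not recoverable from the transcript.]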

Page 9

Buckets

Buckets combine hits for a read along the reference
Reduces number of SW units required
Optimal bucket length unknown
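A minimal sketch of how bucketing collapses nearby hits, assuming fixed-size buckets addressed by integer division. The bucket size of 32 mirrors the 0:31, 32:63, ... ranges drawn in the figures; the hit positions are made up for illustration.

```python
def bucket_index(position, bucket_size):
    """Bucket that a reference position falls into."""
    return position // bucket_size

def buckets_for_hits(hit_positions, bucket_size):
    """Distinct buckets touched by one read's seed hits, in order."""
    return sorted({bucket_index(p, bucket_size) for p in hit_positions})

hits = [70, 75, 90, 200]            # hypothetical seed hits for one read
print(buckets_for_hits(hits, 32))   # three of the hits share bucket 64:95
```

Four hits become two bucket entries here, so the read needs to be scored in two regions rather than four.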

Page 10

Entries Per Location in RIT

N = number of base pairs in reference genome
k = characters in the seed (#1s in the mask)

Note: Each entry in RIT ~ 4 Bytes, 2^2k total locations, N entries

N=2^31, k=11: RIT = 2^31*2^2 bytes = 8GB
N=2^32, k=14: RIT = 2^32*2^2 bytes = 16GB
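The arithmetic can be checked with a short sketch. The slide's own formula for entries per location did not survive in the transcript; the average used below (N entries spread over 4^k locations) is an inference from the note above, not a quotation. The size estimate counts only the N position entries, as the slide does.

```python
GIB = 2**30

def rit_size_bytes(n_ref, bytes_per_entry=4):
    """Storage for one entry per reference position."""
    return n_ref * bytes_per_entry

def avg_entries_per_location(n_ref, k):
    """Assumed average: N entries spread evenly over 4^k seed locations."""
    return n_ref / 4**k

for n_ref, k in [(2**31, 11), (2**32, 14)]:
    print(rit_size_bytes(n_ref) // GIB, "GB,",
          avg_entries_per_location(n_ref, k), "entries per location")
```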

Page 11

Entries in RPT

R = number of reads

Seff = effective number of seeds per read

Ex: R=2^27, Seff=2: 2^20 * 2048 * 4 = 8GB
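The formula behind this example is missing from the transcript. One reading that reproduces the numbers, offered as an assumption only, is that each effective seed of each read hits on average N/4^k reference positions, with N=2^31 and k=14 as on the Performance Estimates slide. The bucket size of 2048 is likewise inferred from the 2^20 * 2048 factorization and is not stated in the slides.

```python
GIB = 2**30

def rpt_entries(num_reads, seeds_per_read, n_ref, k):
    """Assumed model: every seed hits N / 4^k reference positions on average."""
    return num_reads * seeds_per_read * n_ref // 4**k

n_ref, k = 2**31, 14                 # assumed, from the performance slide
entries = rpt_entries(2**27, 2, n_ref, k)

bucket_size = 2048                   # assumed bucket length in bases
num_buckets = n_ref // bucket_size
per_bucket = entries // num_buckets

print(num_buckets, per_bucket, entries * 4 // GIB)
```

Under these assumptions the table has 2^20 buckets of 2048 entries each, which at 4 bytes per entry is the 8GB on the slide.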

Page 12

Entries per Bucket

b = bucket size

Note: this determines the number of SW units required

Page 13

Architecture

Memory Required: 8 GB for RIT, 8 GB for RPT
Creation of RIT and RPT is random access
Access time can be masked with buffering and multiple memory banks
High bandwidth communication required between FPGAs

Page 14

RIT Creation Algorithm

1. Move to the next reference character
2. Generate the next seed with the mask
3. Using seed as address, open DRAM row
   a) Read current array length
   b) Increment array length and write back
   c) Write reference position to array[length]
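The steps above can be sketched in software, with a dictionary of lists standing in for the DRAM rows. This is a sketch under stated assumptions, not the hardware design: appending to a list plays the role of the read-length, increment and write steps, and the toy reference simply reuses the read string from the RPT Creation slide.

```python
from collections import defaultdict

def apply_mask(window, mask):
    """Keep the bases of `window` that line up with a '1' in `mask`."""
    return "".join(base for base, bit in zip(window, mask) if bit == "1")

def build_rit(reference, mask):
    """Map every masked seed to the reference positions where it starts."""
    rit = defaultdict(list)
    span = len(mask)
    for pos in range(len(reference) - span + 1):     # step 1: next character
        seed = apply_mask(reference[pos:pos + span], mask)  # step 2: next seed
        rit[seed].append(pos)                        # step 3: grow the seed's array
    return rit

toy_reference = "ATACATTGCGTAATCG"   # toy data, not a real reference genome
rit = build_rit(toy_reference, "11101101011")
print(rit["CATGCTAT"])
```

The length of each list corresponds to the entry count that the figure shows in the first column of the RIT.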

Page 15

Memory Distribution

[Figure: memory distribution. RIT memory modules are labeled by two-base seed prefix (AA.., AC.., AG.., AT.., CA.., CC.., CG.., CT.., ..., TA.., TC.., TG.., TT..). RPT memory modules are labeled part 0 through part n-1.]

RIT Distributed by Seed

RPT Buckets Partitioned across memory modules by reads
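One way to picture the RIT distribution is to pick the memory module from the first two bases of the seed. The sketch below assumes 16 modules ordered AA, AC, AG, AT, CA, ... TT as in the figure; the function name and the module numbering are illustrative and not taken from the slides.

```python
BASES = "ACGT"

def rit_module_for_seed(seed):
    """Module number (0-15) chosen by the seed's two-base prefix."""
    first, second = seed[0], seed[1]
    return BASES.index(first) * 4 + BASES.index(second)

for seed in ["AACCGGTT", "CATGCTAT", "TTGGTACG"]:
    print(seed, "-> module", rit_module_for_seed(seed))
```

Routing by prefix means a seed lookup touches exactly one module, so lookups for different prefixes can proceed in parallel.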

Page 16

RPT Creation Algorithm

1. Clear the bucket set P in the FPGA assigned to the read
2. For each seed in the read
   a) Using seed as address, read all reference positions from RIT
   b) Add the current read to the bucket associated with each position
3. After all seeds in read, for each bucket in P
   a) Using the reference position as address, read the current array length
   b) Increment the array length and write back
   c) Write the read ID to array[length]
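A software sketch of this algorithm follows, again with Python containers standing in for the memories. Several details are assumptions: every window of the read is used as a seed (the slides speak of an effective number of seeds per read without saying how they are chosen), a position maps to its bucket by integer division, and the RIT contents are toy data.

```python
from collections import defaultdict

def apply_mask(window, mask):
    """Keep the bases of `window` that line up with a '1' in `mask`."""
    return "".join(base for base, bit in zip(window, mask) if bit == "1")

def build_rpt(reads, rit, mask, bucket_size):
    """Map each bucket number to the IDs of reads with a seed hit in it."""
    rpt = defaultdict(list)
    span = len(mask)
    for read_id, read in reads.items():
        buckets = set()                              # step 1: clear bucket set P
        for off in range(len(read) - span + 1):      # step 2: each seed
            seed = apply_mask(read[off:off + span], mask)
            for pos in rit.get(seed, ()):            # 2a: positions from the RIT
                buckets.add(pos // bucket_size)      # 2b: note the bucket
        for bucket in sorted(buckets):               # step 3: one entry per bucket
            rpt[bucket].append(read_id)
    return rpt

toy_rit = {"CATGCTAT": [65, 113]}                    # toy RIT contents
reads = {23: "ATACATTGCGTAATCG"}                     # read from the slide
rpt = build_rpt(reads, toy_rit, "11101101011", 32)
print(dict(rpt))
```

Collecting buckets in a set before writing is what keeps a read from being entered twice in a bucket when several of its seeds land close together.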

Page 17

Reassembly Process with Architecture

Reference streamed from host source
Reads loaded from RPT into SW units at start comparison point
Max score and location for each read recorded by SW unit at end comparison point
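What one SW unit computes can be sketched as a Smith-Waterman local alignment that consumes the reference one character at a time and keeps the best score and where it ended. The scoring values (match +2, mismatch -1, gap -1) are assumptions; the slides do not give the scoring scheme.

```python
def smith_waterman(read, reference, match=2, mismatch=-1, gap=-1):
    """Return (best local score, reference index where that alignment ends)."""
    prev = [0] * (len(read) + 1)
    best_score, best_end = 0, -1
    for ref_idx, ref_base in enumerate(reference):   # reference is streamed in
        cur = [0] * (len(read) + 1)
        for i, read_base in enumerate(read, start=1):
            diag = prev[i - 1] + (match if read_base == ref_base else mismatch)
            cur[i] = max(0, diag, prev[i] + gap, cur[i - 1] + gap)
            if cur[i] > best_score:                  # record max score, location
                best_score, best_end = cur[i], ref_idx
        prev = cur
    return best_score, best_end

print(smith_waterman("ACGT", "TTACGTTT"))
```

Only the previous row of scores is kept, which is the property that lets the reference stream past the unit without being stored.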

Page 18

Active SW Units at one time

Lr = Read Length

e = error window size

Page 19

Performance Estimates

Construction of RIT = 16 seconds
  Assuming 128MHz and processing 1 reference character per clock
Construction of RPT = 10 minutes
  Assuming R=130M, Lr=64, N=2^31, k=14, 4 FPGAs
Reassembly Phase = 16 seconds
  Assuming 128MHz, N=2^31
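The two 16-second figures follow from streaming N characters at one character per clock, as this small check shows. Treating 128MHz as 2^27 Hz gives exactly 16 seconds, while 128 * 10^6 Hz gives slightly more; which reading the authors intended is not stated. The 10-minute RPT estimate depends on memory behavior the slides do not detail, so it is not reproduced here.

```python
def stream_seconds(n_chars, clock_hz, chars_per_clock=1):
    """Time to stream n_chars through a unit at the given clock rate."""
    return n_chars / (clock_hz * chars_per_clock)

n_ref = 2**31
print(stream_seconds(n_ref, 128 * 2**20))   # 128MHz read as 2^27 Hz
print(stream_seconds(n_ref, 128e6))         # 128MHz read as 128 * 10^6 Hz
```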