A Proposed Solution to the Short Read Reassembly Problem
Carl Ebeling and Corey Olson
Outline
- Background
- Indexing Solution
- Architecture
Motivation
- Solexa/Illumina and SOLiD: ~billions of base pairs in hours
- 100s of millions of short reads (30-70 bp) read in parallel
- Computational cost rising
- Needed: hardware solution to improve speed and usability
Background
- Goal: quickly align millions of reads to the reference genome
- Read errors and SNPs prevent simple indexing
- Solutions
  - Brute-force comparison of all reads to reference
  - Index-based using seeds
  - Burrows-Wheeler Transform
Index Based Solution
- Reference Index Table (RIT)
  - Maps all seeds to positions in the reference
- Read Position Table (RPT)
  - Maps reads to regions in the reference for comparison
- Smith-Waterman Comparison
  - Stream reference genome into SW units for scoring of reads
RIT Creation
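[Figure: the mask 11101101011 is applied to a window of the reference to form the seed CATGCTAT, which addresses one row of the RIT. Rows are shown for seeds CATGCTAA, CATGCTAC, CATGCTAG, CATGCTAT and CATGCCGG, each holding a list of reference positions.]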
Note: first column is number of entries
RPT Creation
[Figure: Read 23 (ATACATTGCGTAATCG, positions 0-15) is masked with 11101101011 to give CAT_GC_T_AT, i.e. seed CATGCTAT. The seed is looked up in the RIT (rows for CATGCTAA, CATGCTAC, CATGCTAG, CATGCTAT, CATGCCGG), and read ID 23 is written into the RPT buckets covering the matching reference positions. RPT buckets shown: 0:31, 32:63, 64:95, 96:127, 128:159.]
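The masking step in the figure can be sketched in a few lines of Python. This is only an illustration: the function names `apply_mask` and `seeds_of` are hypothetical, and sliding the mask over every offset of the read is an assumption (the slide shows a single window).

```python
def apply_mask(window: str, mask: str) -> str:
    """Keep the characters of `window` at positions where `mask` has a '1'."""
    if len(window) != len(mask):
        raise ValueError("window and mask must have the same length")
    return "".join(base for base, bit in zip(window, mask) if bit == "1")


def seeds_of(read: str, mask: str):
    """Yield (offset, seed) for every full-length window of the read.

    Assumption: every offset is used; the slides do not say which are kept.
    """
    span = len(mask)
    for offset in range(len(read) - span + 1):
        yield offset, apply_mask(read[offset:offset + span], mask)


MASK = "11101101011"           # mask from the slide (eight 1s, so k = 8)
READ = "ATACATTGCGTAATCG"      # read 23 from the slide
seeds = dict(seeds_of(READ, MASK))
print(seeds[3])                # window CATTGCGTAAT masks to CATGCTAT
```

At offset 3 the window is CATTGCGTAAT, and dropping the three masked positions leaves the seed CATGCTAT used in the figure.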
Read Scoring
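[Figure: an SW unit scoring a read against the streamed reference. Labels recovered: SW Unit, sequence TAGTGTGATCGAA, Read #6, and RPT buckets 0:31, 32:63, 64:95, 96:127, 128:159 holding read IDs.]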
Buckets
- Buckets combine hits for a read along the reference
- Reduces number of SW units required
- Optimal bucket length unknown
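A rough sketch of how hits might be combined into buckets. It assumes fixed-width buckets of 32 positions, matching the ranges 0:31, 32:63, ... drawn in the figures; the hit positions and the names `bucket_of` and `buckets_for_hits` are made up for illustration.

```python
def bucket_of(position: int, bucket_size: int) -> int:
    """Index of the fixed-width bucket that contains a reference position."""
    return position // bucket_size


def buckets_for_hits(hit_positions, bucket_size):
    """Collapse a read's seed hits into the distinct buckets they fall in."""
    return sorted({bucket_of(p, bucket_size) for p in hit_positions})


# Hypothetical hit positions for one read (not taken from the slides).
hits = [36, 40, 65, 113, 219]
print(buckets_for_hits(hits, bucket_size=32))   # [1, 2, 3, 6]
```

Two hits that land in the same bucket (36 and 40 here) cost one comparison instead of two, which is how bucketing reduces the number of SW units needed.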
Entries Per Location in RIT
N = number of base pairs in reference genome
k = characters in the seed (#1s in the mask)
Note: each entry in RIT ~ 4 bytes, 2^(2k) total locations, N entries
N=2^31, k=11: RIT = 2^31 * 2^2 bytes = 8 GB
N=2^32, k=14: RIT = 2^32 * 2^2 bytes = 16 GB
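The sizing arithmetic can be checked with a short script. The slide's own formula for entries per location did not survive in this transcript, so the average below (N entries spread over 2^(2k) locations) is an assumption that seeds are uniformly distributed; the function names are hypothetical, and N is taken as 2^31 and 2^32 as the slide's arithmetic implies.

```python
def rit_size_bytes(n_ref: int, bytes_per_entry: int = 4) -> int:
    """Total RIT payload: one entry per reference position."""
    return n_ref * bytes_per_entry


def rit_locations(k: int) -> int:
    """Number of distinct seeds over a 4-letter alphabet: 4^k = 2^(2k)."""
    return 4 ** k


def mean_entries_per_location(n_ref: int, k: int) -> float:
    """Average row length, assuming seeds are uniformly distributed."""
    return n_ref / rit_locations(k)


GIB = 2 ** 30
print(rit_size_bytes(2 ** 31) // GIB)             # 8
print(rit_size_bytes(2 ** 32) // GIB)             # 16
print(mean_entries_per_location(2 ** 31, 11))     # 512.0
print(mean_entries_per_location(2 ** 32, 14))     # 16.0
```

Real genomes are far from uniform, so individual rows can be much longer than this average.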
Entries in RPT
R = number of reads
Seff = effective number of seeds per read
Ex: R=2^27, Seff=2: 2^20 * 2048 * 4 = 8GB
Entries per Bucket
b = bucket size
Note: this determines the number of SW units required
Architecture
- Memory Required: 8 GB for RIT, 8 GB for RPT
- Creation of RIT and RPT is random access
  - Access time can be masked with buffering and multiple memory banks
- High bandwidth communication required between FPGAs
RIT Creation Algorithm
1. Move to the next reference character
2. Generate the next seed with the mask
3. Using seed as address, open DRAM row
   a) Read current array length
   b) Increment array length and write back
   c) Write reference position to array[length]
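The steps above can be sketched in software as follows. This is a model of the algorithm, not of the hardware: a Python dict of lists stands in for the DRAM rows, `list.append` plays the role of steps 3a-3c, and recording the window's start position is an assumption. The name `build_rit` is hypothetical.

```python
from collections import defaultdict


def build_rit(reference: str, mask: str) -> dict:
    """Map each masked seed to the reference positions where it occurs."""
    span = len(mask)
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    rit = defaultdict(list)
    # Step 1: advance one reference character at a time.
    for pos in range(len(reference) - span + 1):
        # Step 2: generate the seed by keeping only the unmasked characters.
        seed = "".join(reference[pos + i] for i in keep)
        # Step 3: the seed addresses a row; append covers reading the
        # length, incrementing it, and writing the position at the end.
        rit[seed].append(pos)
    return rit


rit = build_rit("ACGTACGTACGT", "101")
print(rit["AG"])   # [0, 4, 8]
```

In this toy reference the seed AG (A, any base, G) occurs at positions 0, 4 and 8, so its row holds three entries.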
Memory Distribution
[Figure: two memory layouts.
RIT Distributed by Seed: RIT rows are spread across memory modules by seed prefix (AA.., AC.., AG.., AT.., CA.., CC.., CG.., CT.., ..., TA.., TC.., TG.., TT..).
RPT Buckets Partitioned across memory modules by reads: RPT part 0 through part n-1.]
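One possible way to express the two mappings in code. The 16-way split by the first two bases follows the AA.. to TT.. labels in the figure; the modulo rule for assigning reads to RPT partitions is purely an assumption, since the slide only says the RPT is partitioned by reads. Both function names are hypothetical.

```python
BASES = "ACGT"


def rit_module(seed: str, prefix_len: int = 2) -> int:
    """Memory module holding a seed's RIT row, chosen by the seed prefix."""
    module = 0
    for base in seed[:prefix_len]:
        module = module * 4 + BASES.index(base)
    return module


def rpt_partition(read_id: int, num_partitions: int) -> int:
    """RPT partition for a read (assumed round-robin by read ID)."""
    return read_id % num_partitions


print(rit_module("CATGCTAT"))   # 4  (prefix CA)
print(rpt_partition(23, 8))     # 7
```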
RPT Creation Algorithm
1. Clear the bucket set P in the FPGA assigned to the read
2. For each seed in the read
   a) Using seed as address, read all reference positions from RIT
   b) Add the current read to the bucket associated with each position
3. After all seeds in read, for each bucket in P
   a) Using the reference position as address, read the current array length
   b) Increment the array length and write back
   c) Write the read ID to array[length]
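A software sketch of the same procedure, under several assumptions: buckets are fixed windows of 32 reference positions, the bucket is chosen from the raw seed hit position (not adjusted by the seed's offset within the read), and dicts of lists stand in for the memory arrays. The name `add_read_to_rpt` and the sample RIT contents are hypothetical.

```python
from collections import defaultdict


def add_read_to_rpt(read_id, read, mask, rit, rpt, bucket_size=32):
    """Record `read_id` in every RPT bucket that one of its seeds hits."""
    span = len(mask)
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    buckets = set()                                  # step 1: clear set P
    for offset in range(len(read) - span + 1):       # step 2: each seed
        seed = "".join(read[offset + i] for i in keep)
        for ref_pos in rit.get(seed, ()):            # 2a: positions from RIT
            buckets.add(ref_pos // bucket_size)      # 2b: add to bucket set
    for bucket in sorted(buckets):                   # step 3: flush P
        rpt[bucket].append(read_id)                  # 3a-3c: append read ID
    return buckets


# Hypothetical RIT row: the seed from the figure occurring at three positions.
rit = {"CATGCTAT": [65, 70, 219]}
rpt = defaultdict(list)
add_read_to_rpt(23, "ATACATTGCGTAATCG", "11101101011", rit, rpt)
print(dict(rpt))   # {2: [23], 6: [23]}
```

Collecting buckets in a set first means a read is written once per bucket even when several of its seeds hit the same region (positions 65 and 70 both fall in bucket 2).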
Reassembly Process with Architecture
- Reference streamed from host source
- Reads loaded from RPT into SW units at start comparison point
- Max score and location for each read recorded by SW unit at end comparison point
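What a single SW unit computes can be sketched as a plain Smith-Waterman local alignment that consumes the reference one character at a time and tracks the best score and where it ends. The scoring values (match 2, mismatch -1, linear gap -1) are assumptions, as the slides do not give them, and the function name is hypothetical.

```python
def smith_waterman(read, reference, match=2, mismatch=-1, gap=-1):
    """Return (best local alignment score, reference index where it ends)."""
    best_score, best_end = 0, -1
    prev = [0] * (len(read) + 1)          # scores for the previous ref base
    for j, ref_base in enumerate(reference):      # reference is streamed in
        curr = [0] * (len(read) + 1)
        for i, read_base in enumerate(read, start=1):
            diag = prev[i - 1] + (match if read_base == ref_base else mismatch)
            # Local alignment: scores never drop below zero.
            curr[i] = max(0, diag, prev[i] + gap, curr[i - 1] + gap)
            if curr[i] > best_score:
                best_score, best_end = curr[i], j
        prev = curr
    return best_score, best_end


print(smith_waterman("GATC", "TAGTGTGATCGAA"))   # (8, 9)
```

Only one previous column is kept, which mirrors the streaming setting: the unit never needs the whole reference, just the current character and its running scores.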
Active SW Units at one time
Lr = Read Length
e = error window size
Performance Estimates
- Construction of RIT = 16 seconds
  - Assuming 128MHz and process 1 reference character per clock
- Construction of RPT = 10 minutes
  - Assuming R=130M, LR=64, N=2^31, k=14, 4 FPGAs
- Reassembly Phase = 16 seconds
  - Assuming 128MHz, N=2^31
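The two 16-second figures can be reproduced with simple arithmetic, assuming the time is just the reference length divided by the clock rate. The exact value of 16 s appears to follow if 128 MHz is read as 2^27 Hz; with 128 x 10^6 Hz the result is slightly higher. The RPT estimate depends on details not given in the transcript and is not modelled here.

```python
def streaming_seconds(n_chars: int, clock_hz: float, chars_per_clock: int = 1) -> float:
    """Time to stream n_chars through a unit that handles chars_per_clock per cycle."""
    return n_chars / (clock_hz * chars_per_clock)


N = 2 ** 31
print(streaming_seconds(N, 2 ** 27))    # 16.0
print(streaming_seconds(N, 128e6))      # about 16.8
```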