Accelerating Genomic Sequence Alignment Workload with Scalable Vector Architecture
University of Michigan, Ann Arbor
Dong-hyeon Park, Jon Beaumont, Trevor Mudge
Genomics
Human Genome: 3.2 billion base pairs
Need to sample at 30-50x coverage
Past (“Human Genome Project”, 2004): weeks, ~$3 billion
Present (“SpeedSeq”, 2015): ~13 hours, <$10,000
Future Applications: DNA Storage, Portable Sequencer, Forensics, On-Demand Diagnosis
source: National Institutes of Health
Whole Genome Sequencing Pipeline
Read/Extract Sequences
• Reading fragment samples of whole genome
• Signal/Image processing
Sequence Alignment
• Matching overlaps across multiple sequences
• Dynamic vs heuristic algorithm
Assembly
• Reconstructing the original sequence
• de-novo vs mapping assembly
Analysis
• Identifying gene variants and abnormalities
• Pattern matching, HMM, DNN
Figure labels: reference gene; reconstructed sequence; 10-20k in length
Target Architecture:
Scalable Vector Extension (SVE)
ARM’s Scalable Vector Extension (SVE)
• Designed to complement existing SIMD architecture (NEON)
• Key Features:
  • Scalable Vector Length (128 - 2048 bits)
  • Per-lane Predication (32 SIMD Reg. + 16 Predicate Reg.)
  • Gather-load and scatter-store
  • Horizontal vector operations
Vector Length Agnostic Code
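A minimal sketch of what vector-length agnostic code means, written as a plain Python simulation rather than real SVE code. The helper names `whilelt` and `vla_add` are hypothetical; `whilelt` only imitates the idea of a loop predicate that switches off the lanes running past the end of the data, so the same loop gives the same result for any lane count.

```python
def whilelt(i, n, lanes):
    """Predicate: lane k is active while i + k < n (imitates a 'while less than' predicate)."""
    return [i + k < n for k in range(lanes)]


def vla_add(a, b, lanes):
    """Add two equal-length lists, `lanes` elements per step, masking the tail with a predicate."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)
        for k, active in enumerate(pred):
            if active:  # inactive lanes neither load nor store
                out[i + k] = a[i + k] + b[i + k]
        i += lanes      # the step size is the only thing that depends on the vector length
    return out


if __name__ == "__main__":
    a, b = list(range(10)), [1] * 10
    # 4 lanes ~ 128-bit vectors of 32-bit elements, 64 lanes ~ 2048-bit vectors
    for lanes in (4, 8, 16, 64):
        print(lanes, vla_add(a, b, lanes))
```

The loop body never mentions a fixed width, which is the property the slide refers to: the lane count can change without rewriting the code.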
ARM’s Scalable Vector Extension (SVE)
• Genomic sequences are sampled at different lengths depending on the device used for sampling:
  • Illumina HiSeq System: 30-300 bps
  • Sanger 3730xl: 400-900 bps
Vector-Length Agnostic Code can be used to Dynamically Choose the Optimal SIMD Width
Target Algorithm:
Smith-Waterman Sequence Alignment
Smith-Waterman Algorithm
Local sequence alignment algorithm developed in 1981
Inputs: Reference Sequence, Query Sequence
Smith-Waterman steps:
• Scoring Matrix Construction
• Matrix Backtracking
Output: Alignment Location & Score
Scoring Matrix Construction
Reference sequence along the columns, query sequence along the rows:
-- A C A C A A
-- 0 0 0 0 0 0 0
A 0
G 0
C 0
A 0
…
Scoring:
$$H(m,n) = \max \begin{cases} E(m,n) \\ F(m,n) \\ H(m-1,n-1) + S(a_m, b_n) \end{cases}$$
$$E(m,n) = \max \begin{cases} H(m,n-1) - g_o \\ E(m,n-1) - g_e \end{cases}$$
$$F(m,n) = \max \begin{cases} H(m-1,n) - g_o \\ F(m-1,n) - g_e \end{cases}$$
-- A C A C A A
-- 0 0 0 0 0 0 0
A 0 2 1 2 1 2 2
G 0 1 1 1 1 1 1
C 0 0 3 2 1
A 0 2 52
2
………………
… … …
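Reference sequence along the columns, query sequence along the rows.
Scoring Matrix Construction
Scoring (the H term is highlighted on this slide):
$$H(m,n) = \max \begin{cases} E(m,n) \\ F(m,n) \\ H(m-1,n-1) + S(a_m, b_n) \end{cases}$$
$$E(m,n) = \max \begin{cases} H(m,n-1) - g_o \\ E(m,n-1) - g_e \end{cases}$$
$$F(m,n) = \max \begin{cases} H(m-1,n) - g_o \\ F(m-1,n) - g_e \end{cases}$$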
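The recurrences for H, E and F can be sketched in a few lines of scalar Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `scoring_matrix` is hypothetical, the scoring values (match +2, mismatch -1, gap open and gap extend both 1) are assumed because the slides do not state them, and the floor of 0 is the usual Smith-Waterman local-alignment term, which the slide formula does not show.

```python
NEG = float("-inf")  # stands in for "no gap has been opened yet"


def scoring_matrix(query, ref, match=2, mismatch=-1, gap_open=1, gap_extend=1):
    """Fill H row by row; rows follow the query, columns follow the reference."""
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    E = [[NEG] * cols for _ in range(rows)]
    F = [[NEG] * cols for _ in range(rows)]
    for m in range(1, rows):
        for n in range(1, cols):
            # E: gap ending at (m, n), coming from the left neighbour
            E[m][n] = max(H[m][n - 1] - gap_open, E[m][n - 1] - gap_extend)
            # F: gap ending at (m, n), coming from the upper neighbour
            F[m][n] = max(H[m - 1][n] - gap_open, F[m - 1][n] - gap_extend)
            # S(a_m, b_n): assumed match/mismatch substitution score
            s = match if query[m - 1] == ref[n - 1] else mismatch
            H[m][n] = max(0, E[m][n], F[m][n], H[m - 1][n - 1] + s)
    return H


if __name__ == "__main__":
    for row in scoring_matrix("AGCAC", "ACACAA"):
        print(row)
```

With these assumed parameters the largest entry for query AGCAC against reference ACACAA is 7, which agrees with the alignment score quoted later in the slides.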
-- A C A C A A
-- 0 0 0 0 0 0 0
A 0 2 1 2 1 2 2
G 0 1 1 1 1 1 1
C 0 0 3 2 3 2 1
A 0 2 2 5 4 5 4
C 0 1 4 4 7 6 6
Backtracking: finds the best local alignment from the scoring matrix (reference sequence along the columns, query sequence along the rows).
Step 1. Locate the max entry.
Step 2. Traverse back through the largest score:
• Check the adjacent entries for the next largest score
• Move to the entry with the largest score and continue the path
Step 3. Get the resulting alignment

| Path Direction | Alignment |
|---|---|
| Horizontal | Deletion |
| Vertical | Insertion |
| Diagonal | Match |
Reference: A-CAC
Query: AGCAC
Insertion: G in the query, aligned to the gap in the reference
Alignment Score: 7
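The three backtracking steps can be sketched as follows. This is a simplified illustration, not the authors' code: `smith_waterman` is a hypothetical name, the scoring values (match +2, mismatch -1) are assumed, and a single linear gap penalty of 1 is used, which is equivalent to setting gap open and gap extend both to 1. Instead of comparing neighbouring scores directly, the traceback checks which predecessor actually produced each entry, which yields the same path on this example.

```python
def smith_waterman(query, ref, match=2, mismatch=-1, gap=1):
    """Return (score, aligned_ref, aligned_query) for the best local alignment."""
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for m in range(1, rows):
        for n in range(1, cols):
            s = match if query[m - 1] == ref[n - 1] else mismatch
            H[m][n] = max(0, H[m - 1][n - 1] + s, H[m - 1][n] - gap, H[m][n - 1] - gap)
            if H[m][n] > best:           # Step 1: remember the max entry
                best, best_pos = H[m][n], (m, n)

    m, n = best_pos
    out_ref, out_qry = [], []
    while m > 0 and n > 0 and H[m][n] > 0:   # Step 2: walk back until a 0 is reached
        s = match if query[m - 1] == ref[n - 1] else mismatch
        if H[m][n] == H[m - 1][n - 1] + s:   # diagonal: match or mismatch
            out_ref.append(ref[n - 1]); out_qry.append(query[m - 1])
            m, n = m - 1, n - 1
        elif H[m][n] == H[m - 1][n] - gap:   # vertical: insertion (gap in the reference)
            out_ref.append("-"); out_qry.append(query[m - 1])
            m -= 1
        else:                                # horizontal: deletion (gap in the query)
            out_ref.append(ref[n - 1]); out_qry.append("-")
            n -= 1
    # Step 3: the path was collected backwards, so reverse it
    return best, "".join(reversed(out_ref)), "".join(reversed(out_qry))


if __name__ == "__main__":
    print(smith_waterman("AGCAC", "ACACAA"))
```

For the slide example this returns a score of 7 with reference `A-CAC` aligned to query `AGCAC`.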
Vectorization
Inputs: Reference and Query sequences of length L
Output: Alignment Location & Score
Figure labels: vector lanes [0] [1] [2] [3]
Vectorization approaches:
• Wavefront: [2] Wozniak et al
• Sliced (striped): [3] Farrar
• Batch: [4] Rognes
Batch Smith-Waterman ([4] Rognes)
Reference Sequence; Query Sequences Sampled
[Figure: one scoring matrix per vector lane, each starting from a row of zeros.]
• SVE[0]: Query 0, Reference 0
• SVE[1]: Query 1, Reference 1
• …
• SVE[K]: Query K, Reference K
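A rough sketch of the batch idea, where every lane works on its own query/reference pair and all lanes step through their matrices in lockstep. The lanes are simulated with Python lists; `batch_sw_scores` is a hypothetical name, the scoring values are assumed (match +2, mismatch -1, linear gap penalty 1), and all pairs are assumed to share the same lengths so that no lane finishes early.

```python
def batch_sw_scores(queries, refs, match=2, mismatch=-1, gap=1):
    """Lane k aligns queries[k] against refs[k]; returns the best score of each lane."""
    K = len(queries)
    qlen, rlen = len(queries[0]), len(refs[0])
    assert len(refs) == K
    assert all(len(q) == qlen for q in queries) and all(len(r) == rlen for r in refs)

    prev = [[0] * K for _ in range(rlen + 1)]  # prev[n][k] = H(m-1, n) in lane k
    best = [0] * K
    for m in range(1, qlen + 1):
        cur = [[0] * K for _ in range(rlen + 1)]
        for n in range(1, rlen + 1):
            # Each list below plays the role of one vector register with K lanes.
            s = [match if queries[k][m - 1] == refs[k][n - 1] else mismatch
                 for k in range(K)]
            diag = [prev[n - 1][k] + s[k] for k in range(K)]
            up = [prev[n][k] - gap for k in range(K)]
            left = [cur[n - 1][k] - gap for k in range(K)]
            cur[n] = [max(0, d, u, l) for d, u, l in zip(diag, up, left)]
            best = [max(b, h) for b, h in zip(best, cur[n])]
        prev = cur
    return best


if __name__ == "__main__":
    print(batch_sw_scores(["AGCAC", "ACACA"], ["ACACAA", "ACACAA"]))
```

Because the lanes never exchange data there are no dependencies to resolve between them, but each lane needs its own matrix rows, so the working set grows with the number of lanes.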
Sliced Smith-Waterman ([3] Farrar, striped)
[Figure: scoring matrix with reference ACACAGTACA along the columns and the query (A, G, C, A, …) along the rows; the matrix is cut into slices assigned to lanes SVE[0], SVE[1], …, SVE[K-1], with VL = K.]
Step 1. Initial Calculation of H(m,n) from E, F and H:
$$H(m,n) = \max \begin{cases} E(m,n) \\ F(m,n) \\ H(m-1,n-1) + S(a_m, b_n) \end{cases}$$
Sliced Smith-Waterman ([3] Farrar, striped)
[Figure: the same sliced matrix after step 1.]
• Horizontal dependencies between slices are not accounted for
• The value of F needs to be recalculated
Sliced Smith-Waterman ([3] Farrar, striped)
[Figure: the same sliced matrix, with F propagated between slices.]
Step 2. Resolve Dependencies:
$$F(m,n) = \max \begin{cases} H(m-1,n) - g_o \\ F(m-1,n) - g_e \end{cases}$$
$$H(m,n) = \max\left(F(m,n),\, H(m,n)\right)$$
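The two-step structure above (first compute H without the F dependency, then repair it) can be illustrated with a scalar Python sketch. This is a simplification and not Farrar's striped data layout: the function name `two_pass_sw_score` is hypothetical, the scoring values are assumed (match +2, mismatch -1, gap open and extend both 1), and the repair pass here is a plain sequential sweep over one column.

```python
NEG = float("-inf")


def two_pass_sw_score(query, ref, match=2, mismatch=-1, gap_open=1, gap_extend=1):
    """Best local alignment score, computed one reference column at a time in two passes."""
    M = len(query)
    H_prev = [0] * (M + 1)    # H(m, n-1) for every m
    E_prev = [NEG] * (M + 1)  # E(m, n-1) for every m
    best = 0
    for n in range(1, len(ref) + 1):
        H = [0] * (M + 1)
        E = [NEG] * (M + 1)
        # Pass 1: every m only reads column n-1, so all rows could run as vector lanes.
        for m in range(1, M + 1):
            E[m] = max(H_prev[m] - gap_open, E_prev[m] - gap_extend)
            s = match if query[m - 1] == ref[n - 1] else mismatch
            H[m] = max(0, E[m], H_prev[m - 1] + s)
        # Pass 2: resolve the F dependency inside the column, then H = max(F, H).
        F = NEG
        for m in range(1, M + 1):
            F = max(H[m - 1] - gap_open, F - gap_extend)
            H[m] = max(H[m], F)
        best = max(best, max(H))
        H_prev, E_prev = H, E
    return best


if __name__ == "__main__":
    print(two_pass_sw_score("AGCAC", "ACACAA"))
```

The second pass is the extra work that the sliced approach pays for; the striped layout in [3] is designed so that this correction can often be cut short.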
Wavefront Smith-Waterman
All dependencies come from the previous execution.
[Figure: scoring matrix with reference ACACAGTACA along the columns and the query (A, G, C, A, …) along the rows, computed along anti-diagonals; iterations 0 and 1 shown, with each cell taking E, F and H from earlier iterations.]
Wavefront Smith-Waterman ([2] Wozniak et al)
All dependencies come from the previous execution.
[Figure: the same matrix with iterations 0 to 2 computed.]
Wavefront Smith-Waterman ([2] Wozniak et al)
All dependencies come from the previous execution.
[Figure: the same matrix with iterations 0 to 3 computed.]
Wavefront Smith-Waterman ([2] Wozniak et al)
All dependencies come from the previous execution.
[Figure: the same matrix with iterations 0 to 4 computed.]
More book-keeping overhead than other algorithms:
• Keep track of H values of two prev. iterations
• F and E values from prev. iteration
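A scalar sketch of the wavefront ordering, in which cells are visited one anti-diagonal at a time. It is an illustration under assumptions, not the authors' SVE code: `wavefront_sw_score` is a hypothetical name, the scoring values are assumed (match +2, mismatch -1, gap open and extend both 1), and full matrices are kept for readability where a real implementation would keep only the last two anti-diagonals of H and the last one of E and F.

```python
NEG = float("-inf")


def wavefront_sw_score(query, ref, match=2, mismatch=-1, gap_open=1, gap_extend=1):
    """Best local alignment score, computed anti-diagonal by anti-diagonal."""
    M, N = len(query), len(ref)
    H = [[0] * (N + 1) for _ in range(M + 1)]
    E = [[NEG] * (N + 1) for _ in range(M + 1)]
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    best = 0
    for d in range(2, M + N + 1):                    # d = m + n is one iteration
        # All cells with m + n == d are independent and could share a vector.
        for m in range(max(1, d - N), min(M, d - 1) + 1):
            n = d - m
            # E and F read H, E, F from anti-diagonal d - 1 ...
            E[m][n] = max(H[m][n - 1] - gap_open, E[m][n - 1] - gap_extend)
            F[m][n] = max(H[m - 1][n] - gap_open, F[m - 1][n] - gap_extend)
            # ... and the diagonal term reads H from anti-diagonal d - 2.
            s = match if query[m - 1] == ref[n - 1] else mismatch
            H[m][n] = max(0, E[m][n], F[m][n], H[m - 1][n - 1] + s)
            best = max(best, H[m][n])
    return best


if __name__ == "__main__":
    print(wavefront_sw_score("AGCAC", "ACACAA"))
```

The comments mark where the book-keeping listed above comes from: H is needed from two earlier iterations, E and F from one.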
Experimental Evaluation:
Smith-Waterman on gem5 w/ SVE
Experimental Setup
Gem5 Simulator w/ ARM SVE Simulation
Experimental Setup
Application: Smith-Waterman (Batch, Sliced, and Wavefront)
• Reference: 25-400 bps samples from E. coli 536 Gene (4.9 Mbps)
• Query: 1000 x 25-400 bps samples through WGSim
Advantage of SVE over Traditional System
• CPU, NEON implementation written in C. SVE hand-written in assembly.
• SVE outperforms both CPU and NEON implementations by at least 3x
• Batch, Sliced and Wavefront used 32-bit, 16-bit and 64-bit vectors respectively.

Alignment Time Speedup over Baseline CPU, by Sequence Length

| Implementation | 25 | 50 | 100 | 200 | 400 |
|---|---|---|---|---|---|
| cpu | 1 | 1 | 1 | 1 | 1 |
| batch_neon | 1.5 | 1.5 | 1.4 | 1.4 | 1.3 |
| sliced_neon | 0.2 | 0.4 | 0.8 | 1.4 | 2.1 |
| batch_sve_128bit | 10.8 | 9.7 | 8.2 | 7.8 | 7.5 |
| sliced_sve_128bit | 3.3 | 9.0 | 16.7 | 28.1 | 38.2 |
| wavefront_128bit | 2.6 | 6.9 | 5.6 | 5.2 | 5.6 |
Impact of Handwritten Assembly
• Hand-written assembly code of Wavefront Algorithm has 4-6x speedup over C code.

Speedup of Wavefront Algorithm (Speedup over CPU w/ Wavefront Alg.), by Sequence Length (bps)

| Implementation | L=25 | L=50 | L=100 | L=200 |
|---|---|---|---|---|
| CPU (naïve) | 1.85 | 1.25 | 1.59 | 1.69 |
| CPU (wav) | 1.00 | 1.00 | 1.00 | 1.00 |
| CPU-ASSEM | 3.82 | 5.77 | 5.09 | 5.38 |
| SVE-128bit | 4.73 | 8.64 | 8.98 | 9.24 |
| SVE-256bit | 6.15 | 11.91 | 14.34 | 15.95 |
Advantage of SVE over Traditional System
• SVE significantly reduces the number of instructions executed compared to CPU or NEON
[Figure: Instructions Executed Compared to Baseline CPU. x-axis: Sequence Length (25, 50, 100, 200, 400); y-axis: Instructions Executed Compared to CPU (0 to 2.5); series: cpu, batch_neon, batch_sve_128bit, sliced_neon, sliced_sve_128bit, wavefront_128bit.]
Memory Bandwidth Comparison
• Sliced and Wavefront significantly reduce the memory bandwidth compared to the Batch algorithm
[Figure: three panels, each plotting read_bw and write_bw as Mem. Controller BW (MB/s) against Sequence Length (L025, L050, L100, L200, L400) for SVE 512-bit, 64kB L1D:
• Memory Bandwidth of Batch S-W (y-axis 0 to 6000)
• Memory Bandwidth of Sliced S-W (y-axis 0 to 18)
• Memory Bandwidth of Wavefront S-W (y-axis 0 to 120)]
Vector Scaling of Different Algorithms

Vector Performance Scaling of Batch, Sliced, and Wavefront at Sequence Length of L=100 (Speedup over 128-bit SVE)

| Algorithm | 128-bit | 256-bit | 512-bit | 1024-bit |
|---|---|---|---|---|
| Batch | 1.0 | 1.0 | 1.0 | 1.0 |
| Sliced | 1.0 | 0.7 | 0.4 | 0.2 |
| Wavefront | 1.0 | 1.6 | 2.4 | 2.6 |

Vector Performance Scaling of Batch, Sliced and Wavefront at Sequence Length of L=400 (Speedup over 128-bit SVE)

| Algorithm | 128-bit | 256-bit | 512-bit | 1024-bit |
|---|---|---|---|---|
| Batch | 1.0 | 1.0 | 0.9 | 0.8 |
| Sliced | 1.0 | 1.4 | 1.3 | 0.3 |
| Wavefront | 1.0 | 1.8 | 3.0 | 3.3 |

• Batch and Sliced show marginal improvement with increasing vector length
  • Difficult to keep up with increased memory demand
  • Need to resolve dependencies.
Fixed HW Options: Batch vs Sliced vs Wavefront @512-bit
[Figure: Comparison of Different Smith-Waterman Vectorization, 512-bit SVE and 64kB L1D Cache. x-axis: Sequence Length (base pairs): 25, 50, 100, 200, 400; y-axis: Execution Time, log scale (ms), 1 to 10000; series: Batch, Sliced, Wavefront.]
Batch
• Low overhead
• Poor scaling
• Fastest for short seq.
Wavefront
• Efficient use of vector lanes
• Fastest for medium seq.
Sliced
• High overhead
• Execution bypassing
• Fastest for long seq.
HW with Variable Vector Length
• Given freedom to choose the hardware for each sequence length, we can establish a set of optimal algorithm-hardware pairs.

| Read Length | Algorithm | Vector Length | Speedup Over 512-bit Wavefront |
|---|---|---|---|
| < 50 bps | Batch | 128-bit | 2.77 |
| 50-100 bps | Wavefront | 1024-bit | 1.03 |
| 100-400 bps | Sliced | 256-bit | 1.23-3.06 |
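The table can be read as a small dispatch rule. The sketch below encodes it directly; `pick_configuration` is a hypothetical helper, and two details are assumptions because the table leaves them open: a read of exactly 100 bps is sent to Wavefront, and lengths above 400 bps are rejected since the slides report no measurements there.

```python
def pick_configuration(read_length):
    """Return (algorithm, vector_length_bits) for a read length in base pairs."""
    if read_length <= 0:
        raise ValueError("read length must be positive")
    if read_length < 50:
        return ("Batch", 128)
    if read_length <= 100:        # assumption: the shared boundary at 100 goes to Wavefront
        return ("Wavefront", 1024)
    if read_length <= 400:
        return ("Sliced", 256)
    raise ValueError("no measurement above 400 bps in the table")


if __name__ == "__main__":
    for length in (25, 50, 100, 200, 400):
        print(length, pick_configuration(length))
```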
Conclusion
Smith-Waterman on SVE:
+ Select Optimal Vector Length & Algorithm depending on Input
+ Lower Instruction Footprint
• Improvements to memory controller can lead to improved performance
• Wavefront algorithm uses 64-bit vectors due to limitations on gather-scatter instruction addressing.
Key References
[1] Smith TF, Waterman MS, “Identification of common molecular subsequences” J Mol Biol 147
[2] Wozniak A. “Using video-oriented instructions to speed up sequence comparison” Comput Appl Biosci. 1997
[3] Farrar M, “Striped Smith-Waterman speeds database searches six times over other SIMD implementations” Bioinformatics, Vol 23, Issue 2, 15 January 2007
[4] Rognes T, “Faster Smith-Waterman database searches with inter-sequence SIMD parallelization” Bioinformatics 2011
[5] Zhao M, Lee W, Garrison E., Marth G. “SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications”
[6] Li H, Durbin R. “Fast and accurate short read alignment with Burrows-Wheeler transform” Bioinformatics 25
[7] Steinfadt S. “SWAMPT+: Enhanced Smith-Waterman Search for Parallel Models”
Questions?