porting and adapting applications to arm’s sve€¦ · §complex control flow ... §sve supports...
TRANSCRIPT
![Page 1: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/1.jpg)
11 1University of Michigan
PORTING AND ADAPTING APPLICATIONS TO ARM’S SVEJonathan Beaumont, Dong-hyeon Park, Subhankar Pal, Trevor Mudge
![Page 2: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/2.jpg)
22 2University of Michigan
Introduction§ Scalable Vector Extension (SVE)
§ Extends Advanced SIMD (NEON) beyond 128 bit vectors
§ NEON designed for DSP, media codecs etc.§ Fixed vector sizes (128 bits)§ Simple control flow§ Regular, contiguous data structures
§ SVE extends to HPC applications§ Longer, per-implementation vector sizes (128-2048 bits)§ Complex control flow§ Irregular data structures
![Page 3: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/3.jpg)
33 3University of Michigan
Scalable vectors§ Different PPA restrictions require distinct hardware vector
lengths (e.g. mobile processors versus server hardware)
§ SVE enables hardware implementations to choose vector length multiple of 128 bits (up to 2048)§ Program code is agnostic; hardware manages automatically
ld1w z1.s, p1/z, [x1, x2, lsl #2] ; load VL sized vector from memory location [x1+4*x2]
incw x2 ; increment x2 counter by VLwhilelt p1.s, x2, x3 ; loop while x2 < x3 b.mi loop
![Page 4: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/4.jpg)
44 4University of Michigan
Per-lane predication§ Hardware dynamically masks inactive lanes
§ Support if/else statements
§ Handles scalar loops which do not iterate an exact multiple of vector length§ No fix-up code or strip-mining
while(i < end) { if(check[i]) //do stuff i++;}
whilelt p1.s, w11, w12b.pl end_looploop: ld1w z1.s, p1/z, [x4, x11, lsl #2] cmpeq p2.s, p1/z, z1.s #0 //do stuff [p2/z] incw x11 whilelt p1.s, w11, w12 b.mi loopend_loop:
![Page 5: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/5.jpg)
55 5University of Michigan
Vector load-stores§ HPC data structures often use complex data structures using
pointers§ SVE supports gather-load and scatter-store instructions
§ Allows indirect-access to non-contiguous memory arrays within a single instruction
int new_val = arr[idxs[i]] + 1;ld1w z2.s, p1/z, [x5, x11, lsl #2]ld1w z3.s, p1/z, [x1, z2.s, uxtw #2]fadda s13, p1, s13, 1
![Page 6: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/6.jpg)
66 6University of Michigan
Horizontal reductions§ Dependencies between elements in a vector can be resolved
with special instructions§ Summation, minimum, maximum, and logical reductions supported in
single instructions
Example Sparse Matrix-Vector Multiplication:scvtf s13, xzr ; sum = 0loop: ld1w z1.s, p1/z, [x4, x11, lsl #2] ; ld vals[i] ld1w z2.s, p1/z, [x5, x11, lsl #2] ; ld cols[i] ld1w z3.s, p1/z, [x1, z2.s, uxtw #2] ; ld vec[cols[i]] fmul z1.s, p1/m, z1.s, z3.s ; vals[i] * vec[cols[i]] fadda s13, p1, s13, z1.s ; sum += sum(z1) incw x11 whilelt p1.s, w11, w12 b.mi loopend_loop:
![Page 7: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/7.jpg)
77 7University of Michigan
Target Workloads§ Many scientific workloads make use of vector operations
through linear algebra subroutines
§ These workloads have wide range of sparsity in their structures, which affects§ Control flow regularity§ Memory access patterns
§ We look at two case studies with complementary sparsity:
![Page 8: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/8.jpg)
88 8University of Michigan
Target Workloads§ Genetic sequence alignment
§ Dense vectors§ Fine grained parallelism§ Moderate control irregularity§ Moderate memory irregularity
§ Graph analytics§ Extremely sparse (1 in 106) matrices§ Emphasis on effective data movement over minimizing computation
§ Employ outer-product on matrix multiplications
§ Little control divergence§ Significant memory irregularity
![Page 9: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/9.jpg)
99 9University of Michigan
Sequence Alignment§ Before analysis can be performed on genome fragments, they
must be aligned to a reference gene
§ Multiple alignments corresponding to various insertions, deletions, and offsets must be evaluated, leading to a dense, vectorized workload
§ We evaluate the Smith-Waterman Algorithm
referencegene
![Page 10: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/10.jpg)
1010 10University of Michigan
Smith-Waterman Algorithm§ Local sequence alignment algorithm developed in 1981§ Smaller amount of parallelism (100s of elements or less)
Local sequence alignment algorithm developed in 1981
Inputs:
Output:
ReferenceSequence
QuerySequence
AlignmentLocation&Score
ScoringMatrixConstruction
MatrixBacktracking
Smith-Waterman
![Page 11: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/11.jpg)
1111 11University of Michigan
𝑯 𝒎,𝒏 = 𝑚𝑎𝑥 )𝑬(𝒎,𝒏)𝑭(𝒎,𝒏)
𝑯 𝒎− 𝟏, 𝒏 − 𝟏 + 𝑺 𝒂𝒎, 𝒃𝒏-- A C A C A A-- 0 0 0 0 0 0 0A 0G 0C 0A 0
………………
… … …
Scoring ReferenceSequence
Que
rySeq
uence
𝐸 𝑚, 𝑛 = 𝑚𝑎𝑥 6𝐻 𝑚, 𝑛 − 1 − 𝑔:𝐸 𝑚, 𝑛 − 1 − 𝑔;
𝐹 𝑚, 𝑛 = 𝑚𝑎𝑥 6𝐻 𝑚 − 1, 𝑛 − 𝑔:𝐹 𝑚 − 1, 𝑛 − 𝑔;
Scoring Matrix Construction
![Page 12: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/12.jpg)
1212 12University of Michigan
Scoring Matrix Construction
-- A C A C A A-- 0 0 0 0 0 0 0A 0 2 1 2 1 2 2G 0 1 1 1 1 1 1C 0 0 3 2 1A 0 2 52
2
………………
… … …
ReferenceSequence
Que
rySeq
uence
Scoring
𝑯 𝒎,𝒏 = 𝑚𝑎𝑥 )𝑬(𝒎,𝒏)𝑭(𝒎,𝒏)
𝑯 𝒎− 𝟏, 𝒏 − 𝟏 + 𝑺 𝒂𝒎, 𝒃𝒏
𝐸 𝑚, 𝑛 = 𝑚𝑎𝑥 6𝐻 𝑚, 𝑛 − 1 − 𝑔:𝐸 𝑚, 𝑛 − 1 − 𝑔;
𝐹 𝑚, 𝑛 = 𝑚𝑎𝑥 6𝐻 𝑚 − 1, 𝑛 − 𝑔:𝐹 𝑚 − 1, 𝑛 − 𝑔;
![Page 13: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/13.jpg)
1313 13University of Michigan
-- A C A C A A-- 0 0 0 0 0 0 0A 0 2 1 2 1 2 2G 0 1 1 1 1 1 1C 0 0 3 2 1A 0 2 5
………………
… … …
ReferenceSequence
Que
rySeq
uence
Scoring
𝐻 3,4 =
max 𝐸 3,4 = 2 − 1𝐹 3,4 = 2 − 1
𝟑 + 𝟐 = 𝟓
𝑆 𝑎, 𝑏 = +2Gappenalty=1
𝐻 3,4 = 5
Given:
S(A,A)=+2
Scoring Matrix Construction
2
2
![Page 14: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/14.jpg)
1414 14University of Michigan
Findsthebestlocalalignmentfromthescoringmatrix
Step1.
Searchthroughthematrixandfindtheentrywiththelargestscore
Backtracking
maxentry1
Que
rySeq
uence
ReferenceSequenceBacktracking
![Page 15: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/15.jpg)
1515 15University of Michigan
Findsthebestlocalalignmentfromthescoringmatrix
Step2.
Checktheadjacententriesforthenextlargestscore
Movetotheentrywiththelargestscoreandcontinuethepath
Backtracking
maxentry1
Que
rySeq
uence
ReferenceSequenceBacktracking
Traversebackthroughthelargestscore
2
![Page 16: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/16.jpg)
1616 16University of Michigan
Findsthebestlocalalignmentfromthescoringmatrix
Step3.
Gettheresultingalignment
PathDirection
Alignment
Horizontal DeletionVertical InsertionDiagonal Match
Backtracking
maxentry1
Que
rySeq
uence
ReferenceSequenceBacktracking
Traversebackthroughthelargestscore
2
![Page 17: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/17.jpg)
1717 17University of Michigan
Findsthebestlocalalignmentfromthescoringmatrix
Step3.
Gettheresultingalignment
3Reference: A-CAC
Query: AGCAC
Insertion
AlignmentScore:7
Backtracking
maxentry1
Que
rySeq
uence
ReferenceSequenceBacktracking
Traversebackthroughthelargestscore
2
![Page 18: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/18.jpg)
1818 18University of Michigan
Smith-Waterman Vectorization
ReferenceQuery
L
L
X1000samples
BATCH:
SLICED:
…VL
SVE[0]
SVE[1]…
SVE[VL-1]
SVE[0] SVE[0] … SVE[VL-1]
X(1000/VL)iterations
X1000iterations
𝑉𝐿 =SIMD_WIDTH
32bits
AlignmentLocation&Score
![Page 19: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/19.jpg)
1919 19University of Michigan
Batch Smith-Waterman
ReferenceSequence
QuerySequencesSampled
0 0 0 0 0
Query0Reference0
SVE[0]
0 0 0 0 0
Query1Reference1
SVE[1]
0 0 0 0 0
QueryKReferenceK
SVE[K]
……✓No data dependencies between lanes✘ Divergent memory access✘ More memory bandwidth needed
![Page 20: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/20.jpg)
2020 20University of Michigan
Sliced Smith-Waterman
-- A C A C A G T A C A--AGCA
………………… …
ReferenceSequence
Que
rySeq
uence
… … … … … … … … …
…
VL=K
SVE[0] SVE[1] SVE[K-1]
SVE[0] SVE[1] SVE[2]
1
EH
F
𝑯 𝒎,𝒏 = 𝑚𝑎𝑥 )𝑬(𝒎,𝒏)𝑭(𝒎,𝒏)
𝑯 𝒎− 𝟏, 𝒏 − 𝟏 + 𝑺 𝒂𝒎, 𝒃𝒏
InitialCalculationofH(m,n)
![Page 21: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/21.jpg)
2121 21University of Michigan
Sliced Smith-Waterman
-- A C A C A G T A C A--AGCA
………………… …
ReferenceSequence
Que
rySeq
uence
… … … … … … … … …
…
VL=K
SVE[0] SVE[1] SVE[K-1]
SVE[0] SVE[1] SVE[2]
1
EH
F
InitialCalculationofH(m,n)
horizontaldependenciesbetweenslicesnotaccounted
ValueofFneedtobere-calculated
![Page 22: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/22.jpg)
2222 22University of Michigan
Sliced Smith-Waterman
-- A C A C A G T A C A--AGCA
………………… …
ReferenceSequence
Que
rySeq
uence
… … … … … … … … …
…
VL=K
SVE[0] SVE[1] SVE[K-1]
SVE[0] SVE[1] SVE[2]
1 InitialCalculationofH(m,n)
ResolveDependencies2
𝐹 𝑚, 𝑛 = 𝑚𝑎𝑥 6𝐻 𝑚 − 1, 𝑛 − 𝑔:𝐹 𝑚 − 1, 𝑛 − 𝑔;
𝑯 𝒎,𝒏 = max 𝑭 𝒎,𝒏 , 𝑯 𝒎,𝒏
Costlyresolutionasoneneedtotraversethroughtheentirerow
![Page 23: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/23.jpg)
2323 23University of Michigan
Sliced Smith-Waterman
-- A C A C A G T A C A--AGCA
………………… …
ReferenceSequence
Que
rySeq
uence
… … … … … … … … …
SVE[0] SVE[1] SVE[2]
1 InitialCalculationofH(m,n)
✓Smaller memory footprint✓Coalesced memory access✘ Inter-lane dependencies
![Page 24: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/24.jpg)
2424 24University of Michigan
§ Custom implementations run on modified gem5 setup supporting SVE instruction set
Experimental Setup
§ Current gem5 model serializes memory accesses across cache lines
§ 25 to 400 sequence length
![Page 25: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/25.jpg)
2525 25University of Michigan
0
0.5
1
1.5
2
2.5
25 50 100 200 400
Instructions
SequenceSize
NormalizedInstructionsExecuted
cpu batch_neon sliced_neon batch_sve_128bit sliced_sve_128bit
Performance
1 1 1 1 11.5 1.5 1.4 1.4 1.30.2 0.4 0.8 1.4 2.1
9.4 9.7 8.2 7.8 7.53.3
9.0
16.7
28.1
38.2
05
1015202530354045
25 50 100 200 400
Speedu
p
SequenceLength
AlignmentTimeSpeedupoverBaselineCPU
cpu batch_neon sliced_neon batch_sve_128bit sliced_sve_128bit
![Page 26: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/26.jpg)
2626 26University of Michigan
0
2
4
6
8
10
12
14
16
18
20
L025 L050 L100 L200 L400 L800
Mem
oryCo
ntrollerB
andw
idth(M
B/s)
SequenceLength
MemoryBandwidthofSlicedS-W:SVE256-bit,64kBL1D
read_bw write_bw
0
500
1000
1500
2000
2500
3000
3500
4000
L025 L050 L100 L200 L400 L800
Mem
oryCo
ntrollerB
andw
idth(M
B/s)
SequenceLength
MemoryBandwidthofBatchS-W:SVE256-bit,64kBL1D
read_bw write_bw
Batch vs Sliced: Memory Bandwidth§ Sliced reduces memory bandwidth
![Page 27: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/27.jpg)
2727 27University of Michigan
0
200
400
600
800
1000
1200
1400
1600
25 50 100 200 400 800
Alignm
entT
ime(m
s)
SequenceLength
ExecutionTimeforSlicedSmith-Watermanwith1000Sequences64kBL1DCache
128-bit 256-bit 512-bit 1024-bit
Sliced Smith-Waterman§ Larger vector length only beneficial for longer sequences
![Page 28: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/28.jpg)
2828 28University of Michigan
Memory coalescing§ Short bursts of memory divergence is a problem in our current
implementation§ While work is being done to implement scatter-gather
functionality, we believe more can be done to exploit memory level parallelism
§ We’ve demonstrated appreciable speedup on GPUs by coalescing not just across lanes in a vector, but across instructions as well
§ We believe this can be exploited to a lesser degree on SIMD pipelines as well
![Page 29: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/29.jpg)
2929 29University of Michigan
Hazards in Benchmarks
0
5
10
15
20
25
30Cachelinesperload/store
Waitingloads/stores
29MemoryDivergent Bandwidth-Limited Cache-Limited"WarpPool: sharing requests with inter-warp coalescing for throughput processors." MICRO 2015.
![Page 30: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/30.jpg)
3030 30University of Michigan
Intra-WarpCoalescer
Intra-WarpCoalescer Inter-Warp
CoalescerWarp
Scheduler
WarpScheduler L1
Intra-WarpCoalescers
Inter-WarpQueues
SelectionLogic
L1
Memory Coalesing
"WarpPool: sharing requests with inter-warp coalescing for throughput processors." MICRO 2015.
![Page 31: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/31.jpg)
3131 31University of Michigan
0
0.5
1
1.5
2
Speedu
p(x)
8-waybankedcache MRPB CustomCoalescer
MemoryDivergent Bandwidth-Limited Cache-Limited
3.172.35 5.16
1.38x
[1]MRPB:Memoryrequestprioritizationformassivelyparallelprocessors:HPCA2014
Coalescing Speedup on GPUs
![Page 32: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/32.jpg)
3232 32University of Michigan
Conclusion§ SVE instructions are well tailored to significantly reduce code
size and execution overhead of HPC applications§ For fine-grained applications such as genomics, lack of
scalable memory resources result in diminishing returns for increased vector size
§ Future focus on improving memory coalescing may be the key to better performance
![Page 33: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/33.jpg)
3333 33University of Michigan
Questions?
![Page 34: PORTING AND ADAPTING APPLICATIONS TO ARM’S SVE€¦ · §Complex control flow ... §SVE supports gather-load and scatter-store instructions §Allows indirect-access to non-contiguous](https://reader033.vdocuments.mx/reader033/viewer/2022051919/600be5bf2c962e3c303e2b67/html5/thumbnails/34.jpg)
3434 34University of Michigan
1
10
100
1000
10000
25 50 100 200 400 800
Alignm
entT
ime(m
s)
SequenceLength(bps)
ComparisonofDifferentSmith-WatermanImplementationfor256-bitSVE
Batch Slicedw/oBypass Sliced
0.27 0.751.48
4.24
7.57
12.19
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
L025 L050 L100 L200 L400 L800
SpeedupofChoosingSlicedoverBatch256-bitSVEand64kBL1DCache
Batch vs Sliced: Execution Time§ Batch performs better for short sequences, sliced performs
better for long