TRANSCRIPT
© 2014 Chevron
High Frequency Elastic Seismic Modeling
on GPUs Without Domain Decomposition
25-Mar-2014
Thor Johnsen and Alex Loddoch
Chevron
Goal

Out-of-core GPU implementation of an elastic seismic wave propagation kernel, using host memory as bulk memory. The pipelined approach of Etgen and O'Brien (2007) is of particular interest.
Out-of-core FDTD
Pipelined FDTD
Blocks

For convenience, we choose to work with blocks instead of individual Y-Z planes. We use a fixed block size of L/2 planes, where L is the stencil length.
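To make the decomposition concrete, here is a minimal sketch (all numbers hypothetical, taken from the acoustic TTI case on the next slide): with a block size of at least half the stencil length, advancing a block one timestep needs data only from the block itself and its two neighbors.

```cpp
#include <cstdio>

int main() {
    // Hypothetical numbers from the acoustic TTI kernel:
    const int L = 19;          // effective stencil length
    const int blockSize = 16;  // >= L/2 planes, rounded up to a convenient size
    const int NX = 1000;       // number of Y-Z planes in the volume

    // With blockSize >= L/2, the half-stencil halo of any cell in block b
    // lies entirely within blocks b-1 and b+1, so one timestep of block b
    // needs only three resident blocks.
    const int numBlocks = (NX + blockSize - 1) / blockSize;
    printf("%d planes -> %d blocks of %d planes each\n", NX, numBlocks, blockSize);
    return 0;
}
```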
Pseudo-Acoustic TTI Kernel

One PDE
– Non-energy-preserving pseudo-acoustic wave propagation
– Tilted transverse isotropy
– Variable density and Q attenuation
10th order in space, 2nd order in time
– ...but the effective stencil length is 19!
– The most convenient block size is 16.
2 wavefields + 7 earth model attributes
– Need wavefields at 2 consecutive timesteps
– Earth model packed into 8 bytes
  • On-the-fly rotation
  • Quantization
– Need 24 bytes of memory per cell (2 wavefields × 2 timesteps × 4 bytes + 8 bytes of earth model)
Pipelined Acoustic TTI

[Figure: animation frames of the pipelined scheme. Earth model (EM) and P-Q wavefield blocks stream from the host buffer into a ring of GPU buffers; the Input, Compute, and Output stages advance in lockstep, so while block b enters at timesteps t = -1 and t = 0, earlier blocks sit at t = 1, 2, 3 in the compute tiers and leave at t = 2 and t = 3. The pipeline wraps from block N-1 back to block 0, advancing every block three timesteps per pass (to t = 5 and t = 6 on the second pass).]
Scaling

[Figure: the same pipeline chained across two GPUs. GPU 1 takes blocks from the host buffer at timesteps t = -1, 0 and advances them to t = 2, 3; GPU 0 then advances the same blocks to t = 5, 6 before they return to the host buffer, doubling the timesteps per pass.]
How Much Memory Is Required?

On the host side, 24 bytes per cell are needed.
On the GPU side:
– Earth model buffer holds 6 blocks, 8 bytes per cell
– Wavefield buffers hold 2-4 blocks each, 8 bytes per cell
– X block size is 16
– Y and Z block sizes are the full NY and NZ
sizeof(EM_buffer) = 16*NY*NZ*6*8 bytes
sizeof(WF_buffers) = (16*NY*NZ)*(3+4+3+3+2)*8 bytes
Ex.:
– 1000^3 volume
– GPU memory = 16*1000^2*21*8 bytes ~= 2.5GB
– Host memory = 1000^3*24 bytes ~= 22.4GB
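As a check on these numbers, a small sketch (sizes taken from this slide) that evaluates the two formulas:

```cpp
#include <cstdio>

int main() {
    // Acoustic TTI, out-of-core: sizes from the slide (1000^3 volume).
    const long long NX = 1000, NY = 1000, NZ = 1000;
    const int xBlock = 16;                  // X block size in planes
    const int emBlocks = 6;                 // earth model buffer holds 6 blocks
    const int wfBlocks = 3 + 4 + 3 + 3 + 2; // wavefield block counts, per the slide's formula
    const int bytesPerCellGPU = 8;          // packed EM, or 2 wavefields at 4 bytes each

    const double GiB = 1024.0 * 1024.0 * 1024.0;
    double gpuBytes  = (double)xBlock * NY * NZ * (emBlocks + wfBlocks) * bytesPerCellGPU;
    double hostBytes = (double)NX * NY * NZ * 24;  // 24 bytes per cell on the host

    printf("GPU memory  ~= %.1f GiB\n", gpuBytes / GiB);   // ~2.5 GiB
    printf("Host memory ~= %.1f GiB\n", hostBytes / GiB);  // ~22.4 GiB
    return 0;
}
```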
PCIE Bandwidth Not a Bottleneck

PCIe bandwidth constrains the cycle rate, which in turn constrains the compute rate
– Cycle rate = PCIe BW / bytes per cell
– PCIe is full duplex
– Acoustic TTI cycle rate
  • ~12.2GB/s / 24 bytes ~= 500Mpts/s
Max compute rate
– Cycle rate × #timesteps per pass
– ~500Mpts/s × num_timesteps (3-6 per pass) = 1.5-3.0Gpts/s
– Max compute rate >> actual compute rate
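A short sketch of this arithmetic (bandwidth and byte counts from this slide; the range of 3-6 timesteps per pass is an assumption consistent with the quoted 1.5-3.0 Gpts/s):

```cpp
#include <cstdio>

int main() {
    // PCIe Gen3 x16 sustains ~12.2 GB/s in each direction (full duplex),
    // so the cell "cycle rate" through the GPU is bandwidth / bytes-per-cell.
    const double pcieBW = 12.2e9;   // bytes/s, one direction
    const int bytesPerCell = 24;    // acoustic TTI host footprint
    const double cycleRate = pcieBW / bytesPerCell;  // ~508 Mcells/s

    // Each pass through the GPU advances several timesteps, multiplying
    // the effective compute rate.
    for (int steps = 3; steps <= 6; ++steps)
        printf("%d timesteps/pass -> max compute rate %.2f Gpts/s\n",
               steps, cycleRate * steps / 1e9);
    return 0;
}
```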
Acoustic TTI Conclusion

Very large volumes can be propagated on a single GPU
PCIE bandwidth is not a bottleneck
– Out-of-core implementation is as fast as any other implementation
Scales linearly up to 16 GPUs
Elastic Orthorhombic Kernel

Pair of coupled PDEs
12 wavefields + 25 earth model attributes
– Wavefields require 48 bytes per cell
– Earth model can be packed into 16 bytes
  • On-the-fly rotation
  • Quantization (see the sketch below)
– Need 64 bytes per cell
8th order in space, 2nd order in time
– ...effective stencil length is 8!
– The most convenient X block size is 4.
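The slides do not spell out the packing, but a minimal sketch of 8-bit quantization of a single earth-model attribute (hypothetical Quantizer type and value range, not the authors' actual scheme) illustrates the idea:

```cpp
#include <cstdint>
#include <cmath>
#include <cstdio>

// Hypothetical 8-bit quantization of one earth-model attribute.
// The actual packing (rotation + quantization of 25 attributes into
// 16 bytes) is not specified at this level of detail in the slides.
struct Quantizer {
    float minVal, step;
    uint8_t encode(float v) const {
        float q = std::round((v - minVal) / step);
        return (uint8_t)std::fmin(std::fmax(q, 0.0f), 255.0f);
    }
    float decode(uint8_t q) const { return minVal + q * step; }
};

int main() {
    // Example: P-wave velocities between 1500 and 5500 m/s in 256 levels.
    Quantizer vp{1500.0f, (5500.0f - 1500.0f) / 255.0f};
    uint8_t code = vp.encode(3050.0f);
    printf("3050 m/s -> code %u -> %.1f m/s\n", code, vp.decode(code));
    return 0;
}
```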
SEAM-II

1600 x 1600 x 600 cells
Up to elastic tilted orthorhombic mode
~92GB of bulk memory required for wavefields and earth model
GPU Configuration for SEAM-II

Throughput limited by PCIE bandwidth
– ~12.2GB/s full duplex with Gen3 x16
– Host-to-device transfer is the bottleneck
– 12200 MB/s / 64 bytes ~= 190 million cells/second
– Doing 4 timesteps per pass, so maximum compute throughput is 190 × 4 ~= 760 million cells/second
Buffer layout (the original slide additionally marks, with X's in pipeline slots -10 through 0, which blocks each buffer currently holds; that slot layout is part of the figure). ST/PV are presumably the stress and particle-velocity wavefield buffers for pipeline stages 0-4; H2D/D2H = host-to-device/device-to-host transfer, C = compute.

Buffer  Ops      NX    NY    NZ   bytes/cell  # blocks  MB/buffer
EM      H2D       4   1600   600      16         11        645
0-ST    H2D       4   1600   600      24          3        264
0-PV    H2D       4   1600   600      24          4        352
1-ST    C         4   1600   600      24          3        264
1-PV    C         4   1600   600      24          3        264
2-ST    C         4   1600   600      24          3        264
2-PV    C         4   1600   600      24          3        264
3-ST    C         4   1600   600      24          3        264
3-PV    C         4   1600   600      24          3        264
4-ST    C, D2H    4   1600   600      24          3        264
4-PV    C, D2H    4   1600   600      24          2        176
Total GPU memory: 3,281 MB (buffers streamed from bulk host memory)
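A small sketch reproducing the MB/buffer column of the table (buffer names and grouping as above):

```cpp
#include <cstdio>

int main() {
    // MB/buffer = NX * NY * NZ * bytes-per-cell * number-of-blocks.
    const double MiB = 1024.0 * 1024.0;
    const int NX = 4, NY = 1600, NZ = 600;

    double em = (double)NX * NY * NZ * 16 * 11 / MiB;  // earth model buffer
    double wf = (double)NX * NY * NZ * 24 * 3  / MiB;  // a 3-block wavefield buffer
    printf("EM buffer      ~= %.0f MB\n", em);  // ~645
    printf("3-block buffer ~= %.0f MB\n", wf);  // ~264
    return 0;
}
```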
Dual GPU Configuration for SEAM-II
No additional host <-> device transfers
Dedicated links for all transfers -> linear scaling of PCIE bandwidth
Number of timesteps proportional to number of GPUs
Compute throughput scales linearly with number of GPUs
Buffer layout across the two GPUs (the per-slot X marks are figure residue; transfers shown per buffer):

GPU 0 (pipeline slots -10 to 0):
  EM            in: H2D, out: D2D to GPU 1
  0-ST, 0-PV    in: H2D
  1-ST .. 3-PV  compute
  4-ST, 4-PV    compute, out: D2D to GPU 1

GPU 1 (pipeline slots -20 to -10):
  EM            in: D2D from GPU 0
  4-ST, 4-PV    in: D2D from GPU 0
  5-ST .. 7-PV  compute
  8-ST, 8-PV    compute, out: D2H
Bulk Memory Transfers
[Figure: the single-GPU buffer table from the previous slide, redrawn with bulk host memory feeding the GPU buffers through a pair of pinned staging buffers.]
Bulk memory is regular swappable memory. A pair of pinned memory buffers enables DMA transfers between bulk memory and GPU memory. This arrangement triples the host memory bandwidth requirement: each transferred byte is read from bulk memory, written to a pinned buffer, and then read again by the DMA engine.
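A minimal sketch of the staging this implies, with hypothetical names throughout: while the DMA engine drains one pinned buffer on a CUDA stream, the CPU fills the other from bulk memory. The fillFromBulk() placeholder stands in for the multi-threaded, NUMA-aware memcpy described on the next slide.

```cpp
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical ping-pong staging between swappable bulk host memory and a
// ring of device blocks. Each byte is read from bulk, written to a pinned
// buffer, and read again by DMA -- the tripled bandwidth requirement.
void fillFromBulk(void* pinned, const char* bulk, size_t off, size_t n) {
    std::memcpy(pinned, bulk + off, n);  // in practice: NUMA-aware, streaming
}

void streamBlocksToGPU(const char* bulk, size_t nBlocks, size_t blockBytes,
                       void* pinned[2], char* devRing, size_t ringBlocks,
                       cudaStream_t stream) {
    cudaEvent_t drained[2];
    for (int i = 0; i < 2; ++i) cudaEventCreate(&drained[i]);

    for (size_t b = 0; b < nBlocks; ++b) {
        int cur = (int)(b & 1);
        // Make sure the DMA that last read this pinned buffer has finished
        // before overwriting it; the fill overlaps the other buffer's DMA.
        if (b >= 2) cudaEventSynchronize(drained[cur]);
        fillFromBulk(pinned[cur], bulk, b * blockBytes, blockBytes);
        cudaMemcpyAsync(devRing + (b % ringBlocks) * blockBytes, pinned[cur],
                        blockBytes, cudaMemcpyHostToDevice, stream);
        cudaEventRecord(drained[cur], stream);
    }
    cudaStreamSynchronize(stream);
    for (int i = 0; i < 2; ++i) cudaEventDestroy(drained[i]);
}
```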
Fast Host memcpy()

Has to be
– Multi-threaded
– NUMA aware
– Use SSE/AVX streaming writes
OpenMP code can be NUMA aware
– Requires static scheduling
– Multi-threaded memclear method
– Multi-threaded memcpy() method
– Both methods must have exactly the same loop dimensions
– Lock threads to specific cores
  • setenv GOMP_CPU_AFFINITY 0,10,1,11,2,12, etc.
– Leave two cores for miscellaneous work
  • setenv OMP_NUM_THREADS 18
Pinned memory cannot be allocated with cudaHostAlloc()
– posix_memalign on a VM page boundary
– Initialize all pages with the multi-threaded memclear (first touch places each page on the copying thread's NUMA node)
– cudaHostRegister
– Use the multi-threaded memcpy to move data between pinned and bulk memory
90+ GB/s average on a dual socket IvyBridge server with 1833MHz DIMMs
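A sketch of this recipe (hypothetical function names; compile with -fopenmp -mavx; thread affinity set via the environment variables above). Both methods use identical statically scheduled loops so each thread always touches the same pages; the first touch in memclear places each page on that thread's NUMA node before cudaHostRegister pins it.

```cpp
#include <cstdlib>
#include <immintrin.h>
#include <unistd.h>
#include <cuda_runtime.h>

// n must be a multiple of 64; dst must be 32-byte aligned.
void mt_memclear(char* dst, size_t n) {
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < (long long)(n / 64); ++i) {
        __m256i z = _mm256_setzero_si256();
        _mm256_stream_si256((__m256i*)(dst + 64 * i), z);      // bypass cache
        _mm256_stream_si256((__m256i*)(dst + 64 * i + 32), z);
    }
    _mm_sfence();
}

// Same loop shape and schedule as mt_memclear, so thread i always touches
// the same pages it first-touched (NUMA locality preserved).
void mt_memcpy(char* dst, const char* src, size_t n) {
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < (long long)(n / 64); ++i) {
        __m256i a = _mm256_loadu_si256((const __m256i*)(src + 64 * i));
        __m256i b = _mm256_loadu_si256((const __m256i*)(src + 64 * i + 32));
        _mm256_stream_si256((__m256i*)(dst + 64 * i), a);
        _mm256_stream_si256((__m256i*)(dst + 64 * i + 32), b);
    }
    _mm_sfence();
}

int main() {
    size_t bytes = 1ull << 30;  // e.g. a 1 GiB pinned staging buffer
    void* buf = nullptr;
    if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), bytes)) return 1; // page boundary
    mt_memclear((char*)buf, bytes);                          // first-touch placement
    cudaHostRegister(buf, bytes, cudaHostRegisterDefault);   // pin for DMA
    // ... mt_memcpy() now moves data between bulk and pinned memory ...
    cudaHostUnregister(buf);
    free(buf);
    return 0;
}
```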
SEAM-II @ 120Hz

What if I want to run SEAM-II at 120Hz?
– Need twice as many cells per dimension
– 3200 x 3200 x 1200 ~= 12.3 billion cells
Need ~732GB of bulk memory (12.3 billion cells × 64 bytes per cell)
– Single node with 768GB host memory
– Or use 3-4 nodes + IB-FDR
K10 memory is too small
– Make it fit: buy K40's and drop a timestep
– Or create multiple parallel GPU pipes
Multiple Parallel GPU Pipelines
Buffer layout for GPUs 4-7, one pipe of the multi-pipe configuration (per-slot X marks are figure residue). NY shrinks from buffer to buffer because each timestep consumes the spatial halo in Y (4 cells per side for the 8th-order stencil).

GPU 4 (slots -8 to 0)
Buffer  Ops              NX   NY    NZ   bytes/cell  # blocks  MB/buffer
EM      H2D in, D2D out   4   984  1200      16          9        649
0-ST    H2D in            4   984  1200      24          3        324
0-PV    H2D in            4   992  1200      24          4        436
1-ST    C                 4   984  1200      24          3        324
1-PV    C                 4   976  1200      24          3        322
2-ST    C                 4   968  1200      24          3        319
2-PV    C                 4   960  1200      24          3        316
3-ST    C, D2D out        4   952  1200      24          3        314
3-PV    C, D2D out        4   944  1200      24          2        207
Totals: 120.5   3,212

GPU 5 (slots -16 to -8)
Buffer  Ops              NX   NY    NZ   bytes/cell  # blocks  MB/buffer
EM      D2D in, D2D out   4   936  1200      16          9        617
3-ST    D2D in            4   936  1200      24          3        308
3-PV    D2D in            4   944  1200      24          4        415
4-ST    C                 4   936  1200      24          3        308
4-PV    C                 4   928  1200      24          3        306
5-ST    C                 4   920  1200      24          3        303
5-PV    C                 4   912  1200      24          3        301
6-ST    C, D2D out        4   904  1200      24          3        298
6-PV    C, D2D out        4   896  1200      24          2        197
Totals: 114.5   3,053

GPU 6 (slots -24 to -16)
Buffer  Ops              NX   NY    NZ   bytes/cell  # blocks  MB/buffer
EM      D2D in, D2D out   4   888  1200      16          9        585
6-ST    D2D in            4   888  1200      24          3        293
6-PV    D2D in            4   896  1200      24          4        394
7-ST    C                 4   888  1200      24          3        293
7-PV    C                 4   880  1200      24          3        290
8-ST    C                 4   872  1200      24          3        287
8-PV    C                 4   864  1200      24          3        285
9-ST    C, D2D out        4   856  1200      24          3        282
9-PV    C, D2D out        4   848  1200      24          2        186
Totals: 108.5   2,895

GPU 7 (slots -32 to -24)
Buffer  Ops              NX   NY    NZ   bytes/cell  # blocks  MB/buffer
EM      D2D in            4   840  1200      16          9        554
9-ST    D2D in            4   840  1200      24          3        277
9-PV    D2D in            4   848  1200      24          4        373
10-ST   C                 4   840  1200      24          3        277
10-PV   C                 4   832  1200      24          3        274
11-ST   C                 4   824  1200      24          3        272
11-PV   C                 4   816  1200      24          3        269
12-ST   C, D2H out        4   808  1200      24          3        266
12-PV   C, D2H out        4   800  1200      24          2        176
Totals: 102.5   2,737
Simple Load Balancing

We don't have to completely propagate a block before passing it on to the next GPU, which gives us an easy way to achieve fine-grained load balancing (see the sketch below). The halo overhead cost is spread evenly among all the GPUs; this halves the halo overhead cost for the pipeline.
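A toy sketch (hypothetical numbers) of such a split: work is divided by global block-step index, so a block's timesteps can straddle two GPUs instead of forcing each GPU to take a whole number of timesteps per block.

```cpp
#include <cstdio>

int main() {
    // Hypothetical: nBlocks blocks each need T timesteps, spread over G
    // pipeline GPUs. Dividing by global step index lets a block be handed
    // off mid-propagation, balancing load at step granularity.
    const long long T = 3, G = 2, nBlocks = 3;  // small numbers for display
    const long long total = T * nBlocks;        // total block-steps per pass
    for (long long b = 0; b < nBlocks; ++b)
        for (long long k = 0; k < T; ++k) {
            long long g = ((b * T + k) * G) / total;  // GPU owning this step
            printf("block %lld, timestep %lld -> GPU %lld\n", b, k + 1, g);
        }
    return 0;  // note block 1's steps land on both GPU 0 and GPU 1
}
```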
Host Pipeline
Compute throughput scales linearly
Bulk memory capacity scales linearly
Very easy to deploy on cloud machines
Ex.:
– Cloud GPU node with 2 M2090's, 32GB RAM
– Three of these nodes can propagate SEAM-II with 3x the throughput of a single node
Parallel Host Pipelines
Fixed halo overhead (~10-20%)
Compute throughput scales linearly
Bulk memory capacity scales linearly
Benefits

Capacity
– Can handle larger volumes than other designs
– A single GPU can propagate volumes with 10+ billion cells
– Multiple nodes can propagate volumes with 1+ trillion cells
Performance
Flexibility
– Code can adapt to any hardware configuration; easy to deploy on cloud infrastructure
– Works very well with long-offset models
Scalability
– Linear scaling within a node
– Linear scaling across nodes with a single host pipe
– Linear scaling with fixed overhead with multiple parallel host pipes
References
Etgen, J. and O'Brien, M. (2007): "Computational methods for large-scale 3D acoustic finite-difference modeling: A tutorial." Geophysics, Vol. 72, No. 5, pp. SM223-SM230.

Vitter, J. S. (2001): "External Memory Algorithms and Data Structures: Dealing with Massive Data." ACM Computing Surveys, Vol. 33, No. 2, pp. 209-271.
Acknowledgements
I would like to thank Alex Loddoch and Tech Computing for inspired discussions and access to computing hardware.
I would like to thank Joe Stephani and Kurt Nihei for providing me with reference codes and helping me qualify the optimized codes, and Dimitri Bevc for supporting this project.
Last, but not least, I would like to thank Chevron for letting me publish these results.