
Vectors with Values on the JVM
Razvan Lupusoru – Intel
Paul Sandoz – Oracle @PaulSandoz October 5, 2017

Intel Legal Disclaimer & Optimization Notice • INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
• Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
2

Oracle Safe Harbor Statement
3
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Java first, Java always, Java for ML
4

Overview
• Explain SIMD and why it is useful
• Introduce the Vector API and some examples
• Deep dive into how the Vector API is optimized on Intel CPUs
5

Let’s talk SIMD (by John Rose) (the fight for love and glory!)
It's still the same old story,
Though now it's multi-core-y;
The data lane-wise fly.
The SIMD basics never lie,
(The Monoid functor maps "APPLY")
As time goes by.
6

What is SIMD?
• Single Instruction, Multiple Data
• With one "instruction", operate on more stuff
7

Scalar addition
8
static void scalarAdd(int[] a, int[] b, int[] r) {
    for (int i = 0; i < a.length; i++) {
        r[i] = a[i] + b[i];
    }
}

Unrolled scalar addition
9
static void unrolledScalarAdd(int[] a, int[] b, int[] r) {
    int i = 0;
    int lanes = 4;
    for (; i < a.length - a.length % lanes; i += lanes) {
        r[i + 0] = a[i + 0] + b[i + 0];
        r[i + 1] = a[i + 1] + b[i + 1];
        r[i + 2] = a[i + 2] + b[i + 2];
        r[i + 3] = a[i + 3] + b[i + 3];
    }
    if (i < a.length) {
        // scalar tail for the remaining a.length % lanes elements
        for (; i < a.length; i++) {
            r[i] = a[i] + b[i];
        }
    }
}

SIMD addition
10
static void simdAdd(int[] a, int[] b, int[] r) {
    int i = 0;
    int lanes = 4;
    for (; i < a.length - a.length % lanes; i += lanes) {
        // With SIMD, one instruction performs all four lane additions at once:
        //   r[i..i+3] = a[i..i+3] + b[i..i+3]
        r[i + 0] = a[i + 0] + b[i + 0];
        r[i + 1] = a[i + 1] + b[i + 1];
        r[i + 2] = a[i + 2] + b[i + 2];
        r[i + 3] = a[i + 3] + b[i + 3];
    }
    if (i < a.length) {
        // scalar tail for the remaining elements
        for (; i < a.length; i++) {
            r[i] = a[i] + b[i];
        }
    }
}

SIMD addition with the Vector API
11
static void vectorAdd(int[] a, int[] b, int[] r) {
    int i = 0;
    int lanes = INT_256_SPECIES.length();   // 8 lanes for a 256-bit int species
    for (; i < a.length - a.length % lanes; i += lanes) {
        IntVector<…> av = INT_256_SPECIES.fromArray(a, i);
        IntVector<…> bv = INT_256_SPECIES.fromArray(b, i);
        av.add(bv).intoArray(r, i);
    }
    if (i < a.length) {
        // scalar tail for the remaining elements
        for (; i < a.length; i++) {
            r[i] = a[i] + b[i];
        }
    }
}

SIMD masking the tail with the Vector API
12
static void vectorAdd(int[] a, int[] b, int[] r) {
    int i = 0;
    int lanes = INT_256_SPECIES.length();
    for (; i < a.length - a.length % lanes; i += lanes) {
        IntVector<…> av = INT_256_SPECIES.fromArray(a, i);
        IntVector<…> bv = INT_256_SPECIES.fromArray(b, i);
        av.add(bv).intoArray(r, i);
    }
    if (i < a.length) {
        // tailMask (helper defined elsewhere) produces a mask covering only the remaining lanes
        Vector.Mask<…> m = tailMask(a.length, lanes);
        IntVector<…> av = INT_256_SPECIES.fromArray(a, i, m);
        IntVector<…> bv = INT_256_SPECIES.fromArray(b, i, m);
        av.add(bv).intoArray(r, i, m);
    }
}

SIMD specific use cases
• Image manipulation
• Linear Algebra (BLAS) (a hedged dot-product sketch follows this list)
• Machine/Deep learning (matrix multiplication, both sparse and dense)
• Cryptographic algorithms
• Financial applications
• Numerous use cases within the JDK
13
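As a concrete illustration of the BLAS-style use case above (not from the original slides), here is a hedged sketch of a single-precision dot product (SDOT) in this deck's API style. The species constant F_SPEC is an assumption (a 256-bit float species, as used in later slides), and only operations the deck shows elsewhere (zero, fromArray, mul, add, sumAll, length) are used; treat it as a sketch, not the benchmarked implementation.

static float vectorDot(float[] a, float[] b) {
    int lanes = F_SPEC.length();                     // 8 lanes for a 256-bit float species
    FloatVector<Shapes.S256Bit> acc = F_SPEC.zero();
    int i = 0;
    for (; i < a.length - a.length % lanes; i += lanes) {
        FloatVector<Shapes.S256Bit> av = F_SPEC.fromArray(a, i);
        FloatVector<Shapes.S256Bit> bv = F_SPEC.fromArray(b, i);
        acc = acc.add(av.mul(bv));                   // accumulate partial products lane-wise
    }
    float sum = acc.sumAll();                        // horizontal reduction of the accumulator
    for (; i < a.length; i++) {                      // scalar tail for the remaining elements
        sum += a[i] * b[i];
    }
    return sum;
}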

Goal of Vector API
• Express simple and complex SIMD-based computations with clear code
• Good, reliable performance maximizing use of a processor (Intel, ARM, GPU?)
• Graceful degradation when functionality is not available
14

What about auto-vectorization?
• The runtime compiler can convert some scalar loops into vectorized loops (superword vectorization)
• The set of loops recognized is limited and the optimization can be fragile (a hedged illustration of both cases follows)
15
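For illustration only (not from the slides), the sketch below contrasts a loop shape that HotSpot's superword pass typically handles with one that usually defeats it; whether either actually vectorizes depends on the JVM version and flags.

// Straight-line counted loop over arrays: a good candidate for superword auto-vectorization.
static void scale(float[] a, float[] r) {
    for (int i = 0; i < a.length; i++) {
        r[i] = a[i] * 2.0f;
    }
}

// Loop-carried dependence plus a data-dependent branch: usually not auto-vectorized.
static float sumOfPositives(float[] a) {
    float sum = 0.0f;
    for (int i = 0; i < a.length; i++) {
        if (a[i] > 0.0f) {
            sum += a[i];   // each iteration depends on the previous value of sum
        }
    }
    return sum;
}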

Vector API overview
public interface Vector<E, S extends …> {
    …
    Vector<E, S> add(Vector<E, S> o);
    Vector<E, S> add(Vector<E, S> o, Mask<E, S> m);
}

• A Vector has two type variables:
• A scalar type, E, and a shape, S
• A shape defines how many scalars are packed together (the number of lanes)
• Vectors of the same shape can be operated on directly
16

Vector API overview
public interface Species<E, S extends …> {
    …
    Vector<E, S> zero();
    Vector<E, S> fromByteArray(byte[] bs, int ix);
}

• A Vector is instantiated from a species (a factory); a usage sketch follows
17
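A hedged usage sketch of the species-as-factory idea (not from the slides). INT_256_SPECIES is assumed to be a predefined species constant, as in the earlier add example; for a 256-bit shape, fromByteArray reinterprets the first 32 bytes of each array as 8 int lanes.

static IntVector<Shapes.S256Bit> loadAndAdd(byte[] block1, byte[] block2) {
    IntVector<Shapes.S256Bit> zero = INT_256_SPECIES.zero();           // all lanes set to 0
    IntVector<Shapes.S256Bit> v1 = INT_256_SPECIES.fromByteArray(block1, 0);
    IntVector<Shapes.S256Bit> v2 = INT_256_SPECIES.fromByteArray(block2, 0);
    return zero.add(v1).add(v2);          // vectors of the same shape compose directly
}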

Vector API use cases in the JDK
• Generating hash codes
• Array mismatch (a hedged sketch of the idea follows)
• Sorting (counting ascending/descending runs)
18
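To make the array-mismatch use case concrete, here is a hedged sketch in the deck's API style; it is not the JDK's implementation. INT_256_SPECIES is assumed as before, the arrays are assumed to be the same length, and only operations the deck lists as supported (fromArray, equal, allTrue) are used.

static int firstMismatch(int[] a, int[] b) {
    int lanes = INT_256_SPECIES.length();
    int i = 0;
    for (; i < a.length - a.length % lanes; i += lanes) {
        IntVector<Shapes.S256Bit> av = INT_256_SPECIES.fromArray(a, i);
        IntVector<Shapes.S256Bit> bv = INT_256_SPECIES.fromArray(b, i);
        if (!av.equal(bv).allTrue()) {
            break;                       // some lane differs; locate it with a scalar scan
        }
    }
    for (; i < a.length; i++) {          // scalar scan of the differing block and the tail
        if (a[i] != b[i]) {
            return i;
        }
    }
    return -1;                           // no mismatch in [0, a.length)
}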

Improvements: or The Monoid functor maps "APPLY"!

DoubleVector<S> map(StrictMath::log1p)

• Use higher order functions (a hedged sketch of such a call follows)
• Requires "lambda cracking" to extract the scalar operation and apply the SIMD-equivalent instruction(s)
• Many operations can be folded into a general map operation
• This will reduce the number of explicit operations, including scalar-to-vector conversions such as broadcast
• Vector wants to be a value type whose element type is also a value
19
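A hedged sketch of how the proposed map call might read. map is a direction suggested on this slide, not an existing method, and D_SPEC is an assumed 256-bit double species; this is illustrative only.

static DoubleVector<Shapes.S256Bit> logLanes(double[] values) {
    DoubleVector<Shapes.S256Bit> v = D_SPEC.fromArray(values, 0);
    // The runtime would "crack" the method reference and substitute the SIMD equivalent, if one exists.
    return v.map(StrictMath::log1p);
}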

Hardware vs Java Mismatch
Size (bits)    | 8    | 16    | 32  | 64   | 128  | 256  | 512  | …
x86 register   | AL   | AX    | EAX | RAX  | XMM0 | YMM0 | ZMM0 | …
Java type      | byte | short | int | long |      |      |      | …
20

Hardware vs Java Mismatch
Size (bits)    | 8    | 16    | 32  | 64   | 128             | 256             | 512             | …
x86 register   | AL   | AX    | EAX | RAX  | XMM0            | YMM0            | ZMM0            | …
Java type      | byte | short | int | long | Int128Vector    | Int256Vector    | Int512Vector    | …
               |      |       |     |      | Float128Vector  | Float256Vector  | Float512Vector  | …
               |      |       |     |      | Double128Vector | Double256Vector | Double512Vector | …
21

Value Types Primer
• Value type: a user-defined primitive type
• A value type holds its own data in its allocated memory
• Vector API classes can all be value types: the interest is in the VALUE they hold, not the container (an illustrative sketch follows)
22
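For illustration only: a hedged sketch of the difference the primer describes. The class below mirrors today's reference-based representation shown on the next slide (an object holding an int[]); a value type would instead lay the four lanes out directly, with no identity and no indirection. Value-type syntax did not exist in Java at the time of this talk, so none is shown.

// Today (2017): an Int128Vector-like object is a heap object that points at a second
// heap object (the int[]) holding the actual lanes.
final class Int128VectorAsObject {
    private final int[] lanes = new int[4];   // object header + array length + 4 ints, reached via a reference
}
// As a value type, the same 128 bits of lane data would be stored inline wherever the
// vector is used (a register, a field, an array element), with no object headers to chase.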

Int128Vector
23
Value Type Memory Layout:          [optional Value Header] | 128-bit Value
Vector API Current Memory Layout:  Object Header | Array Field  →  Object Header | Array Length | int_0 | int_1 | int_2 | int_3

So why not use value types?
• Ideally, we should be using value types.
• But…
  1. Value type support is in Valhalla; the Vector API work is in Panama.
  2. Before full value types, Minimal Value Types are coming:
     • Supported at the VM level; language-level changes are further out.
     • To use them, method handles and combinators are needed to refer to value types and pass them around.
     • Very verbose code is required to construct and compose anything: Vladimir Ivanov (Oracle) coined the term "Vector Pain-gramming" for the verbosity needed to write simple vector algorithms.
24

Absence of Value Types
• Value type support is the real solution for ensuring the performance of the Vector API.
• But we have another trick in the meantime:
  • Use the HotSpot C2 type system to map Vector API classes to the registers that hold their values.
25
Int128Vector  →  TypeVect<int, 4>  →  xmm register

Intrinsification of Vector API
• Intrinsification: a compiler technique that replaces a method's implementation with a faster, hand-optimized version.
• Vector API intrinsification: converts calls to the Vector API into compiler intermediate representation that captures the desired operation semantics.

vec3 = vec1.add(vec2);  →  AddVFNode
26

for (int i = 0; i + spec.length() < a.length; i += spec.length()) {
FloatVector<S> vec1 = spec.fromArray(a, i);
FloatVector<S> vec2 = spec.fromArray(b, i);
vec1.add(vec2).intoArray(c, i);
}
for (int i = 0; i < a.length; i++) {
c[i] = a[i] + b[i];
}
27

for (int i = 0; i + spec.length() < a.length; i += spec.length()) {
FloatVector<S> vec1 = spec.fromArray(a, i);
FloatVector<S> vec2 = spec.fromArray(b, i);
vec1.add(vec2).intoArray(c, i);
}
IR produced for the vector loop: two LoadVectorNode <float, 8> feed an AddVFNode <float, 8>, whose result feeds a StoreVectorNode (with a ConINode supplying a constant used for indexing).
28

The LoadVectorNode (float, 8), AddVFNode (float, 8), and StoreVectorNode IR nodes map to the following instructions:
vmovdqu 0x10(%r11,%rbx,4),%ymm0
vmovdqu 0x10(%r10,%rbx,4),%ymm1
vaddps %ymm1,%ymm0,%ymm0
vmovdqu %ymm0,0x10(%r8,%rbx,4)
29

Actual code generated:

vmovdqu 0x10(%r11,%rbx,4),%ymm0
vmovdqu 0x10(%r10,%rbx,4),%ymm1
vaddps %ymm1,%ymm0,%ymm0
vmovdqu %ymm0,0x10(%r8,%rbx,4)
vmovdqu 0x30(%r11,%rbx,4),%ymm0
vmovdqu 0x30(%r10,%rbx,4),%ymm1
vaddps %ymm1,%ymm0,%ymm0
vmovdqu %ymm0,0x30(%r8,%rbx,4)
vmovdqu 0x50(%r11,%rbx,4),%ymm0
vmovdqu 0x50(%r10,%rbx,4),%ymm1
vaddps %ymm1,%ymm0,%ymm0
vmovdqu %ymm0,0x50(%r8,%rbx,4)
vmovdqu 0x70(%r11,%rbx,4),%ymm0
vmovdqu 0x70(%r10,%rbx,4),%ymm1
vaddps %ymm1,%ymm0,%ymm0
vmovdqu %ymm0,0x70(%r8,%rbx,4)
vmovdqu 0x90(%r11,%rbx,4),%ymm0
vmovdqu 0x90(%r10,%rbx,4),%ymm1
vaddps %ymm1,%ymm0,%ymm0
vmovdqu %ymm0,0x90(%r8,%rbx,4)
vmovdqu 0xb0(%r11,%rbx,4),%ymm0
vmovdqu 0xb0(%r10,%rbx,4),%ymm1
vaddps %ymm1,%ymm0,%ymm0
vmovdqu %ymm0,0xb0(%r8,%rbx,4)
vmovdqu 0xd0(%r11,%rbx,4),%ymm0
vmovdqu 0xd0(%r10,%rbx,4),%ymm1
vaddps %ymm1,%ymm0,%ymm0
vmovdqu %ymm0,0xd0(%r8,%rbx,4)
vmovdqu 0xf0(%r11,%rbx,4),%ymm0
vmovdqu 0xf0(%r10,%rbx,4),%ymm1
vaddps %ymm1,%ymm0,%ymm0
vmovdqu %ymm0,0xf0(%r8,%rbx,4)
add $0x40,%ebx
cmp %edi,%ebx
jl 0x00007f9f6fab4ab0
30
Key takeaways:
• Super-unrolled 8 times
• No safepoints
• No boxing/unboxing overheads

Why Intrinsification?
• Mature support on the compiler side and no new technology to introduce.
• Reduced dependence on value types.
• Thorough assembler support.
• Faster time to market (TTM) for the Vector API, with the promise of performance.
• Takes advantage of existing compiler optimizations like unrolling and scheduling.
31

Goals with Intrinsification
• Ability to access SIMD on the native architecture:
  • YES - intrinsification converts API calls to a representation that maps to native vector instructions.
• Performance:
  • YES - vectorized code gets generated.
• Graceful degradation:
  • YES (upcoming) - for operations not supported on a particular architecture, boxing/unboxing occurs just for that operation; vectors can also be downsized to fit the native architecture's vector size.
32

Challenges with Lack of Value Type Support
33
Control flow merges for objects limit escape analysis:
    IntVector<S> accum = spec.zero();
    for (...) { accum = accum.add(spec.broadcast(1)); }
  • Synthetically insert a "VectorUnbox" to transfer between object and value.
  • Expand VectorUnbox to a vector PHI node if both inputs are vector values.

Escaping Vector objects need boxing:
    return spec.fromArray(a, i);
  • Generate an allocation to create an empty Vector object.
  • Fill the field of the Vector object with the value and return the newly created object.

Method calls that pass Vector instances around need the container, not just the value:
    IntVector<S> val = spec.broadcast(42);
    compute(val);
  • Create the Vector object and fill it with the value, just as in the return case.

Image Processing Application: Sepia Filter
• The filter applies sepia toning to an image.
34
for (int i = 0; i < width * height; i++) {
    magnitudeR[i] = 0.393f * redFlat[i] + 0.769f * greenFlat[i] + 0.189f * blueFlat[i];
    magnitudeG[i] = 0.349f * redFlat[i] + 0.686f * greenFlat[i] + 0.168f * blueFlat[i];
    magnitudeB[i] = 0.272f * redFlat[i] + 0.534f * greenFlat[i] + 0.131f * blueFlat[i];
    if (255.0f < magnitudeR[i]) magnitudeR[i] = 255.0f;
    if (255.0f < magnitudeG[i]) magnitudeG[i] = 255.0f;
    if (255.0f < magnitudeB[i]) magnitudeB[i] = 255.0f;
}
for (int i = 0; i < width * height; i += 8) {
    FloatVector<Shapes.S256Bit> c1 = fspec.broadcast(0.393f);
    FloatVector<Shapes.S256Bit> c2 = fspec.broadcast(0.769f);
    FloatVector<Shapes.S256Bit> c3 = fspec.broadcast(0.189f);
    FloatVector<Shapes.S256Bit> c4 = fspec.broadcast(0.349f);
    FloatVector<Shapes.S256Bit> c5 = fspec.broadcast(0.686f);
    FloatVector<Shapes.S256Bit> c6 = fspec.broadcast(0.168f);
    FloatVector<Shapes.S256Bit> c7 = fspec.broadcast(0.272f);
    FloatVector<Shapes.S256Bit> c8 = fspec.broadcast(0.534f);
    FloatVector<Shapes.S256Bit> c9 = fspec.broadcast(0.131f);
    FloatVector<Shapes.S256Bit> c10 = fspec.broadcast(255f);
    FloatVector<Shapes.S256Bit> redVec = fspec.fromArray(redFlat, i);
    FloatVector<Shapes.S256Bit> greenVec = fspec.fromArray(greenFlat, i);
    FloatVector<Shapes.S256Bit> blueVec = fspec.fromArray(blueFlat, i);
    FloatVector<Shapes.S256Bit> res1 = redVec.mul(c1).add(greenVec.mul(c2)).add(blueVec.mul(c3));
    FloatVector<Shapes.S256Bit> res2 = redVec.mul(c4).add(greenVec.mul(c5)).add(blueVec.mul(c6));
    FloatVector<Shapes.S256Bit> res3 = redVec.mul(c7).add(greenVec.mul(c8)).add(blueVec.mul(c9));
    res1.blend(c10, res1.lessThan(c10)).intoArray(magnitudeR, i);
    res2.blend(c10, res2.lessThan(c10)).intoArray(magnitudeG, i);
    res3.blend(c10, res3.lessThan(c10)).intoArray(magnitudeB, i);
}

SEPIA Filter
35
vmulps ymm3,ymm2,ymm1
vmulps ymm1,ymm4,ymm13
vmulps ymm0,ymm5,ymm4
vaddps ymm0,ymm3,ymm0
vaddps ymm0,ymm0,ymm1
vcmpps ymm1,ymm0,ymm12,1h
vblendvps ymm0,ymm12,ymm0,ymm1
vmovdqu ymmword ptr [r12+r11*8+10h],ymm0

• AVX2 SIMD arithmetic operations (vmulps, vaddps)
• Advanced vector operation for the blend (vblendvps)
[Images: the picture before filtering and after filtering with the Vector API implementation]
Up to 6x faster than the original implementation

BLAS Performance
36
• BLAS I and II algorithms are used in machine learning libraries for linear models (logistic and linear regression), collaborative filtering, etc. (for example, Spark ML)
• BLAS III routines like GEMM are applicable to neural network and deep learning algorithms.
• Up to 4.5x performance speed-up across BLAS I/II and III algorithms*

Up to 4.5x performance speed-up on BLAS routines*
*OpenJDK Project Panama source build 09182017. Java HotSpot 64-bit Server VM (mixed mode). OS version: CentOS 7.3 64-bit. Intel® Xeon® Platinum 8180 processor (using 512-byte and 1024-byte chunks of floating point data). JVM options: -XX:+UnlockDiagnosticVMOptions -XX:-CheckIntrinsics -XX:TypeProfileLevel=121 -XX:+UseVectorApiIntrinsics
[Chart: "Vector API performance improvements" across BLAS routines — SDOT (vector operations), SSPR and SSYR (matrix-vector operations), SGEMM (matrix-matrix operations); y-axis: speed-up from 0x to 5x]

Intrinsification coverage of Vector API
• Parts of the API are supported via intrinsification:
  • Float, Double, and Int vectors of 128-, 256-, and 512-bit sizes have partial API support.
  • add, sub, mul, div, and sumAll are supported; equal, lessThan, and blend are supported for the 128- and 256-bit sizes. The rest are in development and will arrive on a regular basis through the rest of this year and early 2018.
  • All examples shown with generated code are supported.
• To experiment with the API without a performance requirement, pass "-XX:-UseVectorApiIntrinsics" to the VM to disable Vector API intrinsification.
  • This ensures stability and full coverage via the Java implementation (but will be slow for now).
37

// Mandelbrot set kernel written with the Vector API
FloatVector<Shapes.S256Bit> offsets = F_SPEC.fromArray(new float[]{0, 1, 2, 3, 4, 5, 6, 7}, 0);
FloatVector<Shapes.S256Bit> vwidth = F_SPEC.broadcast(width);
FloatVector<Shapes.S256Bit> vheight = F_SPEC.broadcast(height);
FloatVector<Shapes.S256Bit> two = F_SPEC.broadcast(2);
FloatVector<Shapes.S256Bit> sheight = vheight.div(two);
FloatVector<Shapes.S256Bit> swidth = vwidth.div(two);
FloatVector<Shapes.S256Bit> thresh = F_SPEC.broadcast(4);
FloatVector<Shapes.S256Bit> zoomfactor1 = F_SPEC.broadcast(0.73f);
FloatVector<Shapes.S256Bit> zoomfactor2 = F_SPEC.broadcast(0.10f);
FloatVector<Shapes.S256Bit> zoomstep = F_SPEC.broadcast(this.zoom_cur_step);
IntVector<Shapes.S256Bit> iones = I_SPEC.broadcast(1);
IntVector<Shapes.S256Bit> max = I_SPEC.broadcast(this.iterations);
for (int row = 0; row < height; row++) {
    for (int col = 0; col < width; col += F_SPEC.length()) {
        FloatVector<Shapes.S256Bit> cre = F_SPEC.broadcast(col).add(offsets).sub(swidth);
        cre = cre.mul(thresh).div(vwidth).div(zoomstep).sub(zoomfactor1);
        FloatVector<Shapes.S256Bit> cim = F_SPEC.broadcast(row).sub(sheight).mul(thresh);
        cim = cim.div(vwidth).div(zoomstep).add(zoomfactor2);
        FloatVector<Shapes.S256Bit> x = F_SPEC.zero();
        FloatVector<Shapes.S256Bit> y = F_SPEC.zero();
        IntVector<Shapes.S256Bit> iter = I_SPEC.zero();
        Vector.Mask<Float, Shapes.S256Bit> mres = F_SPEC.trueMask();
        while (mres.anyTrue() && iter.lessThan(max).allTrue()) {
            FloatVector<Shapes.S256Bit> x_new = x.mul(x).sub(y.mul(y)).add(cre);
            y = two.mul(x).mul(y).add(cim);
            x = x_new;
            IntVector<Shapes.S256Bit> temp = iter.add(iones);
            iter = temp.blend(iter, mres.rebracket(Integer.class));
            mres = x.mul(x).add(y.mul(y)).lessThan(thresh);
        }
        IntVector<Shapes.S256Bit> res = iter.blend(I_SPEC.zero(), iter.lessThan(max));
        res.intoArray(buff, 0);
        for (int i = 0; i < buff.length; i++) {
            image.setRGB(col + i, row, colors[buff[i] % colors.length]);
        }
    }
}

Where to get it? How to use it?
• Project Panama contains the Vector API:
  • http://openjdk.java.net/projects/panama/
• Intel Developer Zone - Vector API Developer Program:
  • https://software.intel.com/en-us/articles/vector-api-developer-program-for-java
  • Webpage with information for developers to get started with the Vector API
  • Code samples for standard BLAS and FSI algorithms
39

Backup
40

Contributors
Oracle:
• Paul Sandoz
• John Rose
• Vladimir Ivanov
Intel:
• Razvan Lupusoru
• Vivek Deshpande
• Rahul Kundu
• Sandhya Viswanathan
• Shravya Rukmannagari
• Ian Graves (ex-Intel)
41

Int128Vector
Value Type Possible Memory Layout 1:  [optional Value Header] | 128-bit Value
Value Type Possible Memory Layout 2:  [optional Value Header] | 64-bit long | 64-bit long
Value Type Possible Memory Layout 3:  [optional Value Header] | 32-bit int | 32-bit int | 32-bit int | 32-bit int