Transcript of meetupfiles.meetup.com/18712511/nvresearch-spark-20160407_final.pdf
M. Naumov, J. Daw, V. Ditya, A. Fit-Florea and S. Migacz
Spark and GPUs
04/07/2016
Key Issues that Need to Be Addressed

- Data: contiguous memory layout
- Code: intercept compute-intensive calls; compile Java bytecode to PTX
- Job placement: awareness of nodes with and without GPUs; different GPU configurations
Data

- Contiguous memory layout: Java Unsafe API, Java NIO buffers
- Keep track of where the data is: reuse it on CPU/GPU instead of always copying
- And more: data layout in memory, UVM, …
Code

- Intercept compute-intensive calls: wrap library calls (using JNI, jCUDA, SWIG, …).
  Key question: what algorithms are important?
- Compile Java bytecode to PTX: likely limits the functions you can write,
  but may be enough for the majority of users.
Job Placement

- Awareness of nodes with/without GPUs by all schedulers, such as Mesos, YARN, …
- Different configurations:
  - multiple processes per GPU
  - multiple GPUs per process
  - processes with memory requirements larger than the memory of the GPU(s)
Spark Language Interfaces: PyCUDA, SWIG, JNI, MLLib with NVBLAS
Python CUDA Bindings (PyCUDA)

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# CUDA kernel (compiled by PyCUDA at runtime)
mod = SourceModule("""
__global__ void vector_add(float *a, float *b, float *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
""")

# CUDA run (inputs a, b added here so the snippet is self-contained;
# only threadIdx.x is used, so all 3 blocks compute the same two elements)
a = numpy.array([1.0, 2.0], dtype=numpy.float32)
b = numpy.array([10.0, 20.0], dtype=numpy.float32)
c = numpy.empty_like(a)
vector_add = mod.get_function("vector_add")
vector_add(drv.In(a), drv.In(b), drv.Out(c), block=(2, 1, 1), grid=(3, 1, 1))
Caveat I: you must be able to serialize/deserialize (Java) or pickle/unpickle
(Python) the lambda/closure/function supplied to Spark operations, such as map.
In practice, this often means the function must be "self contained".

Caveat II: currently there is a lot of overhead in PyCUDA, which seems to
include compiling the CUDA kernel at Spark runtime.

Caveat III: currently there is no way to leave and reuse data on the GPU.

Nikolai Sakharnykh, Spark - Python + PyCUDA
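To illustrate Caveat I, here is a minimal sketch (not from the slides; gpu_double and the data are illustrative, and sc is the usual SparkContext): all imports and the kernel live inside the function passed to mapPartitions, so the closure pickles cleanly. Per Caveat II, note that the kernel is recompiled for every partition.

def gpu_double(iterator):
    # imports inside the function keep the closure self-contained
    import numpy
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule
    a = numpy.array(list(iterator), dtype=numpy.float32)
    c = numpy.empty_like(a)
    # compiled at Spark runtime, once per partition (Caveat II)
    mod = SourceModule("""
    __global__ void scale(float *a, float *c) {
        int i = threadIdx.x;
        c[i] = 2.0f * a[i];
    }
    """)
    scale = mod.get_function("scale")
    # assumes small, non-empty partitions (one thread per element)
    scale(drv.In(a), drv.Out(c), block=(len(a), 1, 1), grid=(1, 1, 1))
    return c.tolist()

res = sc.parallelize(range(8), 2).mapPartitions(gpu_double).collect()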
Python-C/C++ Interface Generation Tool (SWIG)

Python:

def test_add(n):
    x = [numpy.float64(i+1) for i in range(n)]
    y = [numpy.float64(10*(i+1)) for i in range(n)]
    e, r = mn.add(len(x), len(x), x, y)

C/C++:

int add(int n, double *r, double *x, double *y) {
    for (int i = 0; i < n; i++) {
        r[i] = x[i] + y[i];
    }
    return 1;
}

SWIG generates the Python object-layer code: a preamble, the wrapped C/C++
function call, and a postamble.
Code Example (typemaps for variables):

…
%define tmp2c_v(type, name)
#define PyType_AsType PyType_AsType_##type
%typemap(in) (type name) {
    $1 = PyType_AsType($input);
}
#undef PyType_AsType
%enddef
…
- Similar to PyCUDA, but does not compile code on the fly.
- Allows easier wrapping of CUDA library calls.
- Be careful with data returned in arrays.
- Be careful with names across multiple library calls
  (they are all treated using the same rules).
- SWIG can also generate interfaces to other languages
  (for example, Java using JNI).

Nikolai Sakharnykh, Spark - Python + SWIG
Java Native Interface (JNI)

Can be used for a Scala-C/C++ interface.

Scala:

class Binding {
  @native def iArrayMethod(a: Array[Int]): Int
}

object Test extends App {
  System.loadLibrary("Binding")
  val b = new Binding
  val sum = b.iArrayMethod(Array(1, 2, 3))
  …
}

C/C++:

JNIEXPORT jint JNICALL Java_Binding_iArrayMethod
  (JNIEnv* env, jobject obj, jintArray array) {
  int sum = 0;
  jsize len = (*env)->GetArrayLength(env, array);
  jint* x = (*env)->GetIntArrayElements(env, array, 0);
  for (int i = 0; i < len; i++) {
    sum += x[i];
  }
  (*env)->ReleaseIntArrayElements(env, array, x, 0);
  return sum;
}
Spark - Scala + JNI

- Similar to SWIG, but using JNI instead of the Python object layer.
- Allows easier wrapping of CUDA library calls.
- Be careful with arrays (GetIntArrayElements might make extra copies).
- We have integrated these bindings into the Spark Maven project and they are
  accessible from any class.
MLLib: Spark Machine Learning Library

- Allows the use of native BLAS libraries (such as Intel MKL).
- NVBLAS is plug-and-play: it intercepts host BLAS level-3 calls and offloads
  the computation to cuBLAS when beneficial.
- Supports multiple GPUs.
- Designed to support preloading (no need to even recompile the code).

Spark - MLLib + NVBLAS
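A usage sketch of the preloading approach (assumed, not from the slides; all paths and the application name are placeholders): NVBLAS is configured through an nvblas.conf file and preloaded at launch, so MLLib's native BLAS level-3 calls are intercepted without recompiling anything.

# nvblas.conf: CPU BLAS fallback and GPUs to use
NVBLAS_CPU_BLAS_LIB /path/to/libmkl_rt.so
NVBLAS_GPU_LIST ALL

# launch: preload NVBLAS ahead of the host BLAS
LD_PRELOAD=libnvblas.so NVBLAS_CONFIG_FILE=/path/to/nvblas.conf spark-submit app.py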
Investigation of Spark Operators: Basics, Prefix Sum, All-to-All
Existing Operators

Transforms, shuffles, and actions: map, flatMap, mapPartitions[WithIndex],
zip[WithIndex], union, intersect, filter, sortBy[Key], partitionBy, reduce, …

Code Example:

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> res = rdd.reduce(lambda x, y: x + y)
>>> print(res)
10

(1 + 2 + 3 + 4 = 10)
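For example, one operator of each kind on the same RDD (the classification comments are our reading of the slide):

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> shifted = rdd.map(lambda x: x + 1)       # transform (lazy)
>>> ordered = rdd.sortBy(lambda x: x)        # shuffle (repartitions the data)
>>> total = rdd.reduce(lambda x, y: x + y)   # action (returns 10 to the driver)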
Motivation for New Operators

Many algorithms are not easily expressed with existing operators.

Consider sparse matrix-vector multiplication y = A x, where the matrix A in
CSR format is represented by the arrays Ap, Ac and Av. It is a standard
benchmark for HPC, and it is also the key operation in the Power method used
to compute the PageRank of web pages.
Sparse Matrix Storage Formats

Dense example (4x4, 1-based indexing, nonzeros numbered 1.0-7.0 in row-major order):

    1.0   .    .    .
    2.0  3.0   .    .
     .    .   4.0   .
    5.0   .   6.0  7.0

Coordinate (COO), row-major order:

  Row Index: 1   2   2   3   4   4   4
  Col Index: 1   1   2   3   1   3   4
  Values:    1.0 2.0 3.0 4.0 5.0 6.0 7.0

Compressed Sparse Row (CSR):

  Ap: 1   2   4   5   8
  Ac: 1   1   2   3   1   3   4
  Av: 1.0 2.0 3.0 4.0 5.0 6.0 7.0

Compressed Sparse Column (CSC), column-major order:

  Ap: 1   4   5   7   8
  Ar: 1   2   4   2   3   4   4
  Av: 1.0 2.0 5.0 3.0 4.0 6.0 7.0
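For reference, a minimal local sketch of CSR sparse matrix-vector multiplication (not from the slides; it uses the example matrix above converted to 0-based indices):

def spmv_csr(Ap, Ac, Av, x):
    # y = A x for A stored in CSR format (0-based row pointers Ap)
    y = [0.0] * (len(Ap) - 1)
    for i in range(len(Ap) - 1):           # loop over rows
        for k in range(Ap[i], Ap[i + 1]):  # nonzeros of row i
            y[i] += Av[k] * x[Ac[k]]
    return y

# the 4x4 example matrix above, 0-based
Ap = [0, 1, 3, 4, 7]
Ac = [0, 0, 1, 2, 0, 2, 3]
Av = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(spmv_csr(Ap, Ac, Av, [1.0, 1.0, 1.0, 1.0]))  # [1.0, 5.0, 4.0, 18.0]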
Partitioning the Matrix

Split y = A x by rows: the arrays Ap, Ac, Av are partitioned into
Ap1, Ac1, Av1 and Ap2, Ac2, Av2, so that y1 = A1 x and y2 = A2 x,
with the vector x needed by every partition.

Required operations:
- Partition arrays
- Insert (at index)
- Compute prefix sum
- Broadcast/Collect
- Numeric operations
numElements (per partition)

def getNumElements(self):
    return self.map(lambda x: 1).reduce(lambda x, y: x + y)

def getNumLocalElements(self):
    return self.mapPartitions(lambda p: [sum(1 for x in p)])

Code Example:

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> ne = rdd.getNumElements()        # single number: 1+1+1+1 = 4 (same as count())
>>> nle = rdd.getNumLocalElements()  # RDD: [1+1], [1+1] = [[2], [2]]
[find|insert|remove|swap][at]Index

def findIndex(self, e):
    res = self.zipWithIndex().filter(lambda (x, k): x == e)
    # check whether res is empty; if not, return the smallest matching index
    return res.map(lambda (x, k): k).reduce(min)

Code Example:

>>> res = sc.parallelize([1, 3, 3, 2], 2).findIndex(3)
>>> print(res)
1

values:  1 3 3 2
indices: 0 1 2 3   (be careful with 0/1 based indexing)
Also, need local versions:

def findLocalIndex(self, e):
    res = self.zipWithLocalIndex().filter(lambda (x, k): x == e)
    # check whether res is empty; if not, take the minimum index per partition
    return res.mapPartitions(find_min_in_a_list)  # helper to be defined

Code Example:

>>> res = sc.parallelize([1, 3, 3, 2], 2).findLocalIndex(3)
>>> print(res.glom().collect())
[[1], [0]]

values:        1 3 | 3 2
local indices: 0 1 | 0 1   (be careful with 0/1 based indexing)
Prefix Sum (by Key)

row indices:         1 2 2 3 4 4 4
map each to 1:       1 1 1 1 1 1 1
count (by key):      keys 1 2 3 4  ->  counts 1 2 1 3
add (prefix sum):    1 3 4 7

This can be used to convert from COO to CSR format:
Ap = 1 2 4 5 8   (the +1 is optional, based on 0/1 based indexing)
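A minimal local sketch of that conversion (not from the slides; 0-based indexing, so the +1 is omitted, and coo_to_csr_ptr is our name):

def coo_to_csr_ptr(row_idx, n_rows):
    # count nonzeros per row
    counts = [0] * n_rows
    for r in row_idx:
        counts[r] += 1
    # prefix sum over the counts gives the CSR row pointers
    Ap = [0]
    for c in counts:
        Ap.append(Ap[-1] + c)
    return Ap

print(coo_to_csr_ptr([0, 1, 1, 2, 3, 3, 3], 4))  # [0, 1, 3, 4, 7]; +1 gives the 1-based 1 2 4 5 8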
Prefix Sum

def prefixSum(self):
    # compute prefix sum by shifting and filtering keys
    rdd = self.map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y) \
              .sortBy(lambda (k, x): k)
    n = rdd.getNumElements()
    offset = next_pow2(n)  # helper: smallest power of two >= n
    while offset > 0:
        set1 = rdd.map(lambda t: t)
        set2 = rdd.map(lambda (k, x): (k + offset, x)).filter(lambda (k, x): k < (n + 1))
        rdd = set1.union(set2).reduceByKey(lambda x, y: x + y).sortBy(lambda (k, x): k)
        offset = int(offset / 2)
    return rdd

Starting point (keys are colors in the original figure):
  input 1s:               1 1 1 1 1 1 1
  counts by key:          1 2 1 3
  final result we expect: 1 3 4 7

Step by step:
  offset = 2:  1 2 1 3  ->  1 2 2 5
  offset = 1:  1 2 2 5  ->  1 3 4 7

Code Example (we can similarly have a local variant):

>>> rdd = sc.parallelize([1, 2, 3, 4, 4, 2, 4], 2)
>>> rdd.prefixSum()
[(1,1), (2,3), (3,4), (4,7)]
numOps[Mixed]

def numOpsMixed(self, other, func):
    # ASSUMPTION: number of partitions is the same
    rdd = self.zipPartitions(other)  # creates an rdd whose elements are partitions
    def apply_func((p, q)):
        for y in q:
            for x in p:
                yield func(x, y)
    res = rdd.flatMap(apply_func)
    return res

Example (element-wise +): partitions [1, 2], [1, 2] combined with [10], [20]
give [11, 12], [21, 22].
AllToAll

def allToAll(self, np, partitionFunc):
    # define add_partition_index_to_each_element and use it below …
    # (it tags each element with its source partition index)
    rdd = self.mapPartitionsWithIndex(add_partition_index_to_each_element)
    def expand_p_index(x):
        # replicate each tagged element to every destination partition
        for k in range(np):
            yield (k, x)
    res = rdd.flatMap(expand_p_index).partitionBy(np, partitionFunc).map(lambda (k, x): x)
    return res.sortLocalByKey().map(lambda (k, x): x)

(Figure: step-by-step trace of tagging, expansion, partitioning and local
sort, starting from the input partitions [1, 2] and [3, 1]; afterwards every
partition holds a copy of all elements.)
Partitioning the Matrix (recap of the earlier slide)
Discussion with Audience

Algorithms and Challenges

- What algorithms would you like to implement? PCA (SVD), SVM, ALS, K-Means, …
- Are you interested in machine learning (other than deep learning)?
- How is Python/Scala/Java used? What code/problems are interesting?
- What is your vision for how Spark should be aware of GPU resources,
  in conjunction with a resource manager (such as Mesos)?
- What challenges do you have in using GPUs? Performance/Power/$,
  memory layout (JVM vs. C/C++), …
Backup Slides
PageRank (from a Linear Algebra Perspective)

• Let C \in R^{n x n} be a scaled adjacency matrix (with row sums = 1), let the vector b \in \{0,1\}^n have 1 in place of dangling nodes (indices of empty rows), and let u = (1/n)e, where e = [1, …, 1]^T.
• Find the largest eigenpair (in which the eigenvector = PageRank) of
  A x = \lambda x, where A = \alpha (C + b u^T) + (1 - \alpha)(u e^T)
• The simplest approach is the Power method;
  its key operation is sparse matrix-vector multiplication.
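A minimal local sketch of that Power method (not from the slides; it reuses the spmv_csr routine sketched earlier, and alpha = 0.85 and the iteration count are illustrative assumptions):

def pagerank_power(Cp, Cc, Cv, b, n, alpha=0.85, iters=100):
    # x starts as the uniform vector u = (1/n) e
    x = [1.0 / n] * n
    for _ in range(iters):
        s = sum(x)                    # e^T x, so u^T x = s / n
        Cx = spmv_csr(Cp, Cc, Cv, x)  # key operation: SpMV
        # y = A x = alpha (C x + b (u^T x)) + (1 - alpha) u (e^T x)
        x = [alpha * (Cx[i] + b[i] * s / n) + (1.0 - alpha) * s / n
             for i in range(n)]
        nrm = sum(abs(xi) for xi in x)  # renormalize in the 1-norm
        x = [xi / nrm for xi in x]
    return x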