International Journal of Advanced Research in Engineering and Technology
(IJARET) Volume 6, Issue 7, Jul 2015, pp. 11-23, Article ID: IJARET_06_07_003
Available online at
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=6&IType=7
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
© IAEME Publication
___________________________________________________________________________
SPARSE STORAGE RECOMMENDATION
SYSTEM FOR SPARSE MATRIX VECTOR
MULTIPLICATION ON GPU
Monika Shah
Department of Computer Science & Engineering, Nirma University
ABSTRACT
Sparse Matrix Vector Multiplication (SpMV), Ax = b, is a well-known kernel
in science, engineering, and the web world. Harnessing the large computing
capabilities of the GPU device, many sparse storage formats have been proposed
to optimize the performance of SpMV on GPU. The Compressed Sparse Row (CSR),
ELLPACK (ELL), Hybrid (HYB), and Aligned COO sparse storage formats are known
for efficient implementation of SpMV on GPU across a wide spectrum of sparse
matrix patterns. Researchers have observed that the performance of SpMV on GPU
for a given matrix A can vary widely depending on the sparse storage format
used. Hence, it has become a great challenge to choose an appropriate storage
format from this collection for a given sparse matrix. To resolve this problem,
this paper proposes an algorithm that recommends a highly suitable storage
format for a given sparse matrix. The system uses simple metrics of the matrix
(such as row length, number of rows, number of columns, and number of non-zero
elements) to analyse the impact of different storage formats on the performance
of SpMV. To demonstrate the influence of this algorithm, the performance of
SpMV and of its associated application, the Conjugate Gradient Solver (CGS),
has been compared over various sparse matrix patterns with various sparse
formats.
Key words: Sparse Matrix, SpMV, Sparse Format, Heuristics, K-means
Clustering, Load Balance
Cite this Article: Shah, M. Sparse Storage Recommendation System for
Sparse Matrix Vector Multiplication on GPU. International Journal of
Advanced Research in Engineering and Technology, 6(7), 2015, pp. 11-23.
http://www.iaeme.com/currentissue.asp?JType=IJARET&VType=6&IType=7
_____________________________________________________________________
1. INTRODUCTION
For many years now, Sparse Matrix Vector Multiplication (SpMV) has been the
most prominent computational dwarf for science and engineering applications.
Linear algebra solvers (such as partial differential equations [1], [2], the
conjugate gradient solver [3], [4], and Gaussian reduction of complex
matrices), fluid dynamics [5], database query processing on large databases
(LDB) [6], information retrieval [7], network theory [8], [9], PageRank
computation [10], and the physics of disordered and quantum systems [11] are
well-known applications that make recurrent use of SpMV. The sparse matrices
used in these applications vary widely in non-zero pattern.
The continuous growth in the number of computer users, and their increasing
usage, constantly enlarges the datasets used in such applications. This
continuous and exponential growth of datasets has raised the need for High
Performance Computing. Researchers have provided many solutions through
inventions in high-performance device architectures, such as the Graphical
Processing Unit (GPU), and through algorithms optimized for these devices. The
GPU is a well-known, promising high-performance device for regular
applications. Hence, it is a great challenge to use the GPU for an irregular
application like SpMV.
Generalized implementation of parallel SpMV is complicated by the following
properties of sparse matrices:
1. Imbalanced number of nonzero elements in each row
2. Imbalanced number of nonzero elements in each column
3. Wide range of sparse patterns (diagonal, skewed, power-law distribution of
non-zero elements per row, almost equal number of non-zero elements per row,
block, etc.)
4. Varied sparsity level (ratio of nonzero elements to the size of the matrix)
For an efficient and generalized implementation of SpMV on GPU, two important
factors have been identified by past research [12]: (i) synchronization-free
load distribution among computational resources, and (ii) reduced fetch
operations, to mitigate the cost of high-latency memory access on the GPU.
Hence, it is preferable to select a sparse storage format that supports high
compression along with good synchronization-free load distribution. The major
challenges in satisfying these factors are:
1. Continuous growth in datasets makes sparse matrices very large.
2. Indirection used in the storage representation of a sparse matrix increases
the size of the data to be transferred from CPU to GPU as an additional
overhead.
3. A large class of sparse matrix patterns exists.
4. Work distribution is difficult to balance due to the imbalanced number of
nonzero elements in each row as well as in each column.
5. Concurrency is restricted by data dependency among row elements when
computing the output vector.
Harnessing the high computing capabilities of the GPU, the unceasing
performance demand of the SpMV kernel motivates researchers to optimize SpMV
on GPU in a way that deals with all the challenges listed above. In past
research, Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse
Column (CSC), ELLPACK (ELL), Hybrid (HYB), and Jagged Diagonal Storage (JDS)
have been proposed with different compression strategies [13], together with
SpMV algorithms for these sparse storage formats on GPU. The bulky index
structure of the COO format reduces the degree of synchronization-free load
distribution among parallel threads and increases communication overhead
between CPU and GPU. CSC schedules all columns of a sparse matrix sequentially
in SpMV, and the vector b is loaded and stored frequently in each iteration,
causing recurrent communication overhead that limits the performance of CSC on
GPU. These factors explain the limited popularity of the COO and CSC sparse
formats on GPU. Aligned Coordinate (Aligned COO) [12] was introduced as a
compressed format suitable for synchronization-free balanced load distribution
and proper cache utilization.
Sparse matrix metrics such as Number of Rows (NR), Number of Columns (NC),
Number of Non-Zero elements (NNZ), non-zero elements in a row (row_len), and
non-zero elements in a column (col_len) play an important role in the
compression ratio and the degree of parallelism of the various sparse formats.
An important point here is that the compression ratio of these recognized
sparse storage formats varies with the sparsity level and sparse pattern of
the input matrix. Considering these factors, this paper proposes an algorithm
to recommend a highly suitable storage format for a given sparse matrix. The
remainder of the paper is structured as follows: the course of optimizing
sparse formats and their SpMV implementations is traced in Section II. Section
III presents our attempt to define heuristics and an algorithm that recommends
a highly suitable storage format for implementing SpMV on GPU. A parallel
algorithm of CGS is discussed in Section III-D. Section IV demonstrates and
analyses the results of this proposed work. The conclusion of the paper is
given in Section V.
2. SPARSE STORAGE FORMATS
Many storage formats have been proposed as a result of past effort by
researchers. As mentioned in Section I, compressed storage,
synchronization-free load distribution, and the highest possible concurrency
have become the main goals in designing sparse matrix formats for the NVIDIA
GPU and the CUDA programming environment. Bell et al. [13] introduced the
storage formats COO, CSR, CSC, ELL, and HYB, supporting different levels of
compression for different sparse matrix patterns. Shah et al. [12] introduced
Aligned COO. Many other extensions of these benchmark sparse formats [14],
[15], [16], [17], as well as hybrids of these storage formats [18], [19],
[20], have also been proposed. The tragedy, even after research on this large
set of sparse formats, is that no standard format is suitable for almost all
classes of sparse matrix patterns. In addition, it is also difficult to
identify the sparse matrix format supporting the best compression as well as
synchronization-free and balanced work-load distribution.
Table 1 Sparse Matrix Formats and Their Space Complexity

Sparse matrix format    Space complexity
COO                     NNZ × 3
CSC                     NNZ × 2 + (NC + 1)
CSR                     NNZ × 2 + (NR + 1)
ELL                     (NR × max_row_length) × 2
HYB                     ≅ ELL for rows with similar length;
                        ≅ COO for the rest of the row elements
Aligned COO             Num_segments × Segment_length × 3
                        ≅ (max_row_length × (≤ NR) × 3)
Selection of a proper data compression strategy is important for two major
reasons: (i) the data transfer overhead between CPU and GPU, and (ii) the
memory access pattern of each concurrent thread depends on the data structure.
Table 1 presents the memory space required by the various sparse formats. It
shows that the compression percentage of the same format varies from one
sparse matrix to another, based on the basic statistics of the matrix. For
example, COO provides the highest compression for small and highly sparse
matrices; CSC and CSR give better compression for matrices that are small in
terms of columns and rows, respectively; ELL is suitable for compressing a
sparse matrix with little difference in NNZ between rows and a large number of
rows. COO, CSC, CSR, and ELL are known as the core sparse storage formats
designed to support higher compression. HYB is designed to reduce the padding
space of the ELL format and offers better compression in the form of a hybrid
pattern of ELL and COO. Aligned COO provides better compression compared to
ELL for a highly skewed sparse matrix with power-law distribution.
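To make the comparison in Table 1 concrete, the following sketch (not from the
paper; SciPy is assumed only to build a test matrix, and the function name is
ours) evaluates the entry counts of each format for an arbitrary matrix:

```python
# A minimal sketch (not from the paper) that evaluates the space formulas of
# Table 1 for a given matrix.
import numpy as np
from scipy.sparse import random as sparse_random

def storage_entries(nr, nc, nnz, row_len):
    """Entry counts per Table 1 (HYB depends on the ELL/COO split, so omitted)."""
    max_row_len = int(row_len.max())
    return {
        "COO": nnz * 3,
        "CSC": nnz * 2 + (nc + 1),
        "CSR": nnz * 2 + (nr + 1),
        "ELL": nr * max_row_len * 2,
        "AlignedCOO_bound": max_row_len * nr * 3,  # Table 1 upper bound
    }

A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)
row_len = np.diff(A.indptr)  # nonzeros per row
print(storage_entries(A.shape[0], A.shape[1], A.nnz, row_len))
```

On a matrix with one very long row, the ELL and Aligned COO figures balloon
while CSR stays proportional to NNZ, which is exactly the trade-off the text
describes.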
Table 2 SpMV Algorithms and Their Time Complexity

SpMV           Time complexity (excluding memory access overhead)
COO_flat       $O\left(\lceil NNZ / max\_concurrent\_threads \rceil\right)$
CSC            $O\left(\lceil max\_col\_len / max\_concurrent\_threads \rceil \times NC\right)$
CSR            $\le O\left(\lceil NR / max\_concurrent\_threads \rceil \times max\_row\_len\right)$
CSR (vector)   $\le O\left(\lceil NR / max\_warps \rceil \times \left(max\_warps\_per\_row + \log_2 warp\_size\right)\right)$
               where $max\_warps = \lceil max\_concurrent\_threads / warp\_size \rceil$ and
               $max\_warps\_per\_row = \lceil max\_row\_len / warp\_size \rceil$
ELL            $\ge O\left(\lceil NR / max\_concurrent\_threads \rceil \times max\_row\_len\right)$
HYB            ≅ ELL for rows with similar length,
               + ≅ COO_flat for the rest of the row elements
Aligned_COO    ≅ CSR for aligned rows,
               + ≅ COO_flat for the rest of the row elements
Increased concurrency and synchronization-free load distribution are important
factors in reducing the runtime of parallel SpMV on GPU. Table 2 presents the
run-time complexity of the SpMV implementations of the sparse storage formats
listed above. The COO_flat algorithm offers the highest concurrency, but does
not ensure synchronization-free load distribution among concurrent threads,
because row elements may cross warp boundaries. CSC is also less preferred,
due to the additional overhead of accessing the output vector in every
iteration. ELL, on the other hand, has the overhead of transferring extraneous
memory containing zero padding over high-latency memory. CSR and ELL have very
similar SpMV algorithms, except for the additional memory accesses CSR
performs to fetch row indices. The CSR implementation on GPU is more efficient
than ELL where the NNZ to be accessed by one thread block and iteration is
much larger than in another block or iteration. CSR Vector provides much
higher concurrency than CSR and ELL, but has the overhead of a series of
parallel reduction steps in each thread. Hence, CSR Vector is not suitable
when the average NNZ per row is less than the number of steps required by the
parallel reduction, that is, log2(warp size). The HYB and Aligned COO kernels
are designed to make SpMV efficient using hybrids of the sparse formats and
kernels mentioned above. Aligned COO reorders nonzero elements to balance the
workload among computing resources, and thus reduces the number of row
segments compared to the number of rows in the ELL format while keeping the
maximum row length the same as the original. Hence, Aligned COO gives
optimized performance for highly skewed sparse matrix patterns.
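As an illustration of how the Table 2 bounds trade off, the sketch below (an
assumption-laden estimate, not the paper's code; the default
max_concurrent_threads is a Fermi-class figure) compares the step counts of
scalar CSR and CSR Vector:

```python
# An illustrative estimate of the Table 2 step counts for scalar CSR vs.
# CSR Vector; device limits default to assumed Fermi-class values.
import math

def csr_scalar_steps(nr, max_row_len, max_concurrent_threads=14 * 1536):
    # One thread per row: ceil(NR / threads) rounds of max_row_len work each.
    return math.ceil(nr / max_concurrent_threads) * max_row_len

def csr_vector_steps(nr, max_row_len, max_concurrent_threads=14 * 1536,
                     warp_size=32):
    # One warp per row, plus a log2(warp_size)-step parallel reduction per row.
    max_warps = max_concurrent_threads // warp_size
    max_warps_per_row = math.ceil(max_row_len / warp_size)
    return math.ceil(nr / max_warps) * (max_warps_per_row
                                        + int(math.log2(warp_size)))

# Short rows favour scalar CSR; longer rows on a smaller matrix favour Vector.
print(csr_scalar_steps(100_000, 8), csr_vector_steps(100_000, 8))
print(csr_scalar_steps(10_000, 512), csr_vector_steps(10_000, 512))
```

The crossover matches the observation above: a warp-per-row kernel only pays
off once rows hold clearly more than log2(warp size) nonzeros.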
3. PROPOSED WORK
Section II discussed the strengths and weaknesses of various sparse matrix
storage formats and their SpMV implementations. It indicates that the
selection of the sparse storage format is an important factor for efficient
SpMV on GPU. The collection of SpMV algorithms JAD, CSR, ELL, CSR Vector, HYB,
and Aligned COO covers a wide spectrum of sparse matrix patterns with good
performance. Recognizing the sparse matrix pattern is a great challenge.
Statistical analysis is considered a good methodology for recognizing sparse
patterns. The diagonal pattern is simple to recognize, and JAD is recommended
for sparse matrices with a diagonal pattern. This section proposes a strategy
to suggest the most appropriate SpMV implementation for all sparse patterns
except diagonal.
The working flow of the proposed work is described in Figure 1. Here, K-means
clustering is used to generate detailed statistics from the basic matrix
statistics NR, NC, NNZ, and the row length vector rl[]. These derived
statistics are analysed and compared with pre-defined heuristics to suggest
the most appropriate SpMV algorithm. Section III-A explains the input and
output parameters of the K-means clustering algorithm. Section III-B defines
heuristics for the CSR, ELL, CSR Vector, HYB, and Aligned COO SpMV algorithms.
Figure 1 Working flow of Heuristic based Selection of SpMV algorithm
A detailed description of the heuristics-based SpMV selection algorithm is
given in Section III-C. To prove the effectiveness of the proposed algorithm,
a well-known SpMV application, CGS, is implemented on GPU as shown in Section
III-D.
3.1. K-Means Clustering and Its Parameters
Here, K-means clustering is used to identify the level of similarity among
sparse matrix rows using the row-length parameter. The K-means clustering
algorithm constructs 2 clusters based on the row length vector rl[]. For a
highly skewed sparse matrix, the centroid of a cluster is not sufficient to
predict the similarity of row lengths. Hence, the K-means clustering algorithm
is slightly modified to also identify the Lower Bound (LB), Upper Bound (UB),
Number of elements (CNT), and Centroid (C) of both cluster bins. The clusters
are named cluster H and cluster L based on their centroid values, i.e.
C_L < C_H. Accordingly, the output parameters of the K-means clustering
algorithm are named LB_H, UB_H, CNT_H, C_H, LB_L, UB_L, CNT_L, and C_L, as
shown in Figure 2.
Figure 2 K-means clustering for this proposed work
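A minimal sketch of this modified two-bin K-means follows; it is an
illustrative reimplementation (the function name and the min/max centroid
seeding are our assumptions, not the paper's code) that returns the eight
output parameters of Figure 2:

```python
# An illustrative reimplementation of the modified two-bin K-means of
# Section 3.1. Assumes at least two distinct row lengths so both bins are
# non-empty.
import numpy as np

def kmeans_row_length_stats(rl, iters=100):
    rl = np.asarray(rl, dtype=np.float64)
    c_lo, c_hi = rl.min(), rl.max()        # seed centroids at the extremes
    for _ in range(iters):
        in_hi = np.abs(rl - c_hi) < np.abs(rl - c_lo)
        new_lo, new_hi = rl[~in_hi].mean(), rl[in_hi].mean()
        if new_lo == c_lo and new_hi == c_hi:
            break                          # converged
        c_lo, c_hi = new_lo, new_hi
    lo, hi = rl[~in_hi], rl[in_hi]
    return {"LB_L": lo.min(), "UB_L": lo.max(), "CNT_L": lo.size, "C_L": c_lo,
            "LB_H": hi.min(), "UB_H": hi.max(), "CNT_H": hi.size, "C_H": c_hi}

# A skewed row-length vector: the bounds separate short and long rows.
print(kmeans_row_length_stats([1, 2, 2, 3, 90, 100, 110]))
```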
3.2. Heuristics
Based on empirical result analysis and a basic understanding of the various
SpMV algorithms, heuristics are defined to suggest a suitable sparse storage
format and a GPU-based SpMV algorithm capable of giving better performance for
a given sparse matrix. The following points are the focus in the design of
these heuristics:
1. Obtain the highest possible degree of concurrency
2. Better compression of the sparse matrix, to reduce memory access cost
3. Balanced work-load distribution among threads
4. Synchronization-free load distribution as far as possible
5. A reduced number of blocks, to reduce block scheduling cost
3.3. Heuristics for CSR Vector
CSR Vector is designed to provide the highest possible concurrency with
synchronization-free load distribution, which in turn ensures good accuracy.
Every execution thread of this SpMV algorithm executes at least 1
multiplication and log2(W) addition operations. For the execution of CSR
Vector, a warp W (a collection of execution threads, 32 threads in general) is
allotted to each row of the sparse matrix for computation. The CSR storage
format is used to implement this SpMV algorithm. But CSR is preferred for
small matrix sizes, to avoid a large number of high-latency memory accesses
for fetching row indices. Considering all these criteria, CSR Vector is
preferred when the following condition is satisfied:
(C_L ≥ log2(W)) AND (C_H ≥ W/2) AND (NNZ ≤ max_threads)
3.4. Heuristics for CSR
CSR SpMV is preferred for a small matrix where no row has a very large number
of non-zero elements and the majority of rows do not have equivalent sizes in
terms of non-zero elements. Hence, CSR SpMV is preferred when CSR Vector is
not applicable and its size-based condition is satisfied.
3.5. Heuristics for ELL
The ELL storage format and the ELL SpMV algorithm are preferred for a sparse
matrix with equivalent row lengths, which reduces the padding overhead and
improves performance. However, a large row length reduces the degree of
concurrency in ELL SpMV. ELL is preferred when there is not much difference
either between the centroid values of the two clusters or between the upper
bound of the higher-value cluster and the centroid of the cluster with the
lower centroid value. Hence, it is concluded that ELLPACK is preferred when
CSR or CSR Vector is not applicable for the given sparse matrix and the
following condition is satisfied:
(((C_H − C_L) / C_H ≤ 0.49) or ((C_H − C_L) ≤ 6))
3.6. Heuristics for HYB
When a large sparse matrix does not have equivalent row lengths, but rather a
power-law distribution of non-zero elements among the rows of the matrix with
a highly skewed visualization, the Hybrid sparse format and its SpMV are
preferred. Hence, it is concluded that HYB is preferred when the CSR and ELL
sparse formats are not suitable for the given sparse matrix and the following
condition is satisfied:
(C_H / C_L ≥ 100) or (UB_H / LB_H ≥ 100) or (NNZ / NR ≥ 100)
3.7. Heuristics for Aligned COO
The Aligned COO format and its SpMV are designed to optimize performance for a
large sparse matrix having a skewed distribution of non-zero elements, along
with a possible alignment of large rows with small rows such that the required
number of execution units can be reduced. But, as it is based on the COO
format, it provides less compression compared to HYB. Hence, Aligned COO is
preferred when neither the CSR nor ELL nor HYB formats are suitable for the
given sparse matrix.
3.8. Heuristics-Based Sparse Format Recommendation
This section describes an algorithm to suggest the most suitable sparse format
and its associated SpMV for a given sparse matrix. It performs K-means
clustering on the sparse matrix metrics and compares the output parameters
with the heuristics defined in the sections above.
Algorithm 1 Heuristics based Sparse format recommendation
Input: NNZ, NR, NC, rl[ ]
Output: Suitable_SpMV
Perform K-means clustering with two bins on rl[ ]
Test the heuristics of Sections 3.3 to 3.7 in order (CSR Vector, CSR, ELL,
HYB, Aligned COO) and return the first SpMV whose condition is satisfied
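A hedged sketch of this cascade follows. The CSR Vector test and the ELL/HYB
thresholds (0.49, 6, 100) are reconstructed from partially legible conditions
in the source, and the CSR test is a stand-in assumption, so treat the
constants as illustrative rather than as the paper's exact rules:

```python
# A sketch of Algorithm 1's decision cascade; thresholds are reconstructed or
# assumed, not verbatim from the paper.
import math

WARP = 32
MAX_THREADS = 14 * 1536  # assumed concurrent-thread limit (Fermi C2070-class)

def recommend_spmv(nnz, nr, nc, s):
    """s: cluster-statistics dict from kmeans_row_length_stats(); nc is kept
    only to mirror Algorithm 1's inputs."""
    if s["C_L"] >= math.log2(WARP) and s["C_H"] >= WARP / 2 \
            and nnz <= MAX_THREADS:
        return "CSR Vector"                          # Section 3.3
    if nr <= MAX_THREADS and s["UB_H"] < WARP:       # Section 3.4 (assumed form)
        return "CSR"
    if (s["C_H"] - s["C_L"]) / s["C_H"] <= 0.49 or (s["C_H"] - s["C_L"]) <= 6:
        return "ELL"                                 # Section 3.5
    if s["C_H"] / max(s["C_L"], 1) >= 100 or nnz / nr >= 100:
        return "HYB"                                 # Section 3.6
    return "Aligned COO"                             # Section 3.7: fallback
```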
3.9. Parallel CGS
To demonstrate the effectiveness of the hereby proposed heuristics-based
sparse format recommendation algorithm for efficient SpMV, it is preferable to
test the algorithm on an application that makes frequent and heavy use of the
SpMV kernel and that is applicable to a wide category of sparse patterns. The
Conjugate Gradient Solver (CGS) is such a well-known application: it finds the
solution vector x for Ax = b.
Every CGS call invokes the SpMV kernel in a loop of hundreds to thousands of
iterations. As the GPU has a major overhead of memory transfer between CPU and
GPU, this parallel CGS is designed in such a way that it needs to transfer the
sparse matrix A and the input vector b at the time of the first iteration
only. The GPU-based parallel CGS is described in Algorithm 2, where the SpMV
kernel is executed for the specified number of iterations, bounded by the size
of the sparse matrix.
4. EXPERIMENTAL RESULTS
For a proper evaluation of the proposed algorithm, various SpMV algorithms
(CSR, CSR Vector, ELL, and HYB SpMV from the NVIDIA cusp library, and the
Aligned COO algorithm) have been implemented on an NVIDIA GPU.
The collection of sparse matrices used in this experiment is listed along with
their basic properties in Table 3.
4.1. Test Platform
These experiments have been executed on an Intel(R) Core(TM) i3 CPU @ 3.20 GHz
with 4 GB RAM, 2 × 256 KB L2 cache, and 4 MB L3 cache, and an NVIDIA C2070 GPU
device, using CUDA version 4.0 on Ubuntu 11.
The dataset contains 31 sparse matrices, retrieved from the well-known source
The University of Florida Sparse Matrix Collection. The sparse matrices are
selected such that the collection contains matrices with various sparsity
levels and a large variety of sparse patterns.
4.2. Result Analysis
Table 4 lists the performance of the CSR, CSR Vector, ELL, HYB, and Aligned
COO algorithms in GFLOP/s for each matrix listed in Table 3. The
heuristics-based SpMV selection algorithm has been implemented and its output
compared with the performance results recorded in Table 4. The overhead of
memory transfer between CPU and GPU is always an important factor in overall
performance; this cost is considered to be amortized over a large number of
iterations. The parallel CGS algorithm listed in Algorithm 2 has also been
implemented, for 200 iterations, and its execution time including memory
transfer time has been compared with the result of our proposed algorithm. The
recommendation of the proposed algorithm is correct for 30 of the 31 sparse
matrices in the dataset.
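For reference, GFLOP/s figures such as those in Table 4 are conventionally
derived by counting one multiply and one add per nonzero element per SpMV
invocation; the paper does not spell out its accounting, so the sketch below
is an assumption:

```python
# Conventional SpMV flop accounting (an assumption, not stated in the paper):
# 2 * NNZ floating-point operations per SpMV call.
def spmv_gflops(nnz, iterations, elapsed_seconds):
    return 2.0 * nnz * iterations / (elapsed_seconds * 1e9)

# e.g. 200 SpMV calls on a matrix with ~1.06M nonzeros finishing in 65 ms
print(spmv_gflops(nnz=1_056_610, iterations=200, elapsed_seconds=0.065))
```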
Algorithm 2 Parallel CGS
Input: Sparse Matrix A, Vector b, NR, NC, iterations
Output: Vector x
Initialize vector dev_x = 0
Copy vectors from host memory to device memory (b → dev_r, and b → dev_p)
Copy matrix from host memory to device memory (A → dev_A)
Compute dev_rsold = dev_r^T × dev_r using
  dev_inner_product(dev_r, dev_r, NR, dev_rsold)
for i = 1 → min(iterations, NR × NC) do
  Initialize vector dev_Ap = 0
  Perform Ap = A × p using dev_SpMV(dev_A, dev_p, dev_Ap)
  Perform pAp = p^T × Ap using dev_inner_product(dev_p, dev_Ap, NR, dev_pAp)
  dev_alpha = dev_rsold / dev_pAp
  Asynchronous computation of dev_x and dev_r:
    Perform x += alpha × p using dev_add_scalarMul(dev_alpha, dev_p, dev_x, 1, dev_x)
    Perform r −= alpha × Ap using dev_add_scalarMul(dev_alpha, dev_Ap, dev_r, −1, dev_r)
  Compute dev_rsnew = dev_r^T × dev_r using
    dev_inner_product(dev_r, dev_r, NR, dev_rsnew)
  Copy device rsnew (dev_rsnew) to host rsnew
  if √rsnew < 1e−10 then
    Exit for
  end if
  Compute p = r + ((rsnew/rsold) × p) using
    dev_temp = dev_rsnew / dev_rsold
    dev_add_scalarMul(dev_temp, dev_p, dev_r, 1, dev_p)
  dev_rsold = dev_rsnew
end for
Copy device vector to host vector (dev_x → x)
Return x
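The following host-side NumPy/SciPy sketch mirrors Algorithm 2 step by step.
It replaces the device kernels with library calls, and the 1e-10 tolerance
follows the classic conjugate gradient recipe (the threshold is garbled in the
source), so it is an illustrative stand-in rather than the paper's CUDA code:

```python
# A host-side sketch of Algorithm 2; each line maps onto one device step.
import numpy as np
from scipy.sparse import identity, random as sparse_random

def conjugate_gradient(A, b, iterations, tol=1e-10):
    x = np.zeros_like(b)
    r = b.copy()                       # b -> dev_r
    p = b.copy()                       # b -> dev_p
    rsold = r @ r                      # dev_inner_product(dev_r, dev_r, ...)
    for _ in range(min(iterations, A.shape[0] * A.shape[1])):
        Ap = A @ p                     # dev_SpMV: the kernel under test
        alpha = rsold / (p @ Ap)       # dev_alpha = rsold / pAp
        x += alpha * p                 # dev_add_scalarMul(..., +1, dev_x)
        r -= alpha * Ap                # dev_add_scalarMul(..., -1, dev_r)
        rsnew = r @ r
        if np.sqrt(rsnew) < tol:       # host-side convergence check
            break
        p = r + (rsnew / rsold) * p
        rsold = rsnew
    return x

# Illustrative usage on a synthetic symmetric positive definite system.
n = 500
M = sparse_random(n, n, density=0.02, format="csr", random_state=0)
A = M @ M.T + 10.0 * identity(n, format="csr")
b = np.ones(n)
x = conjugate_gradient(A, b, iterations=200)
print(np.linalg.norm(A @ x - b))       # residual should be near zero
```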
5. CONCLUSION
In this paper, various factors responsible for achieving higher SpMV
performance on GPU for various sparse patterns have been discussed. It has
been shown that a decision-making algorithm is required to suggest the
best-performing SpMV algorithm, especially for applications that use large
sparse matrices with a variety of sparse patterns and make recurrent use of
SpMV. The hereby proposed algorithm performs statistical analysis of the
sparse pattern and provides approximately 97% successful results. This
statistical result supports the use of such a clustering-based heuristic
design for appropriate sparse format selection.
Table 3 Sparse Matrix Collection Used In Experimentation
Matrix NR NC NNZ Sparsity (NNZ / (NR × NC))
3D 51448_3D 51448 514484 1056610 0.00003992
add20 2395 2395 17319 0.00301934
add32 4960 4960 23884 0.00097083
adder_dcop_19 1813 1813 11245 0.00342109
aircraft 3754 7517 20267 0.00071821
airfoil 4253 4253 24578 0.00135880
airfoil_2d 14214 14214 259688 0.00128534
aug3dcqp 35543 35543 128115 0.00010141
bayer01 57735 57735 277774 0.00008333
bcsstk36 23052 23052 1143140 0.00215121
bcsstm38 8032 8033 10485 0.00016251
bfwa782 782 783 7514 0.01227164
bips07_3078_iv 21128 21128 75729 0.00016965
Bloweybl 30003 30003 120000 0.00013331
c64b 51035 51035 717841 0.00027561
coater1 1348 1348 19457 0.01070770
crankseg_2 63838 63838 14148858 0.00347187
crashbasis 160000 160000 1750416 0.00683756
delaunay_n15 32768 32768 196548 0.00018305
epb0 1794 1794 7764 0.00241235
FEM_3D_ thermal1 17880 17880 430740 0.00134735
fpga_trans_01 1220 1220 7382 0.00495969
G2_circuit 150102 150102 726674 0.00003225
gupta1 31802 31802 2164210 0.00213989
Hamrle2 5952 5952 22162 0.00062558
jagmesh2 1009 1009 6865 0.00674308
jagmesh3 1089 1089 7361 0.00620699
lhr07 7337 7337 156508 0.00290737
lung2 109460 109460 492564 0.00004111
net100 29920 29920 2033200 0.00227121
Zd_Jac6 22835 22835 1711983 0.00328320
Table 4 Execution Performance of Various SpMV Algorithms
Matrix Performance (GFLOP/sec)
CSR CSR Vector ELL HYB A_COO
3D 51448_3D 0.56 3.46 0.11 6.45 5.41
add20 0.34 0.5 0.22 0.15 0.15
add32 1.42 0.99 1.06 0.2 1.31
adder_dcop_19 0.1 0.38 0.02 0.09 0.1
aircraft 2.83 0.71 3.33 0.17 4.28
airfoil 2.95 0.41 3.6 0.2 3.86
airfoil_2d 1.17 2.03 10.99 6.64 11.2
aug3dcqp 4.51 0.24 6.58 1.01 4.87
bayer01 3.91 0.28 3.63 1.92 5.57
bcsstk36 0.66 7.95 2.32 6.04 4.81
bcsstm38 1.62 0.3 0.83 0.09 1
bfwa782 0.47 0.95 0.65 0.06 0.76
bips07_3078_iv 0.74 0.04 0.7 0.05 0.94
bloweybl 0.14 0.27 0.01 0.94 0.96
c64b 0.15 0.79 0.05 3.61 3.6
coater1 0.76 2.13 0.77 0.16 0.96
crankseg_2 0.61 8.39 2.28 11.5 6.37
crashbasis 2.09 2.06 16.21 16.18 11.06
delaunay_n15 3.31 1.31 5.7 1.55 4.26
epb0 0.8 0.54 1.07 0.06 1.19
FEM_3D_thermal1 0.87 4.44 14.63 14.59 10.01
fpga_trans_01 0.35 0.88 0.31 0.06 0.06
G2_circuit 6.48 0.52 12.29 4.65 8.8
gupta1 0.3 4.56 0.2 5.64 5.24
Hamrle2 3.26 0.8 4.18 0.19 4.63
jagmesh2 1.17 0.45 1.15 0.05 1.46
jagmesh3 1.25 0.44 1.25 0.06 1.51
lhr07 1.2 1.36 3.76 1.09 4.51
lung2 5.41 0.86 9.14 3.34 9.45
net100 0.65 6.59 7.11 5.78 6.74
Zd_Jac6 1.82 2.93 1.82 4.14 4.31
REFERENCES
[1] Lee, I. Efficient sparse matrix vector multiplication using compressed graph, in
IEEE SoutheastCon 2010 (SoutheastCon), Proceedings of the, March 2010, pp.
328–331.
[2] Wang, H. C. and Hwang, K. Multicoloring for fast sparse matrix-vector
multiplication in solving pde problems, in Parallel Processing, 1993. ICPP 1993.
International Conference on, Vol. 3, Aug 1993, pp. 215–222.
[3] Jamroz, B. and Mullowney, P. Performance of parallel sparse matrix-vector
multiplications in linear solves on multiple gpus, in Application Accelerators in
High Performance Computing (SAAHPC), 2012 Symposium on, July 2012, pp.
149–152.
[4] Hestenes, M. R. and Stiefel, E. Methods of conjugate gradients for solving
linear systems. Journal of Research of the National Bureau of Standards, 49,
1952.
[5] van der Veen, M. Sparse matrix vector multiplication on a field programmable
gate array, September 2007.
[6] Ashany, R. Application of sparse matrix techniques to search, retrieval,
classification and relationship analysis in large data base systems − sparcom, in
Proceedings of the Fourth International Conference on Very Large Data Bases −
Volume 4, VLDB ’78, VLDB Endowment, 1978, pp. 499–516.
[7] Goharian, N., Grossman, D. and El-Ghazawi, T. Enterprise text processing: A
sparse matrix approach, Information Technology: Coding and Computing,
International Conference on, vol. 0, 2001.
[8] Bender, M. A., Brodal, G. S., Fagerberg, R., Jacob, R. and Vicari, E. Optimal
sparse matrix dense vector multiplication in the i/o-model, in Proceedings of the
Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures,
SPAA ’07, ACM, 2007.
[9] Manzini, G. Lower bounds for sparse matrix vector multiplication on hypercubic
networks, Vol. 2, 1998.
[10] Wu, T., Wang, B., Shan, Y., Yan, F., Wang, Y. and Xu, N. Efficient
pagerank and spmv computation on amd gpus, in ICPP, 2010, pp. 81–89.
[11] Gan, Z. and Harrison, R. Calibrating quantum chemistry: A multi-teraflop,
parallel-vector, full-configuration interaction program for the cray-x1, in
Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, Nov
2005
[12] Shah, M. and Patel, V. An efficient sparse matrix multiplication for skewed
matrix on gpu, in High Performance Computing and Communication 2012 IEEE
9th International Conference on Embedded Software and Systems (HPCC-
ICESS), 2012 IEEE 14th International Conference on, June 2012, pp. 1301–1306.
[13] Bell, N. and Garland, M. Implementing sparse matrix-vector multiplication on
throughput-oriented processors, in SC, 2009.
[14] Dziekonski, A., Lamecki, A. and Mrozowski, M. A memory efficient and fast
sparse matrix vector product on a gpu, Progress In Electromagnetics Research,
Vol. 116, 2011, pp. 49–63.
[15] Vazquez, F., Ortega, G., Fernandez, J. and Garzon, E. Improving the performance
of the sparse matrix vector product with gpus, Computer and Information
Technology, International Conference on, vol. 0, 2010.
[16] Pinar, A. and Heath, M. T. Improving performance of sparse matrix-vector
multiplication, in Proceedings of the 1999 ACM/IEEE conference on
Supercomputing (CDROM), Supercomputing ’99, 1999.
[17] Shahnaz, R. and Usman, A. Blocked-based sparse matrix-vector multiplication on
distributed memory parallel computers. Int. Arab J. Inf. Technol., 2011.
[18] Yang, X., Parthasarathy, S. and Sadayappan, P. Fast sparse matrix-vector
multiplication on gpus: Implications for graph mining. CoRR, vol. abs/1103.2405,
2011.
[19] Cao, W., Yao, L., Li, Z., Wang, Y. and Wang, Z. Implementing sparse matrix-
vector multiplication using cuda based on a hybrid sparse matrix format, in
International Conference on Computer Application and System Modeling, 2010.
[20] Choi, J. W., Singh, A. and Vuduc, R. W. Model-driven autotuning of sparse
matrix-vector multiply on gpus, in Proceedings of the 15th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’10,
2010.