International Journal of Advanced Research in Engineering and Technology
(IJARET) Volume 6, Issue 7, Jul 2015, pp. 11-23, Article ID: IJARET_06_07_003
Available online at
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=6&IType=7
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
© IAEME Publication
___________________________________________________________________________
SPARSE STORAGE RECOMMENDATION
SYSTEM FOR SPARSE MATRIX VECTOR
MULTIPLICATION ON GPU
Monika Shah
Department of Computer Science & Engineering, Nirma University
ABSTRACT
Sparse Matrix Vector Multiplication (SpMV), Ax = b, is a well-known kernel
in science, engineering, and the web world. Harnessing the large computing
capabilities of the GPU device, many sparse storage formats have been proposed
to optimize the performance of SpMV on GPU. The Compressed Sparse Row (CSR),
ELLPACK (ELL), Hybrid (HYB), and Aligned COO sparse storage formats are known
for efficient implementation of SpMV on GPU across a wide spectrum of sparse
matrix patterns. Researchers have observed that the performance of SpMV on GPU
for a given matrix A can vary widely depending on the sparse storage format
used. Hence, it has become a great challenge to choose an appropriate storage
format from this collection for a given sparse matrix. To resolve this problem,
this paper proposes an algorithm that recommends a highly suitable storage
format for a given sparse matrix. The system uses simple metrics of the matrix
(such as row length, number of rows, number of columns, and number of non-zero
elements) to analyse the impact of different storage formats on the performance
of SpMV. To demonstrate the influence of this algorithm, the performance of
SpMV and of its associated application, the Conjugate Gradient Solver (CGS),
has been compared over various sparse matrix patterns with various sparse
formats.
Key words: Sparse Matrix, SpMV, Sparse Format, Heuristics, K-means
Clustering, Load Balance
Cite this Article: Shah, M. Sparse Storage Recommendation System for
Sparse Matrix Vector Multiplication on GPU. International Journal of
Advanced Research in Engineering and Technology, 6(7), 2015, pp. 11-23.
http://www.iaeme.com/currentissue.asp?JType=IJARET&VType=6&IType=7
_____________________________________________________________________
1. INTRODUCTION
For many years now, Sparse Matrix Vector Multiplication (SpMV) has been the
most prominent computational dwarf for science and engineering applications.
Linear algebra solvers (such as partial differential equations [1], [2], the
conjugate gradient solver [3], [4], and Gaussian reduction of complex
matrices), fluid dynamics [5], database query processing on large databases
(LDB) [6], information retrieval [7], network theory [8], [9], PageRank
computation [10], and the physics of disordered and quantum systems [11] are
well-known applications that make recurrent use of SpMV. The sparse matrices
used in these applications vary widely in non-zero pattern.
The continuous growth in the number of computer users, and their increasing
usage, constantly enlarges the datasets used in such applications. This
continuous and exponential growth of datasets has raised the need for High
Performance Computing. Researchers have provided many solutions through
inventions in high-performance device architectures, such as the Graphical
Processing Unit (GPU), and through algorithms optimized for these devices. The
GPU is a well-known, promising high-performance device for regular
applications. Hence, it is a great challenge to use the GPU for an irregular
application like SpMV.
Generalized implementation of parallel SpMV is complicated by the following
properties of sparse matrices:
1. Imbalanced number of nonzero elements in each row
2. Imbalanced number of nonzero elements in each column
3. Wide range of sparse patterns (diagonal, skewed, power-law distribution of
non-zero elements per row, almost equal number of non-zero elements per row,
block, etc.)
4. Varied sparsity level (ratio of nonzero elements to the size of the matrix)
For an efficient and generalized implementation of SpMV on GPU, two important
factors have been identified by past research [12]: (i) synchronization-free
load distribution among computational resources, and (ii) reduced fetch
operations, to mitigate the cost of high-latency memory access on the GPU.
Hence, it is preferable to select a sparse storage format that supports high
compression along with good synchronization-free load distribution. The major
challenges in satisfying these factors are:
1. Continuous growth in datasets makes sparse matrices very large.
2. Indirection used in the storage representation of a sparse matrix increases
the size of the data to be transferred from CPU to GPU as an additional
overhead.
3. A large class of sparse matrix patterns exists.
4. Work distribution is difficult to balance due to the imbalanced number of
nonzero elements in each row as well as in each column.
5. Concurrency is restricted by data dependency among row elements when
computing the output vector.
Harnessing the high computing capabilities of the GPU, the unceasing
performance demand of the SpMV kernel motivates researchers to optimize SpMV
on GPU in a way that deals with all the challenges listed above. In past
research, Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse
Column (CSC), ELLPACK (ELL), Hybrid (HYB), and Jagged Diagonal Storage (JDS)
have been proposed with different compression strategies [13], together with
SpMV algorithms for these sparse storage formats on GPU. The bulky index
structure of the COO format reduces the degree of synchronization-free load
distribution among parallel threads and increases communication overhead
between CPU and GPU. CSC schedules all columns of a sparse matrix sequentially
in SpMV, and the vector b is loaded and stored frequently in each iteration,
causing recurrent communication overhead that limits the performance of CSC on
GPU. These factors explain the limited popularity of the COO and CSC sparse
formats on GPU. Aligned Coordinate (Aligned COO) [12] was introduced as a
compressed format suitable for synchronization-free balanced load distribution
and proper cache utilization.
Sparse matrix metrics such as Number of Rows (NR), Number of Columns (NC),
Number of Non-Zero elements (NNZ), non-zero elements in a row (row_len), and
non-zero elements in a column (col_len) play an important role in the
compression ratio and the degree of parallelism of the various sparse formats.
An important point here is that the compression ratio of these recognized
sparse storage formats varies with the sparsity level and sparse pattern of
the input matrix. Considering these factors, this paper proposes an algorithm
to recommend a highly suitable storage format for a given sparse matrix. The
remainder of the paper is structured as follows: the course of optimizing
sparse formats and their SpMV implementations is traced in Section II. Section
III presents our attempt to define heuristics and an algorithm that recommends
a highly suitable storage format for implementing SpMV on GPU. A parallel
algorithm of CGS is discussed in Section III-D. Section IV demonstrates and
analyses the results of this proposed work. The conclusion of the paper is
given in Section V.
2. SPARSE STORAGE FORMATS
Many storage formats have been proposed as a result of past effort by
researchers. As mentioned in Section I, compressed storage,
synchronization-free load distribution, and the highest possible concurrency
have become the main goals in designing sparse matrix formats for the NVIDIA
GPU and the CUDA programming environment. Bell et al. [13] introduced the
storage formats COO, CSR, CSC, ELL, and HYB, supporting different levels of
compression for different sparse matrix patterns. Shah et al. [12] introduced
Aligned COO. Many other extensions of these benchmark sparse formats [14],
[15], [16], [17], as well as hybrids of these storage formats [18], [19],
[20], have also been proposed. The tragedy, even after research on this large
set of sparse formats, is that no standard format is suitable for almost all
classes of sparse matrix patterns. In addition, it is also difficult to
identify the sparse matrix format supporting the best compression as well as
synchronization-free and balanced work-load distribution.
Table 1 Sparse Matrix Formats and Their Space Complexity

Sparse matrix format    Space complexity
COO                     NNZ × 3
CSC                     NNZ × 2 + (NC + 1)
CSR                     NNZ × 2 + (NR + 1)
ELL                     (NR × max_row_length) × 2
HYB                     ≅ ELL for rows with similar length;
                        ≅ COO for the rest of the row elements
Aligned COO             Num_segments × Segment_length × 3
                        ≅ (max_row_length × (≤ NR) × 3)
Selection of a proper data compression strategy is important for two major
reasons: (i) the data transfer overhead between CPU and GPU, and (ii) the
memory access pattern of each concurrent thread depends on the data structure.
Table 1 presents the memory space required by the various sparse formats. It
shows that the compression percentage of the same format varies from one
sparse matrix to another, based on the basic statistics of the matrix. For
example, COO provides the highest compression for small and highly sparse
matrices; CSC and CSR give better compression for matrices that are small in
terms of columns and rows, respectively; ELL is suitable for compressing a
sparse matrix with little difference in NNZ between rows and a large number of
rows. COO, CSC, CSR, and ELL are known as the core sparse storage formats
designed to support higher compression. HYB is designed to reduce the padding
space of the ELL format and offers better compression in the form of a hybrid
pattern of ELL and COO. Aligned COO provides better compression compared to
ELL for a highly skewed sparse matrix with power-law distribution.
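To make the comparison in Table 1 concrete, the following sketch (not from the
paper; SciPy is assumed only to build a test matrix, and the function name is
ours) evaluates the entry counts of each format for an arbitrary matrix:

```python
# A minimal sketch (not from the paper) that evaluates the space formulas of
# Table 1 for a given matrix.
import numpy as np
from scipy.sparse import random as sparse_random

def storage_entries(nr, nc, nnz, row_len):
    """Entry counts per Table 1 (HYB depends on the ELL/COO split, so omitted)."""
    max_row_len = int(row_len.max())
    return {
        "COO": nnz * 3,
        "CSC": nnz * 2 + (nc + 1),
        "CSR": nnz * 2 + (nr + 1),
        "ELL": nr * max_row_len * 2,
        "AlignedCOO_bound": max_row_len * nr * 3,  # Table 1 upper bound
    }

A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)
row_len = np.diff(A.indptr)  # nonzeros per row
print(storage_entries(A.shape[0], A.shape[1], A.nnz, row_len))
```

On a matrix with one very long row, the ELL and Aligned COO figures balloon
while CSR stays proportional to NNZ, which is exactly the trade-off the text
describes.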
Table 2 SpMV Algorithms and Their Time Complexity

SpMV           Time complexity (excluding memory access overhead)
COO_flat       $O\left(\lceil NNZ / max\_concurrent\_threads \rceil\right)$
CSC            $O\left(\lceil max\_col\_len / max\_concurrent\_threads \rceil \times NC\right)$
CSR            $\le O\left(\lceil NR / max\_concurrent\_threads \rceil \times max\_row\_len\right)$
CSR (vector)   $\le O\left(\lceil NR / max\_warps \rceil \times \left(max\_warps\_per\_row + \log_2 warp\_size\right)\right)$
               where $max\_warps = \lceil max\_concurrent\_threads / warp\_size \rceil$ and
               $max\_warps\_per\_row = \lceil max\_row\_len / warp\_size \rceil$
ELL            $\ge O\left(\lceil NR / max\_concurrent\_threads \rceil \times max\_row\_len\right)$
HYB            ≅ ELL for rows with similar length,
               + ≅ COO_flat for the rest of the row elements
Aligned_COO    ≅ CSR for aligned rows,
               + ≅ COO_flat for the rest of the row elements
Increased concurrency and synchronization-free load distribution are important
factors in reducing the runtime of parallel SpMV on GPU. Table 2 presents the
run-time complexity of the SpMV implementations of the sparse storage formats
listed above. The COO_flat algorithm offers the highest concurrency, but does
not ensure synchronization-free load distribution among concurrent threads,
because row elements may cross warp boundaries. CSC is also less preferred,
due to the additional overhead of accessing the output vector in every
iteration. ELL, on the other hand, has the overhead of transferring extraneous
memory containing zero padding over high-latency memory. CSR and ELL have very
similar SpMV algorithms, except for the additional memory accesses CSR
performs to fetch row indices. The CSR implementation on GPU is more efficient
than ELL where the NNZ to be accessed by one thread block and iteration is
much larger than in another block or iteration. CSR Vector provides much
higher concurrency than CSR and ELL, but has the overhead of a series of
parallel reduction steps in each thread. Hence, CSR Vector is not suitable
when the average NNZ per row is less than the number of steps required by the
parallel reduction, that is, log2(warp size). The HYB and Aligned COO kernels
are designed to make SpMV efficient using hybrids of the sparse formats and
kernels mentioned above. Aligned COO reorders nonzero elements to balance the
workload among computing resources, and thus reduces the number of row
segments compared to the number of rows in the ELL format while keeping the
maximum row length the same as the original. Hence, Aligned COO gives
optimized performance for highly skewed sparse matrix patterns.
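As an illustration of how the Table 2 bounds trade off, the sketch below (an
assumption-laden estimate, not the paper's code; the default
max_concurrent_threads is a Fermi-class figure) compares the step counts of
scalar CSR and CSR Vector:

```python
# An illustrative estimate of the Table 2 step counts for scalar CSR vs.
# CSR Vector; device limits default to assumed Fermi-class values.
import math

def csr_scalar_steps(nr, max_row_len, max_concurrent_threads=14 * 1536):
    # One thread per row: ceil(NR / threads) rounds of max_row_len work each.
    return math.ceil(nr / max_concurrent_threads) * max_row_len

def csr_vector_steps(nr, max_row_len, max_concurrent_threads=14 * 1536,
                     warp_size=32):
    # One warp per row, plus a log2(warp_size)-step parallel reduction per row.
    max_warps = max_concurrent_threads // warp_size
    max_warps_per_row = math.ceil(max_row_len / warp_size)
    return math.ceil(nr / max_warps) * (max_warps_per_row
                                        + int(math.log2(warp_size)))

# Short rows favour scalar CSR; longer rows on a smaller matrix favour Vector.
print(csr_scalar_steps(100_000, 8), csr_vector_steps(100_000, 8))
print(csr_scalar_steps(10_000, 512), csr_vector_steps(10_000, 512))
```

The crossover matches the observation above: a warp-per-row kernel only pays
off once rows hold clearly more than log2(warp size) nonzeros.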
3. PROPOSED WORK
Section II discussed the strengths and weaknesses of various sparse matrix
storage formats and their SpMV implementations. It indicates that the
selection of the sparse storage format is an important factor for efficient
SpMV on GPU. The collection of SpMV algorithms JAD, CSR, ELL, CSR Vector, HYB,
and Aligned COO covers a wide spectrum of sparse matrix patterns with good
performance. Recognizing the sparse matrix pattern is a great challenge.
Statistical analysis is considered a good methodology for recognizing sparse
patterns. The diagonal pattern is simple to recognize, and JAD is recommended
for sparse matrices with a diagonal pattern. This section proposes a strategy
to suggest the most appropriate SpMV implementation for all sparse patterns
except diagonal.
The working flow of the proposed work is described in Figure 1. Here, K-means
clustering is used to generate detailed statistics from the basic matrix
statistics NR, NC, NNZ, and the row length vector rl[]. These derived
statistics are analysed and compared with pre-defined heuristics to suggest
the most appropriate SpMV algorithm. Section III-A explains the input and
output parameters of the K-means clustering algorithm. Section III-B defines
heuristics for the CSR, ELL, CSR Vector, HYB, and Aligned COO SpMV algorithms.
Figure 1 Working flow of Heuristic based Selection of SpMV algorithm
A detailed description of the heuristics-based SpMV selection algorithm is
given in Section III-C. To prove the effectiveness of the proposed algorithm,
a well-known SpMV application, CGS, is implemented on GPU as shown in Section
III-D.
3.1. K-Means Clustering and Its Parameters
Here, K-means clustering is used to identify the level of similarity among
sparse matrix rows using the row-length parameter. The K-means clustering
algorithm constructs 2 clusters based on the row length vector rl[]. For a
highly skewed sparse matrix, the centroid of a cluster is not sufficient to
predict the similarity of row lengths. Hence, the K-means clustering algorithm
is slightly modified to also identify the Lower Bound (LB), Upper Bound (UB),
Number of elements (CNT), and Centroid (C) of both cluster bins. The clusters
are named cluster H and cluster L based on their centroid values, i.e.
C_L < C_H. Accordingly, the output parameters of the K-means clustering
algorithm are named LB_H, UB_H, CNT_H, C_H, LB_L, UB_L, CNT_L, and C_L, as
shown in Figure 2.
Figure 2 K-means clustering for this proposed work
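A minimal sketch of this modified two-bin K-means follows; it is an
illustrative reimplementation (the function name and the min/max centroid
seeding are our assumptions, not the paper's code) that returns the eight
output parameters of Figure 2:

```python
# An illustrative reimplementation of the modified two-bin K-means of
# Section 3.1. Assumes at least two distinct row lengths so both bins are
# non-empty.
import numpy as np

def kmeans_row_length_stats(rl, iters=100):
    rl = np.asarray(rl, dtype=np.float64)
    c_lo, c_hi = rl.min(), rl.max()        # seed centroids at the extremes
    for _ in range(iters):
        in_hi = np.abs(rl - c_hi) < np.abs(rl - c_lo)
        new_lo, new_hi = rl[~in_hi].mean(), rl[in_hi].mean()
        if new_lo == c_lo and new_hi == c_hi:
            break                          # converged
        c_lo, c_hi = new_lo, new_hi
    lo, hi = rl[~in_hi], rl[in_hi]
    return {"LB_L": lo.min(), "UB_L": lo.max(), "CNT_L": lo.size, "C_L": c_lo,
            "LB_H": hi.min(), "UB_H": hi.max(), "CNT_H": hi.size, "C_H": c_hi}

# A skewed row-length vector: the bounds separate short and long rows.
print(kmeans_row_length_stats([1, 2, 2, 3, 90, 100, 110]))
```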
3.2. Heuristics
Based on empirical result analysis and a basic understanding of the various
SpMV algorithms, heuristics are defined to suggest a suitable sparse storage
format and a GPU-based SpMV algorithm capable of giving better performance for
a given sparse matrix. The following points are the focus in the design of
these heuristics:
1. Obtain the highest possible degree of concurrency
2. Better compression of the sparse matrix, to reduce memory access cost
3. Balanced work-load distribution among threads
4. Synchronization-free load distribution as far as possible
5. A reduced number of blocks, to reduce block scheduling cost
3.3. Heuristics for CSR Vector
CSR Vector is designed to provide the highest possible concurrency with
synchronization-free load distribution, which in turn ensures good accuracy.
Every execution thread of this SpMV algorithm executes at least 1
multiplication and log2(W) addition operations. For the execution of CSR
Vector, a warp W (a collection of execution threads, 32 threads in general) is
allotted to each row of the sparse matrix for computation. The CSR storage
format is used to implement this SpMV algorithm. But CSR is preferred for
small matrix sizes, to avoid a large number of high-latency memory accesses
for fetching row indices. Considering all these criteria, CSR Vector is
preferred when the following condition is satisfied:
(C_L ≥ log2(W)) AND (C_H ≥ W/2) AND (NNZ ≤ max_threads)
3.4. Heuristics for CSR
CSR SpMV is preferred for a small matrix where no row has a very large number
of non-zero elements and the majority of rows do not have equivalent sizes in
terms of non-zero elements. Hence, CSR SpMV is preferred when CSR Vector is
not applicable and its size-based condition is satisfied.
3.5. Heuristics for ELL
The ELL storage format and the ELL SpMV algorithm are preferred for a sparse
matrix with equivalent row lengths, which reduces the padding overhead and
improves performance. However, a large row length reduces the degree of
concurrency in ELL SpMV. ELL is preferred when there is not much difference
either between the centroid values of the two clusters or between the upper
bound of the higher-value cluster and the centroid of the cluster with the
lower centroid value. Hence, it is concluded that ELLPACK is preferred when
CSR or CSR Vector is not applicable for the given sparse matrix and the
following condition is satisfied:
(((C_H − C_L) / C_H ≤ 0.49) or ((C_H − C_L) ≤ 6))
3.6. Heuristics for HYB
When a large sparse matrix does not have equivalent row lengths, but rather a
power-law distribution of non-zero elements among the rows of the matrix with
a highly skewed visualization, the Hybrid sparse format and its SpMV are
preferred. Hence, it is concluded that HYB is preferred when the CSR and ELL
sparse formats are not suitable for the given sparse matrix and the following
condition is satisfied:
(C_H / C_L ≥ 100) or (UB_H / LB_H ≥ 100) or (NNZ / NR ≥ 100)
3.7. Heuristics for Aligned COO
The Aligned COO format and its SpMV are designed to optimize performance for a
large sparse matrix having a skewed distribution of non-zero elements, along
with a possible alignment of large rows with small rows such that the required
number of execution units can be reduced. But, as it is based on the COO
format, it provides less compression compared to HYB. Hence, Aligned COO is
preferred when neither the CSR nor ELL nor HYB formats are suitable for the
given sparse matrix.
3.8. Heuristics-Based Sparse Format Recommendation
This section describes an algorithm to suggest the most suitable sparse format
and its associated SpMV for a given sparse matrix. It performs K-means
clustering on the sparse matrix metrics and compares the output parameters
with the heuristics defined in the sections above.
Algorithm 1 Heuristics based Sparse format recommendation
Input: NNZ, NR, NC, rl[ ]
Output: Suitable_SpMV
Perform K-means clustering with two bins on rl[ ]
Test the heuristics of Sections 3.3 to 3.7 in order (CSR Vector, CSR, ELL,
HYB, Aligned COO) and return the first SpMV whose condition is satisfied
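A hedged sketch of this cascade follows. The CSR Vector test and the ELL/HYB
thresholds (0.49, 6, 100) are reconstructed from partially legible conditions
in the source, and the CSR test is a stand-in assumption, so treat the
constants as illustrative rather than as the paper's exact rules:

```python
# A sketch of Algorithm 1's decision cascade; thresholds are reconstructed or
# assumed, not verbatim from the paper.
import math

WARP = 32
MAX_THREADS = 14 * 1536  # assumed concurrent-thread limit (Fermi C2070-class)

def recommend_spmv(nnz, nr, nc, s):
    """s: cluster-statistics dict from kmeans_row_length_stats(); nc is kept
    only to mirror Algorithm 1's inputs."""
    if s["C_L"] >= math.log2(WARP) and s["C_H"] >= WARP / 2 \
            and nnz <= MAX_THREADS:
        return "CSR Vector"                          # Section 3.3
    if nr <= MAX_THREADS and s["UB_H"] < WARP:       # Section 3.4 (assumed form)
        return "CSR"
    if (s["C_H"] - s["C_L"]) / s["C_H"] <= 0.49 or (s["C_H"] - s["C_L"]) <= 6:
        return "ELL"                                 # Section 3.5
    if s["C_H"] / max(s["C_L"], 1) >= 100 or nnz / nr >= 100:
        return "HYB"                                 # Section 3.6
    return "Aligned COO"                             # Section 3.7: fallback
```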
3.9. Parallel CGS
To demonstrate the effectiveness of the hereby proposed heuristics-based
sparse format recommendation algorithm for efficient SpMV, it is preferable to
test the algorithm on an application that makes frequent and heavy use of the
SpMV kernel and that is applicable to a wide category of sparse patterns. The
Conjugate Gradient Solver (CGS) is such a well-known application: it finds the
solution vector x for Ax = b.
Every CGS call invokes the SpMV kernel in a loop of hundreds to thousands of
iterations. As the GPU has a major overhead of memory transfer between CPU and
GPU, this parallel CGS is designed in such a way that it needs to transfer the
sparse matrix A and the input vector b at the time of the first iteration
only. The GPU-based parallel CGS is described in Algorithm 2, where the SpMV
kernel is executed for the specified number of iterations, bounded by the size
of the sparse matrix.
4. EXPERIMENTAL RESULTS
For a proper evaluation of the proposed algorithm, various SpMV algorithms
(CSR, CSR Vector, ELL, and HYB SpMV from the NVIDIA cusp library, and the
Aligned COO algorithm) have been implemented on an NVIDIA GPU.
The collection of sparse matrices used in this experiment is listed along with
their basic properties in Table 3.
4.1. Test Platform
These experiments have been executed on an Intel(R) Core(TM) i3 CPU @ 3.20 GHz
with 4 GB RAM, 2 × 256 KB L2 cache, and 4 MB L3 cache, and an NVIDIA C2070 GPU
device, using CUDA version 4.0 on Ubuntu 11.
The dataset contains 31 sparse matrices, retrieved from the well-known source
The University of Florida Sparse Matrix Collection. The sparse matrices are
selected such that the collection contains matrices with various sparsity
levels and a large variety of sparse patterns.
4.2. Result Analysis
Table 4 lists the performance of the CSR, CSR Vector, ELL, HYB, and Aligned
COO algorithms in GFLOP/s for each matrix listed in Table 3. The
heuristics-based SpMV selection algorithm has been implemented and its output
compared with the performance results recorded in Table 4. The overhead of
memory transfer between CPU and GPU is always an important factor in overall
performance; this cost is considered to be amortized over a large number of
iterations. The parallel CGS algorithm listed in Algorithm 2 has also been
implemented, for 200 iterations, and its execution time including memory
transfer time has been compared with the result of our proposed algorithm. The
recommendation of the proposed algorithm is correct for 30 of the 31 sparse
matrices in the dataset.
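For reference, GFLOP/s figures such as those in Table 4 are conventionally
derived by counting one multiply and one add per nonzero element per SpMV
invocation; the paper does not spell out its accounting, so the sketch below
is an assumption:

```python
# Conventional SpMV flop accounting (an assumption, not stated in the paper):
# 2 * NNZ floating-point operations per SpMV call.
def spmv_gflops(nnz, iterations, elapsed_seconds):
    return 2.0 * nnz * iterations / (elapsed_seconds * 1e9)

# e.g. 200 SpMV calls on a matrix with ~1.06M nonzeros finishing in 65 ms
print(spmv_gflops(nnz=1_056_610, iterations=200, elapsed_seconds=0.065))
```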
Algorithm 2 Parallel CGS
Input: Sparse Matrix A, Vector b, NR, NC, iterations
Output: Vector x
Initialize vector dev_x = 0
Copy vectors from host memory to device memory (b → dev_r, and b → dev_p)
Copy matrix from host memory to device memory (A → dev_A)
Compute dev_rsold = dev_r^T × dev_r using
  dev_inner_product(dev_r, dev_r, NR, dev_rsold)
for i = 1 → min(iterations, NR × NC) do
  Initialize vector dev_Ap = 0
  Perform Ap = A × p using dev_SpMV(dev_A, dev_p, dev_Ap)
  Perform pAp = p^T × Ap using dev_inner_product(dev_p, dev_Ap, NR, dev_pAp)
  dev_alpha = dev_rsold / dev_pAp
  Asynchronous computation of dev_x and dev_r:
    Perform x += alpha × p using dev_add_scalarMul(dev_alpha, dev_p, dev_x, 1, dev_x)
    Perform r −= alpha × Ap using dev_add_scalarMul(dev_alpha, dev_Ap, dev_r, −1, dev_r)
  Compute dev_rsnew = dev_r^T × dev_r using
    dev_inner_product(dev_r, dev_r, NR, dev_rsnew)
  Copy device rsnew (dev_rsnew) to host rsnew
  if √rsnew < 1e−10 then
    Exit for
  end if
  Compute p = r + ((rsnew/rsold) × p) using
    dev_temp = dev_rsnew / dev_rsold
    dev_add_scalarMul(dev_temp, dev_p, dev_r, 1, dev_p)
  dev_rsold = dev_rsnew
end for
Copy device vector to host vector (dev_x → x)
Return x
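The following host-side NumPy/SciPy sketch mirrors Algorithm 2 step by step.
It replaces the device kernels with library calls, and the 1e-10 tolerance
follows the classic conjugate gradient recipe (the threshold is garbled in the
source), so it is an illustrative stand-in rather than the paper's CUDA code:

```python
# A host-side sketch of Algorithm 2; each line maps onto one device step.
import numpy as np
from scipy.sparse import identity, random as sparse_random

def conjugate_gradient(A, b, iterations, tol=1e-10):
    x = np.zeros_like(b)
    r = b.copy()                       # b -> dev_r
    p = b.copy()                       # b -> dev_p
    rsold = r @ r                      # dev_inner_product(dev_r, dev_r, ...)
    for _ in range(min(iterations, A.shape[0] * A.shape[1])):
        Ap = A @ p                     # dev_SpMV: the kernel under test
        alpha = rsold / (p @ Ap)       # dev_alpha = rsold / pAp
        x += alpha * p                 # dev_add_scalarMul(..., +1, dev_x)
        r -= alpha * Ap                # dev_add_scalarMul(..., -1, dev_r)
        rsnew = r @ r
        if np.sqrt(rsnew) < tol:       # host-side convergence check
            break
        p = r + (rsnew / rsold) * p
        rsold = rsnew
    return x

# Illustrative usage on a synthetic symmetric positive definite system.
n = 500
M = sparse_random(n, n, density=0.02, format="csr", random_state=0)
A = M @ M.T + 10.0 * identity(n, format="csr")
b = np.ones(n)
x = conjugate_gradient(A, b, iterations=200)
print(np.linalg.norm(A @ x - b))       # residual should be near zero
```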
5. CONCLUSION
In this paper, various factors responsible for achieving higher SpMV
performance on GPU for various sparse patterns have been discussed. It has
been shown that a decision-making algorithm is required to suggest the
best-performing SpMV algorithm, especially for applications that use large
sparse matrices with a variety of sparse patterns and make recurrent use of
SpMV. The hereby proposed algorithm performs statistical analysis of the
sparse pattern and provides approximately 97% successful results. This
statistical result supports the use of such a clustering-based heuristic
design for appropriate sparse format selection.
Table 3 Sparse Matrix Collection Used In Experimentation
Matrix NR NC NNZ Sparsity (NNZ / (NR × NC))
3D 51448_3D 51448 514484 1056610 0.00003992
add20 2395 2395 17319 0.00301934
add32 4960 4960 23884 0.00097083
adder_dcop_19 1813 1813 11245 0.00342109
aircraft 3754 7517 20267 0.00071821
airfoil 4253 4253 24578 0.00135880
airfoil_2d 14214 14214 259688 0.00128534
aug3dcqp 35543 35543 128115 0.00010141
bayer01 57735 57735 277774 0.00008333
bcsstk36 23052 23052 1143140 0.00215121
bcsstm38 8032 8033 10485 0.00016251
bfwa782 782 783 7514 0.01227164
bips07_3078_iv 21128 21128 75729 0.00016965
Bloweybl 30003 30003 120000 0.00013331
c64b 51035 51035 717841 0.00027561
coater1 1348 1348 19457 0.01070770
crankseg_2 63838 63838 14148858 0.00347187
crashbasis 160000 160000 1750416 0.00683756
delaunay_n15 32768 32768 196548 0.00018305
epb0 1794 1794 7764 0.00241235
FEM_3D_ thermal1 17880 17880 430740 0.00134735
fpga_trans_01 1220 1220 7382 0.00495969
G2_circuit 150102 150102 726674 0.00003225
gupta1 31802 31802 2164210 0.00213989
Hamrle2 5952 5952 22162 0.00062558
jagmesh2 1009 1009 6865 0.00674308
jagmesh3 1089 1089 7361 0.00620699
lhr07 7337 7337 156508 0.00290737
lung2 109460 109460 492564 0.00004111
net100 29920 29920 2033200 0.00227121
Zd_Jac6 22835 22835 1711983 0.00328320
Table 4 Execution Performance of Various SpMV Algorithms
Matrix Performance (GFLOP/sec)
CSR CSR Vector ELL HYB A_COO
3D 51448_3D 0.56 3.46 0.11 6.45 5.41
add20 0.34 0.5 0.22 0.15 0.15
add32 1.42 0.99 1.06 0.2 1.31
adder_dcop_19 0.1 0.38 0.02 0.09 0.1
aircraft 2.83 0.71 3.33 0.17 4.28
airfoil 2.95 0.41 3.6 0.2 3.86
airfoil_2d 1.17 2.03 10.99 6.64 11.2
aug3dcqp 4.51 0.24 6.58 1.01 4.87
bayer01 3.91 0.28 3.63 1.92 5.57
bcsstk36 0.66 7.95 2.32 6.04 4.81
bcsstm38 1.62 0.3 0.83 0.09 1
bfwa782 0.47 0.95 0.65 0.06 0.76
bips07_3078_iv 0.74 0.04 0.7 0.05 0.94
bloweybl 0.14 0.27 0.01 0.94 0.96
c64b 0.15 0.79 0.05 3.61 3.6
coater1 0.76 2.13 0.77 0.16 0.96
crankseg_2 0.61 8.39 2.28 11.5 6.37
crashbasis 2.09 2.06 16.21 16.18 11.06
delaunay_n15 3.31 1.31 5.7 1.55 4.26
epb0 0.8 0.54 1.07 0.06 1.19
FEM_3D_thermal1 0.87 4.44 14.63 14.59 10.01
fpga_trans_01 0.35 0.88 0.31 0.06 0.06
G2_circuit 6.48 0.52 12.29 4.65 8.8
gupta1 0.3 4.56 0.2 5.64 5.24
Hamrle2 3.26 0.8 4.18 0.19 4.63
jagmesh2 1.17 0.45 1.15 0.05 1.46
jagmesh3 1.25 0.44 1.25 0.06 1.51
lhr07 1.2 1.36 3.76 1.09 4.51
lung2 5.41 0.86 9.14 3.34 9.45
net100 0.65 6.59 7.11 5.78 6.74
Zd_Jac6 1.82 2.93 1.82 4.14 4.31
REFERENCES
[1] Lee, I. Efficient sparse matrix vector multiplication using compressed graph, in
IEEE SoutheastCon 2010 (SoutheastCon), Proceedings of the, March 2010, pp.
328–331.
[2] Wang, H. C. and Hwang, K. Multicoloring for fast sparse matrix-vector
multiplication in solving pde problems, in Parallel Processing, 1993. ICPP 1993.
International Conference on, Vol. 3, Aug 1993, pp. 215–222.
[3] Jamroz, B. and Mullowney, P. Performance of parallel sparse matrix-vector
multiplications in linear solves on multiple gpus, in Application Accelerators in
High Performance Computing (SAAHPC), 2012 Symposium on, July 2012, pp.
149–152.
[4] Hestenes, M. R. and Stiefel, E. Methods of conjugate gradients for solving
linear systems. Journal of Research of the National Bureau of Standards, 49,
1952.
[5] van der Veen, M. Sparse matrix vector multiplication on a field programmable
gate array, September 2007.
[6] Ashany, R. Application of sparse matrix techniques to search, retrieval,
classification and relationship analysis in large data base systems − sparcom, in
Proceedings of the Fourth International Conference on Very Large Data Bases −
Volume 4, VLDB ’78, VLDB Endowment, 1978, pp. 499–516.
[7] Goharian, N., Grossman, D. and El-Ghazawi, T. Enterprise text processing: A
sparse matrix approach, Information Technology: Coding and Computing,
International Conference on, vol. 0, 2001.
[8] Bender, M. A., Brodal, G. S., Fagerberg, R., Jacob, R. and Vicari, E. Optimal
sparse matrix dense vector multiplication in the i/o-model, in Proceedings of the
Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures,
SPAA ’07, ACM, 2007.
[9] Manzini, G. Lower bounds for sparse matrix vector multiplication on hypercubic
networks, Vol. 2, 1998.
[10] Wu, T., Wang, B., Shan, Y., Yan, F., Wang, Y. and Xu, N. Efficient
pagerank and spmv computation on amd gpus, in ICPP, 2010, pp. 81–89.
[11] Gan, Z. and Harrison, R. Calibrating quantum chemistry: A multi-teraflop,
parallel-vector, full-configuration interaction program for the cray-x1, in
Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, Nov
2005
[12] Shah, M. and Patel, V. An efficient sparse matrix multiplication for skewed
matrix on gpu, in High Performance Computing and Communication 2012 IEEE
9th International Conference on Embedded Software and Systems (HPCC-
ICESS), 2012 IEEE 14th International Conference on, June 2012, pp. 1301–1306.
[13] Bell, N. and Garland, M. Implementing sparse matrix-vector multiplication on
throughput-oriented processors, in SC, 2009.
[14] Dziekonski, A., Lamecki, A. and Mrozowski, M. A memory efficient and fast
sparse matrix vector product on a gpu, Progress In Electromagnetics Research,
Vol. 116, 2011, pp. 49–63.
[15] Vazquez, F., Ortega, G., Fernandez, J. and Garzon, E. Improving the performance
of the sparse matrix vector product with gpus, Computer and Information
Technology, International Conference on, vol. 0, 2010.
[16] Pinar, A. and Heath, M. T. Improving performance of sparse matrix-vector
multiplication, in Proceedings of the 1999 ACM/IEEE conference on
Supercomputing (CDROM), Supercomputing ’99, 1999.
[17] Shahnaz, R. and Usman, A. Blocked-based sparse matrix-vector multiplication on
distributed memory parallel computers. Int. Arab J. Inf. Technol., 2011.
[18] Yang, X., Parthasarathy, S. and Sadayappan, P. Fast sparse matrix-vector
multiplication on gpus: Implications for graph mining. CoRR, vol. abs/1103.2405,
2011.
[19] Cao, W., Yao, L., Li, Z., Wang, Y. and Wang, Z. Implementing sparse matrix-
vector multiplication using cuda based on a hybrid sparse matrix format, in
International Conference on Computer Application and System Modeling, 2010.
[20] Choi, J. W., Singh, A. and Vuduc, R. W. Model-driven autotuning of sparse
matrix-vector multiply on gpus, in Proceedings of the 15th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’10,
2010.