Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University DAC 2012

TRANSCRIPT

Page 1: Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering,

Sparse LU Factorization for Parallel Circuit Simulation on GPU

Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang

Department of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University

DAC 2012

Page 2:

Outline

• Introduction
• Preliminaries
• Sparse LU factorization on GPU
• Experimental results
• Conclusions

Page 3:

Introduction

• Flowchart of a SPICE simulator


Page 4:

Introduction (cont.)

• SPICE takes several days or even weeks to simulate modern designs.
  – The sparse matrix solver based on LU factorization is performed iteratively and is hence time-consuming.

• However, it is difficult to parallelize the sparse solver because of the high data dependency during numeric LU factorization and the irregular structure of circuit matrices.

Page 5:

Introduction (cont.)

• Previous works focus on dense matrices.
  – Factorizing a sparse matrix with a highly parallelized dense solver is still much slower than a sequential sparse solver.

• [8]~[13] compute dense blocks on GPU while the rest is done on CPU.
  – Still the dense idea.

• [15]~[17] apply the G/P left-looking algorithm on FPGA.
  – Scalability is limited by FPGA on-chip resources.

• [18] implements it on multi-core CPUs.
  – Scalability is limited by the number of cores.

Page 6:

Introduction (cont.)

• The multi/many-core era has come.
• Graphics Processing Units (GPUs) can now be used for general-purpose computing.
  – They have become popular in parallel processing for their cost-effectiveness.
• State-of-the-art GPUs provide a possible solution to the limited scalability.
• For example, the latest NVIDIA GeForce GTX 690 has a large number of cores and a large memory.

Page 7:

GeForce GTX 690 official spec


Page 8:

Contributions

• Exposing more parallelism for many-core architecture.

• Ensuring timing order on GPU.

• Optimizing memory access pattern.

Page 9:

Preliminaries

• Sparse matrix
• LU factorization (decomposition)
• GPU architecture and CUDA

Page 10:

Sparse matrix

• The number of nonzero elements in a sparse matrix is on the order of O(n), where n is the matrix dimension.

• An example of adjacency matrix.


[0 0 0 0 1 0 0 0
 0 0 0 0 1 0 0 0
 0 0 0 0 0 1 0 0
 0 0 0 0 0 1 0 0
 1 1 0 0 0 0 1 0
 0 0 1 1 0 0 1 0
 0 0 0 0 1 1 0 1
 0 0 0 0 0 0 1 0]

Page 11:

LU factorization

• Also called LU decomposition.
• A = LU, where A is a square matrix, L is a lower-triangular matrix, and U is an upper-triangular matrix.
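The decomposition can be sketched in a few lines of Python. This is a minimal dense Doolittle factorization without pivoting, for illustration only; a circuit solver works on sparse storage and pivots for numerical stability:

```python
import numpy as np

def lu_factorize(A):
    """Doolittle LU without pivoting: returns L (unit lower-triangular)
    and U (upper-triangular) with A = L @ U."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]      # elimination multiplier
            U[i, k:] -= L[i, k] * U[k, k:]   # zero out U[i, k]
    return L, U

A = np.array([[4., 3.], [6., 3.]])
L, U = lu_factorize(A)
assert np.allclose(L @ U, A)
```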

Page 12:

CUDA programming

• Compute Unified Device Architecture

• The CPU code does the sequential part.

• Highly parallel parts are usually implemented in GPU code, called kernels.

• Calling a GPU function from the CPU code is called a kernel launch.

Page 13:

Execution of GPU thread

• Threads are grouped into thread blocks.

• Each thread block is assigned to a streaming multiprocessor (SM), which contains multiple streaming processors (SPs), to be executed.

• The actual execution of threads on SPs is done in groups of 32 threads, called warps.

• SPs execute one warp at a time.


Page 14:

GPU architecture

• nVidia GeForce 9800 GTX


Page 15:

Sparse LU factorization on GPU

• Overall flow
• Preprocessing
• Exposing more parallelism
• Ensuring timing order
• Optimizing memory access pattern

Page 16:

Overall flow


Page 17:

Preprocessing

• HSL_MC64 algorithm to improve numerical stability.
  – Finds a permutation matrix.

• AMD (Approximate Minimum Degree) algorithm to reduce fill-ins.
  – Finds a permutation matrix.

• G/P algorithm based pre-factorization (a complete numeric factorization with partial pivoting) to calculate the symbolic structure of the LU factors.
  – Can also extract the total flop count.
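The permutation steps can be illustrated with SciPy. HSL_MC64 and AMD themselves do not ship with SciPy, so the sketch below uses reverse Cuthill-McKee purely as a stand-in ordering; what it shows is only how a fill-reducing permutation p is applied symmetrically to rows and columns before factorization:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = csc_matrix(np.array([
    [4., 0., 0., 1.],
    [0., 3., 1., 0.],
    [0., 1., 2., 0.],
    [1., 0., 0., 5.],
]))

# Stand-in for AMD: any fill-reducing ordering yields a permutation p,
# which is applied to both rows and columns before numeric factorization.
p = reverse_cuthill_mckee(A, symmetric_mode=True)
A_perm = A[p, :][:, p]
assert A_perm.nnz == A.nnz   # a permutation never changes the nonzero count
```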

Page 18:

Exposing more parallelism

• Sequential G/P left-looking algorithm

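In dense storage, the left-looking idea reads as below. This is a sketch of the column-by-column structure only (no pivoting, dense working vector), not the paper's sparse GPU kernel:

```python
import numpy as np

def left_looking_lu(A):
    """Left-looking LU: column j is first updated by every earlier
    column k with U[k, j] != 0, then scaled by its pivot."""
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for j in range(n):
        x = A[:, j].astype(float).copy()   # the working vector x
        for k in range(j):                 # look "left" at finished columns
            if x[k] != 0.0:
                x[k + 1:] -= x[k] * L[k + 1:, k]
        U[:j + 1, j] = x[:j + 1]
        L[j + 1:, j] = x[j + 1:] / x[j]    # divide by the pivot U[j, j]
    return L, U

A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
L, U = left_looking_lu(A)
assert np.allclose(L @ U, A)
```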

Page 19:


Page 20:

Exposing more parallelism

• Dependency graph and scheduler


Page 21:

Exposing more parallelism (cont.)

• Treat the threads that process the same column as a virtual group.

• In cluster mode
  – Columns are very sparse, so while ensuring enough threads in total, we make virtual groups as small as possible to minimize idle threads.

• In pipeline mode
  – Columns usually contain enough nonzeros for one or several warps, so the size of virtual groups matters little for reducing idle threads. We use one warp as one virtual group.

Page 22:

Ensuring timing order

Page 23:

Ensuring timing order (cont.)

• Suppose columns 8, 9 and 10 are being processed, and the other columns are finished. Column 9 can first be updated with columns 4, 6 and 7, corresponding to the solid green arrows. But column 9 cannot yet be updated with column 8; it must wait for column 8 to finish. The situation is similar for column 10.
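The waiting rule above can be phrased as a dependency level: column j depends on every column k < j with a nonzero U(k, j), and columns sharing a level are mutually independent. A minimal sketch of computing levels (the paper's scheduler is more elaborate; the function name here is illustrative):

```python
def column_levels(deps):
    """deps[j] lists the columns that column j must wait for (all k < j).
    Columns with equal level are independent and can run in parallel."""
    level = []
    for j in range(len(deps)):
        # One more than the deepest dependency; 0 if there are none.
        level.append(1 + max((level[k] for k in deps[j]), default=-1))
    return level

# Toy dependency graph: columns 0 and 1 are independent; 2 waits on 0,
# 3 waits on 0 and 1, 4 waits on 2 and 3.
print(column_levels([[], [], [0], [0, 1], [2, 3]]))  # → [0, 0, 1, 1, 2]
```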

Page 24:

Ensuring timing order (cont.)

• The key is to avoid deadlock.
  – Not all warps are active at the beginning.
  – If we activate warps in the wrong order in the pipeline mode, deadlock will occur.

• There is no inter-warp context switching.

Page 25:

Optimizing memory access pattern

• The main difference between CPU and GPU parallel programming is the memory access pattern.

• Two alternative data formats for the intermediate vectors (x in Algorithm 2):
  – CSC (Compressed Sparse Column) sparse vectors.
    • Save space and can be placed in shared memory.
  – Dense arrays.
    • Have to reside in global memory.

• Two reasons to choose dense arrays:
  – The CSC format is inconvenient for indexed accesses.
  – Using too much shared memory would reduce the number of resident warps per SM and hence degrade performance.

Page 26:

CSC format

• Specified by three arrays {val, row_ind, col_ptr}: val stores the nonzero values column by column, row_ind stores the row index of each nonzero, and col_ptr stores the index in val at which each column of A starts.

• Example
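For a concrete example, the three arrays for a small 3×3 matrix look like this (a sketch; scipy.sparse.csc_matrix uses the same layout under the names {data, indices, indptr}):

```python
import numpy as np

A = np.array([
    [1., 0., 4.],
    [2., 3., 0.],
    [0., 0., 5.],
])

# CSC storage of A: val lists the nonzeros column by column, row_ind
# their row indices, and col_ptr[j] the offset in val where column j
# starts (so col_ptr has n + 1 entries and col_ptr[-1] == nnz).
val     = [1., 2., 3., 4., 5.]
row_ind = [0,  1,  1,  0,  2]
col_ptr = [0, 2, 3, 5]

# Expanding the three arrays back to dense form recovers A.
dense = np.zeros_like(A)
for j in range(len(col_ptr) - 1):
    for p in range(col_ptr[j], col_ptr[j + 1]):
        dense[row_ind[p], j] = val[p]
assert np.array_equal(dense, A)
```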

Page 27:

Improve data locality

• Memory access coalescing.
  – Several memory transactions can be coalesced into one when consecutive threads access consecutive memory locations.
  – Sort the nonzeros in L and U by their row indices to improve data locality.
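A sketch of the sorting step on CSC arrays: within each column, the (row_ind, val) pairs are reordered by row index, so consecutive entries of a column, and hence the consecutive threads reading them, touch monotonically increasing rows:

```python
def sort_columns(val, row_ind, col_ptr):
    """Sort each CSC column's nonzeros by row index (a host-side sketch;
    the paper does this once during preprocessing)."""
    val, row_ind = list(val), list(row_ind)
    for j in range(len(col_ptr) - 1):
        lo, hi = col_ptr[j], col_ptr[j + 1]
        # Permutation of positions lo..hi that orders rows ascending.
        order = sorted(range(lo, hi), key=lambda p: row_ind[p])
        val[lo:hi]     = [val[p] for p in order]
        row_ind[lo:hi] = [row_ind[p] for p in order]
    return val, row_ind

# Column 0 holds rows (1, 0), column 1 holds rows (2, 0) — both unsorted.
v, r = sort_columns([2., 1., 5., 4.], [1, 0, 2, 0], [0, 2, 4])
assert v == [1., 2., 4., 5.] and r == [0, 1, 0, 2]
```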

Page 28:

Effectiveness of sorting

• The GPU bandwidth increase from sorting is about 2.4x on average.
• CPU sparse LU factorization also benefits from sorted nonzeros, but its bandwidth increase is only 1.15x.

Page 29:

Experimental results

• Environments
  – 2 Xeon E5405 CPUs, 2 Xeon X5680 CPUs, an AMD Radeon 5870 GPU, and an NVIDIA GTX 580 GPU.
  – Experiments on CPU are implemented in C on a 64-bit Linux server. The Radeon 5870 is programmed using OpenCL 1.1; the GTX 580 is programmed with CUDA 4.0.

• Benchmarks are from the University of Florida Sparse Matrix Collection.

Page 30:

Devices specifications


Page 31:

Performance and speedup

• Group A contains cases under 200 Mflops.
  – Results are mostly worse than the CPU version.
• Group B contains cases over 200 Mflops.
• Group C contains cases with many denormal numbers during factorization.
  – The CPU cannot handle denormal numbers at normal speed, so the GPU achieves a great speedup.

Page 32:

Performance and speedup (cont.)

• We can see the GPU bandwidth is positively related to Mflops, which indicates that in sparse LU factorization, the high memory bandwidth of GPU can be exploited only when the problem scale is large enough.


Page 33:

Scalability Analysis

• The average and detailed performance on the four devices are listed in the table and figure, respectively.

Page 34:

Scalability Analysis (cont.)

• The best performance is attained with about 24 resident warps per SM, rather than with the maximum number of resident warps.

• On the GTX 580, it achieves at most 74% of the peak bandwidth.

Page 35:

Scalability Analysis (cont.)

• On the Radeon 5870, it achieves at most 45% of the peak bandwidth (on xenon1). A primary reason is that there are too few active wavefronts on the Radeon 5870 to fully utilize the global memory bandwidth.

• On the two CPUs and the Radeon 5870 GPU, the bandwidth keeps increasing with the number of issued threads (wavefronts).

Page 36:

Hybrid solver for circuit simulation

• Matrices with few flops in factorization are not suitable for GPU acceleration.
  – Combine the CPU and GPU versions into a hybrid solver.
  – Choose one of them based on the flop count.
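A sketch of the dispatch logic, with the flop estimate taken from the symbolic pre-factorization. The 200 Mflops threshold and the function name are illustrative, borrowed from the group A/B split in the results rather than prescribed by the paper:

```python
# Illustrative threshold: group A (< 200 Mflops) ran mostly slower on GPU.
GPU_FLOP_THRESHOLD = 200e6

def choose_solver(predicted_flops):
    """Pick the factorization backend from the predicted flop count."""
    return "gpu" if predicted_flops >= GPU_FLOP_THRESHOLD else "cpu"

assert choose_solver(5e6) == "cpu"   # tiny circuit matrix: stay on CPU
assert choose_solver(1e9) == "gpu"   # flop-heavy matrix: offload to GPU
```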

Page 37:

Conclusions

• The sparse matrix solver is one of the runtime bottlenecks of SPICE.

• This is the first work on GPU-based sparse LU factorization intended for circuit simulation.

• Experiments demonstrate that the GPU outperforms the CPU on matrices with many floating-point operations in factorization.