
Unveiling Cellular Mechanisms Using GPU-based Sparse Linear Algebra

Marco Maggioni (CS, PhD Research Assistant)
Tanya Berger-Wolf (CS, Associate Professor)
Jie Liang (BIOE, Professor)

Outline

• Motivations
• Background
  • Chemical Master Equation
  • Working example
• Linear algebra methods
  • Steady-state
  • Transient dynamics
• GPU-based implementation
  • Domain-specific optimization strategies
  • Warp-grained ELL format
• Experimental results and conclusions

Motivations

• The Chemical Master Equation provides a fundamental stochastic framework for studying biochemical reaction networks [1]
• The steady state and the dynamics of the system can be derived using linear algebra methods
• GPUs can handle the computational demand
  • Solving the same reaction network under different conditions
  • Exponential growth in the number of network microstates

[1] Jie Liang and Hong Qian, "Computational Cellular Dynamics Based on the Chemical Master Equation: A Challenge for Understanding Complexity", Journal of Computer Science and Technology, 2010


Background

• In the theory of the Chemical Master Equation, the dynamics of a biochemical reaction system in a small volume is represented by a discrete-state, continuous-time Markov process
• Microstates defined by the molecular species counts

$$x = \{x_1, x_2, \ldots, x_m\} \in \mathbb{N}^m$$

• Transition from microstate $x_j$ to $x_i$ defined by a rate (a worked example follows)

$$A_k(x_i, x_j) = r_k \prod_{i=1}^{m} \binom{x_i}{c_i}$$

  • $r_k$ is the rate constant of reaction $k$
  • $c_i$ is the number of molecules of species $i$ involved in the reaction
• Landscape probability defined over the microstates
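As a quick illustration of the transition rate above (a hypothetical example, not from the original slides): for a dimerization reaction $2A \rightarrow B$ with rate constant $r_k$, only species $A$ appears as a reactant, with $c_A = 2$, so

$$A_k(x_i, x_j) = r_k \binom{x_A}{2} = r_k \, \frac{x_A (x_A - 1)}{2},$$

i.e., the rate is proportional to the number of distinct pairs of $A$ molecules available in the source microstate $x_j$.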

Background

• The discrete Chemical Master Equation describes the change in probability of each microstate due to reactions

$$\frac{dP(x,t)}{dt} = \sum_{x' \neq x} \Big[ \underbrace{A(x,x')\,P(x',t)}_{\text{ingoing}} - \underbrace{A(x',x)\,P(x,t)}_{\text{outgoing}} \Big]$$

• The previous can be written in matrix form (a small example follows) by defining

$$A(x,x) = -\sum_{x' \neq x} A(x',x)$$

$$\frac{dP(t)}{dt} = A\,P(t)$$

where $P(t)$ is the landscape probability at time $t$ and $A$ is the reaction rate matrix
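To make the matrix form concrete, here is a small hypothetical example (not from the original slides): a birth-death network $\varnothing \rightarrow A$ with rate $k_s$ and $A \rightarrow \varnothing$ with rate $k_d$ per molecule, with the copy number of $A$ buffered at 2. The microstates are $x_A \in \{0, 1, 2\}$ and, ordering $P(t) = (P_0, P_1, P_2)^T$, the reaction rate matrix is

$$A = \begin{pmatrix} -k_s & k_d & 0 \\ k_s & -(k_s + k_d) & 2k_d \\ 0 & k_s & -2k_d \end{pmatrix}$$

Each off-diagonal entry $A(x, x')$ is the rate of the reaction taking microstate $x'$ to $x$, and each diagonal entry makes the corresponding column sum to zero, as required by the definition above.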

Background

• Matrix $A$ is the infinitesimal generator of a Markov model
• Steady state as the solution of a system of linear equations

$$\frac{dP(t)}{dt} = 0 \;\Rightarrow\; AP = 0$$

giving insight into the macroscopic states of the system
• Dynamics as a matrix exponential

$$\frac{dP(t)}{dt} = AP \;\Rightarrow\; P(t) = e^{At}P(0)$$

a stochastic simulation to study the dynamics of the biological mechanisms

Working example

• Steady-state probability landscape of the toggle network
[Figure: toggle network and its steady-state probability landscape]
• Insight about the macroscopic states of the system

Working example

• Dynamics of the landscape probability of the toggle network

$$\lim_{t \to \infty} \big[ e^{At} P(0) \big] = \bar{P}$$

• Stochastic simulation to study the dynamics of the biological mechanisms
• The initial landscape is uniformly distributed, then converges to the steady state $\bar{P}$


Steady-state

• Computing the steady state is equivalent to solving a system of linear equations
• Jacobi iteration
  • Easy to parallelize (very similar to SpMV)
  • Numerically well-suited for ill-conditioned reaction rate matrices
• Iteration derived from the simple decomposition $A = D + L + U$ (diagonal, strictly lower and strictly upper parts)
• Iterative convergence when the spectral radius $\rho(M) < 1$

$$Ax = 0 \;\Rightarrow\; x = \big[-D^{-1}(L+U)\big]\,x \;\Rightarrow\; x^{k+1} = M\,x^{k}$$

Steady-state

• Each row is an independent computation

$$x_i^{k+1} = -a_{ii}^{-1} \left( \sum_{j \neq i} a_{ij}\, x_j^{k} \right)$$

• Trade-off between convergence and parallelism
• Normalized stopping criterion based on the residual vector $r = Ax$

$$\frac{\lVert A x^{k} \rVert}{\lVert x^{k} \rVert} \leq \varepsilon$$

• Also check for stagnation and a maximum number of iterations

Transient dynamics

• The matrix exponential is a series with infinitely many terms

$$P(t) = e^{At}P_0 = \sum_{k=0}^{\infty} \frac{1}{k!} (At)^{k} P_0$$

• Computing it directly creates a dense matrix (not feasible due to "fill-in")
• Krylov-based approximation technique introduced in [2]
  • Reduced-size projection $H_m$ of matrix $A$ onto a Krylov subspace
  • The small matrix exponential is computed explicitly with a Padé approximant

[2] Roger B. Sidje and William J. Stewart, "A numerical study of large sparse matrix exponentials arising in Markov chains", Computational Statistics & Data Analysis, 1999

Transient dynamics

• Iterative (Arnoldi) algorithm to build the Krylov subspace (a sketch follows this slide)
  • Basis normalization (unit vector)
  • Orthogonal basis
  • Span the next basis vector of the Krylov subspace
• The dynamics is approximated as

$$e^{At}P_0 \approx \beta\, V_{m+1}\, e^{\bar{H}_{m+1} t}\, e_1$$
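For concreteness, below is a minimal, host-side sketch of the Arnoldi process that builds the orthonormal basis $V$ and the projected Hessenberg matrix $H$. It uses a dense matrix-vector product for readability, whereas the actual implementation performs this step with the GPU SpMV kernels described in the following slides; all names are illustrative, and the small exponential $e^{Ht}e_1$ (Padé approximant) is not shown.

```cuda
// Host-side sketch of the Arnoldi process: builds an orthonormal basis V of the
// Krylov subspace span{P0, A*P0, ..., A^(m-1)*P0} and the (m+1) x m projection H.
// Dense mat-vec for readability only; on the GPU this step is a sparse SpMV.
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // row-major dense matrix (illustrative)

static double norm2(const Vec& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return std::sqrt(s);
}

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// V: m+1 basis vectors of length n; H: (m+1) x m upper-Hessenberg projection.
void arnoldi(const Mat& A, const Vec& P0, int m, Mat& V, Mat& H) {
    const double beta = norm2(P0);
    V.assign(m + 1, Vec(P0.size(), 0.0));
    H.assign(m + 1, Vec(m, 0.0));

    for (size_t i = 0; i < P0.size(); ++i)
        V[0][i] = P0[i] / beta;               // basis normalization (unit vector)

    for (int j = 0; j < m; ++j) {
        Vec w = matvec(A, V[j]);              // span the next Krylov vector
        for (int k = 0; k <= j; ++k) {        // modified Gram-Schmidt: keep the basis orthogonal
            double h = 0.0;
            for (size_t i = 0; i < w.size(); ++i) h += w[i] * V[k][i];
            H[k][j] = h;
            for (size_t i = 0; i < w.size(); ++i) w[i] -= h * V[k][i];
        }
        H[j + 1][j] = norm2(w);
        if (H[j + 1][j] == 0.0) break;        // happy breakdown: subspace is exhausted
        for (size_t i = 0; i < w.size(); ++i)
            V[j + 1][i] = w[i] / H[j + 1][j]; // next orthonormal basis vector
    }
    // e^{At}P0 is then approximated as beta * V * expm(H*t) * e1 (Pade step not shown).
}
```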


Domain-specific optimization strategies

• Well-defined structure…
  • Each row has a bound on the number of nonzeros (limited by the number of reactions in the network)
• ELL sparse format [3] is well-suited for vector architectures (a kernel sketch follows this slide)
  • Two dense n × k arrays (values and column indices)
  • Regular memory access
  • Coalesced (column-major)
  • Aligned (padding)

$$\text{efficiency}_{ELL} = \frac{nnz}{n \cdot k}$$

[3] Nathan Bell and Michael Garland, "Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", Conference on High Performance Computing Networking, Storage and Analysis, 2009
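To make the layout concrete, here is a minimal sketch of an ELL SpMV kernel with one thread per row and column-major storage; the array names and the zero-padding convention are illustrative assumptions, not the exact code from the talk.

```cuda
// Minimal sketch of an ELL sparse matrix-vector product y = A*x, one thread per
// row. Both dense n x k arrays are stored column-major so that consecutive
// threads in a warp read consecutive addresses (coalesced access); padded
// entries carry a value of 0.0.
__global__ void spmv_ell(int n, int k,
                         const int*    ell_col,   // n*k column indices, column-major
                         const double* ell_val,   // n*k values, column-major, zero-padded
                         const double* x,
                         double*       y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double dot = 0.0;
    for (int j = 0; j < k; ++j) {
        int    idx = j * n + row;       // column-major indexing
        double a   = ell_val[idx];
        if (a != 0.0)                   // skip padding
            dot += a * x[ell_col[idx]];
    }
    y[row] = dot;
}
```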

Domain-specific optimization strategies

• Well-defined structure
  • The diagonal is dense, since

$$A(x,x) = -\sum_{x' \neq x} A(x',x)$$

  • Diagonal stored as a separate dense vector
    • Reduced column indexing
    • Readily available for the Jacobi iteration (see the sketch after this slide)

$$x_i^{k+1} = -a_{ii}^{-1} \left( \sum_{j \neq i} a_{ij}\, x_j^{k} \right)$$
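As an illustration of the previous two points, here is a minimal sketch of a Jacobi sweep kernel that keeps the dense diagonal in its own vector and the off-diagonal entries in column-major ELL arrays; the names and the zero-padding convention are illustrative assumptions.

```cuda
// One Jacobi sweep for Ax = 0 with the diagonal kept as a separate dense vector
// and the off-diagonal nonzeros in column-major ELL arrays:
//   x_new[i] = -(1/diag[i]) * sum_j offdiag(i,j) * x_old[col(i,j)]
__global__ void jacobi_sweep_ell(int n, int k,
                                 const double* diag,      // dense diagonal a_ii
                                 const int*    ell_col,   // n*k off-diagonal column indices
                                 const double* ell_val,   // n*k off-diagonal values, zero-padded
                                 const double* x_old,
                                 double*       x_new)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double sum = 0.0;
    for (int j = 0; j < k; ++j) {
        int    idx = j * n + row;          // column-major, coalesced across the warp
        double a   = ell_val[idx];
        if (a != 0.0)                      // skip padded slots
            sum += a * x_old[ell_col[idx]];
    }
    x_new[row] = -sum / diag[row];         // each row is an independent update
}
```

The host would launch this kernel repeatedly, swapping x_old and x_new, and evaluate the normalized residual, stagnation, and maximum-iteration criteria between sweeps.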

Domain-specific optimization strategies

• Well-defined structure
  • Biochemical networks often include reversible reactions
  • A DFS visit of the state space exposes a densely populated band due to long chains of reversible microstates
  • This ordering is readily available from [4]; no additional reordering is needed
• ELL+DIA combined format
  • DIA provides coalesced access to x, but is not always aligned

[4] Youfang Cao and Jie Liang, "Optimal enumeration of state space of finitely buffered stochastic molecular networks and exact computation of steady state landscape probability", BMC Systems Biology, 2008

Warp-grained ELL format

• Based on the Sliced ELL [5] sparse format (a kernel sketch follows this slide)
  • Better efficiency thanks to local (per-slice) ELL data structures
• Warp-grained ELL based on warp-sized slices
  • Decouples the slice size s (data structure) from the block size b (CUDA block)
  • Efficiency + SM occupancy
• Local row rearrangement
  • Decreases the nonzero variability within warps
  • Done locally within blocks to preserve cache locality

[5] Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures", HiPEAC, 2010
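Below is a minimal sketch of an SpMV kernel over a warp-sized sliced ELL layout: rows are grouped into slices of 32 (one warp per slice), each slice is stored column-major with its own width, and a slice_ptr array gives the offset of each slice in the packed arrays. The layout, names, and padding convention are illustrative assumptions, not the exact data structure from the talk.

```cuda
// SpMV y = A*x over a warp-sized sliced ELL layout. Rows are assumed padded so
// that every slice holds exactly 32 rows; slice_ptr has (num_slices + 1) entries
// measured in elements of the packed col/val arrays.
#define SLICE 32   // warp-sized slice

__global__ void spmv_warp_ell(int n,
                              const int*    slice_ptr,  // per-slice offsets into col/val
                              const int*    col,        // packed column indices
                              const double* val,        // packed values, zero-padded
                              const double* x,
                              double*       y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    int slice = row / SLICE;
    int lane  = row % SLICE;

    int begin = slice_ptr[slice];
    int width = (slice_ptr[slice + 1] - begin) / SLICE;   // local width of this slice

    double dot = 0.0;
    for (int j = 0; j < width; ++j) {
        int    idx = begin + j * SLICE + lane;   // coalesced within the warp
        double a   = val[idx];
        if (a != 0.0)                            // skip padded slots
            dot += a * x[col[idx]];
    }
    y[row] = dot;
}
```

Because each slice carries only its own width, short rows no longer pay for the longest row of the whole matrix, which is what improves the ELL efficiency metric defined earlier.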


Experimental results

• Experimental setup
  • Fermi architecture [email protected], 512 CUDA cores, 3GB GDDR5@192GB/s
  • Quad-socket Opteron [email protected] GHz, 16 cores, 128GB [email protected]/s
  • Double-precision floating point, 48 KB L1 cache
• Benchmarks

Experimental results

• Sparse MVP with the ELL+DIA format
  • Block size b = 256 (chosen by exhaustive testing)
  • The 48 KB L1 cache configuration gives a 6% average improvement over 16 KB
  • Best speedup on benchmarks with a dense diagonal band

Experimental results

• Sparse MVP with the warp-grained ELL format
  • Warp-grained ELL gives an 8% improvement over ELL and 6% over Sliced ELL
  • Average memory footprint reduced from 440.98 MB (ELL) to 322.45 MB, slightly better than CSR
  • Global reordering decreases average performance by 6%

[6] Bor-Yiing Su and Kurt Keutzer, "clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs", ICS, 2012

Experimental results

• Sparse MVP with the warp-grained ELL format (University of Florida sparse matrix collection)

Experimental results

• Steady-state probability (Jacobi solver)
  • Compared to a multicore implementation derived from the Intel MKL library
  • Error tolerance set to ε = 10⁻⁸

Experimental results

• Transient dynamics probability (Arnoldi method)
  • The speedup on the Arnoldi method is slightly lower due to the additional vector routines

Conclusions

• We have presented a GPU-based implementation of the steady-state (15.67x speedup) and transient-dynamics (12.60x speedup) solvers for studying biochemical reaction networks
• The Kepler architecture should bring further benefits in terms of memory hierarchy (bandwidth and read-only cache)
• GPU clusters can virtually extend the approach to any realistic network
  • A new direction for biological computing, comparable to molecular dynamics simulation
• The work generalizes to any large Markov model
  • Analysis of the dynamic landscape to identify rare events

Questions


• Marco Maggioni, Tanya Berger-Wolf and Jie Liang, "GPU-based Steady-State Solution of the Chemical Master Equation", HiCOMB, 2013
• This work was supported by NSF grants IIS-106468 and DBI-1062328, and NIH grant GM-079804