

for more information ... http://www.tops-scidac.org

Performance Tuning
TOPS is providing applications with highly efficient implementations of common sparse matrix computational kernels, automatically tuned for a user’s kernel, matrix, and machine.

Trends and the Need for Automatically Tuned Sparse Kernels

Search-based Methodology for Automatic Performance Tuning

Impact on Applications and Evaluation of Architectures

Approach to automatic tuning:
Identify and generate a space of implementations.
Search this space using empirical models and experiments.

Example: choosing an r x c block size
Off-line benchmark [machine]: measure Mflops(r,c) for a dense matrix stored in sparse r x c blocked format.
Run-time search [matrix]: estimate Fill(r,c) for all r, c.
Heuristic model [combine]: choose r, c to maximize Estimated Mflops = Mflops(r,c) / Fill(r,c).
This heuristic yields performance within 10% of the best r, c.
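A minimal sketch of this heuristic in Python. The CSR inputs (`indptr`, `indices`), the sampling fraction, and the `mflops` table produced by the off-line benchmark are illustrative assumptions, not the released library's interface:

```python
import random

def estimate_fill(indptr, indices, r, c, sample_frac=0.05, seed=0):
    """Estimate Fill(r,c): stored entries under r x c blocking (including
    padded explicit zeros) divided by true non-zeros, from a sample of
    block rows.  A sketch of the run-time [matrix] step."""
    n_rows = len(indptr) - 1
    n_block_rows = (n_rows + r - 1) // r
    rng = random.Random(seed)
    sample = rng.sample(range(n_block_rows),
                        max(1, int(sample_frac * n_block_rows)))
    stored = true_nnz = 0
    for bi in sample:
        block_cols = set()
        for i in range(bi * r, min((bi + 1) * r, n_rows)):
            for j in indices[indptr[i]:indptr[i + 1]]:
                block_cols.add(j // c)
            true_nnz += indptr[i + 1] - indptr[i]
        stored += len(block_cols) * r * c
    return stored / max(true_nnz, 1)

def choose_block_size(indptr, indices, mflops, max_r=8, max_c=8):
    """Heuristic [combine] step: pick (r, c) maximizing
    Mflops(r,c) / Fill(r,c), where mflops[(r, c)] comes from the
    off-line dense-in-sparse-format benchmark."""
    best, best_rate = (1, 1), 0.0
    for r in range(1, max_r + 1):
        for c in range(1, max_c + 1):
            fill = estimate_fill(indptr, indices, r, c)
            rate = mflops[(r, c)] / max(fill, 1e-12)
            if rate > best_rate:
                best, best_rate = (r, c), rate
    return best
```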

Performance Optimizations for SpMV
Register blocking (RB): up to 4x speedup over CSR (see the BCSR sketch after this list)
Variable block splitting: 2.1x over CSR, 1.8x over RB
Diagonal segmenting: 2x over CSR
Reordering to create dense structure + splitting: 2x over CSR
Symmetry: 2.8x over CSR, 2.6x over RB
Cache blocking: 2.2x over CSR
Multiple vectors: 7x over CSR
And combinations…
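For reference, a sketch of the block CSR (BCSR) layout that register blocking targets, assuming block rows aligned to multiples of r and blocks stored row-major; a tuned implementation fully unrolls the r x c inner loops into registers rather than looping:

```python
def bcsr_spmv(r, c, b_indptr, b_indices, b_values, x, y):
    """y += A*x for A in r x c block CSR: b_indptr/b_indices index block
    rows and block columns, b_values stores each block contiguously."""
    n_block_rows = len(b_indptr) - 1
    for bi in range(n_block_rows):
        i0 = bi * r                          # first scalar row of this block row
        for k in range(b_indptr[bi], b_indptr[bi + 1]):
            j0 = b_indices[k] * c            # first scalar column of this block
            blk = k * r * c                  # offset of this block's values
            for ii in range(r):
                acc = y[i0 + ii]
                for jj in range(c):
                    acc += b_values[blk + ii * c + jj] * x[j0 + jj]
                y[i0 + ii] = acc
    return y
```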

Sparse triangular solve
Hybrid sparse/dense data structure: 1.8x over CSR
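A sketch of the idea, assuming a split in which the leading rows stay sparse (CSR) and a trailing d x d triangle, which fills in heavily during factorization, is stored dense; the split point and array layout here are assumptions for illustration, not the released data structure:

```python
import numpy as np

def hybrid_lower_solve(indptr, indices, values, diag, T_dense, b):
    """Solve L y = b where L is lower triangular: strictly-lower entries whose
    columns fall in the leading block are in CSR (indptr, indices, values),
    diag holds the leading diagonal, and T_dense is the trailing d x d
    dense triangle."""
    n, d = len(b), T_dense.shape[0]
    ns = n - d                                   # size of the sparse leading block
    y = np.asarray(b, dtype=float).copy()
    # Forward substitution through the sparse leading rows.
    for i in range(ns):
        s = sum(values[k] * y[indices[k]] for k in range(indptr[i], indptr[i + 1]))
        y[i] = (y[i] - s) / diag[i]
    # Fold the sparse columns into the trailing right-hand side,
    # then finish with one dense triangular solve.
    for i in range(ns, n):
        s = sum(values[k] * y[indices[k]] for k in range(indptr[i], indptr[i + 1]))
        y[i] -= s
    y[ns:] = np.linalg.solve(np.tril(T_dense), y[ns:])
    return y
```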

Higher-level kernels
AA^T*x and A^T A*x: 4x over CSR, 1.8x over RB (see the sketch below)
A^2*x: 2x over CSR, 1.5x over RB
Matrix triple products, …
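The A^T A*x speedup comes from reusing each row of A while it is still in cache, instead of making two passes (t = A*x, then A^T*t). A minimal sketch, with the CSR arrays and n_cols as assumed inputs:

```python
import numpy as np

def ata_times_x(indptr, indices, values, n_cols, x):
    """y = (A^T A) x in one pass over A stored in CSR: for each row a_i,
    compute t = a_i . x and immediately apply y += t * a_i, so the row's
    values and indices are read from memory only once."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_cols)
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        t = 0.0
        for k in range(start, end):          # t = a_i . x
            t += values[k] * x[indices[k]]
        for k in range(start, end):          # y += t * a_i (row still in cache)
            y[indices[k]] += t * values[k]
    return y
```

AA^T*x admits the same one-pass structure over the columns of A (equivalently, the rows of A^T).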

Figure legend: Best, Reference (CSR), Dense (90% of non-zeros).

Less than 10% of peak: Typical untuned sparse matrix-vector multiply (SpMV) performance is below 10% of peak on modern cache-based superscalar machines. With careful tuning, 2x speedups and 30% of peak or more are possible.

The optimal choice of tuning parameters can be surprising: (Left) A matrix that naturally contains 8x8 dense blocks. (Right) On an Itanium 2, the optimal block size of 4x2 achieves 1.1 Gflop/s (31% of peak) and is over 4x faster than the conventional unblocked (1x1) implementation.

Extra work can improve performance: Filling in explicit zeros (shown as x) followed by 3x3 blocking increases the number of flops by 1.5x for this matrix, but SpMV still runs in 1.5x less time than not blocking on a Pentium III because the raw speed in Mflop/s increases by 2.25x.
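The arithmetic behind that trade-off, writing F for the unblocked flop count and R for the unblocked Mflop/s rate:

```latex
\frac{T_{\text{blocked}}}{T_{\text{unblocked}}}
  = \frac{(1.5\,F)/(2.25\,R)}{F/R}
  = \frac{1.5}{2.25} \approx 0.67
```

So the blocked kernel finishes in about two-thirds of the time, i.e., roughly 1.5x faster, despite doing 1.5x the work.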

Off-line benchmarking characterizes the machine: For r x c register blocking, performance as a function of r and c varies across platforms. (Left) Ultra 3, 1.8 Gflop/s peak. (Right) Itanium 2, 3.6 Gflop/s peak.
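A sketch of this off-line [machine] step, reusing the `bcsr_spmv` sketch above; the matrix size, trial count, and block-size range are illustrative choices:

```python
import time
import numpy as np

def offline_benchmark(max_r=8, max_c=8, n=1000, trials=3):
    """Build the Mflops(r,c) machine profile by timing SpMV on a dense n x n
    matrix stored in r x c BCSR format for every candidate block size."""
    mflops = {}
    x = np.ones(n)
    for r in range(1, max_r + 1):
        for c in range(1, max_c + 1):
            nbr, nbc = n // r, n // c        # assume r and c divide n for simplicity
            b_indptr = [bi * nbc for bi in range(nbr + 1)]
            b_indices = list(range(nbc)) * nbr
            b_values = np.random.rand(nbr * nbc * r * c)
            y = np.zeros(n)
            best = float("inf")
            for _ in range(trials):
                t0 = time.perf_counter()
                bcsr_spmv(r, c, b_indptr, b_indices, b_values, x, y)
                best = min(best, time.perf_counter() - t0)
            mflops[(r, c)] = 2.0 * (nbr * r) * (nbc * c) / best / 1e6
    return mflops
```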

Complex combinations of dense substructures arise in practice. We are developing tunable data structures and implementations, and automated tuning parameter selection techniques.

Potential improvements to Tau3P/T3P/Omega3P, SciDAC accelerator cavity design applications by Ko, et al., at the Stanford Linear Accelerator Center (SLAC): (Left) Reordering matrix rows and columns, based on approximately solving the Traveling Salesman Problem (TSP), improves locality by creating dense block structure. (Right) Combining TSP reordering, symmetric storage, and register-level blocking leads to uniprocessor speedups between 1.5–3.3x compared to a naturally ordered, non-symmetric blocked implementation.
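A sketch of one simple way to approximate the TSP ordering: a greedy nearest-neighbor tour over a row-similarity measure (shared column indices). The `row_cols` interface and the similarity metric are assumptions for illustration, not the exact formulation used in the SLAC study:

```python
def tsp_row_order(row_cols):
    """Greedy nearest-neighbor tour over rows: repeatedly append the unvisited
    row sharing the most column indices with the current one, so adjacent rows
    have similar sparsity patterns and blocking finds more dense blocks.
    row_cols maps row index -> set of its column indices.  O(n^2) as written;
    production heuristics are cheaper."""
    remaining = set(row_cols)
    order = [min(remaining)]                 # arbitrary starting row
    remaining.discard(order[0])
    while remaining:
        cur = row_cols[order[-1]]
        nxt = max(remaining, key=lambda row: len(cur & row_cols[row]))
        order.append(nxt)
        remaining.discard(nxt)
    return order
```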

Figure legend: Before: Green + Red; After: Green + Blue.

Current and Future Work

Public software release: low-level “Sparse BLAS” primitives; integration with PETSc.

Integration with DOE applications: the SLAC collaboration, and a geophysical simulation based on Block Lanczos (A^T A*X; LBL).

New sparse benchmarking effort, with the University of Tennessee.

Multithreaded and MPI versions: sparse kernels; automatic tuning of MPI collective operations.

Pointers:
Berkeley Benchmarking and Optimization (BeBOP): bebop.cs.berkeley.edu
Self-Adapting Numerical Software (SANS) Effort: icl.cs.utk.edu/sans

Figure color scales: 50–90 Mflop/s (Ultra 3) and 190–1190 Mflop/s (Itanium 2).

Evaluating SpMV performance across architectures: Using a combination of analytical modeling of performance bounds and benchmarking tools being developed by SciDAC-PERC, we are studying the impact of architecture on sparse kernel performance.
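As one example of such a bound, a simplified streaming-memory model (an illustrative assumption, not the SciDAC-PERC tools' exact model): charge every matrix value, block index, row pointer, and vector element one trip from main memory at the sustained bandwidth, and treat arithmetic as free:

```python
def spmv_upper_bound_mflops(nnz, n_rows, n_cols, fill, r, c,
                            bandwidth_gbs, idx_bytes=4, val_bytes=8):
    """Upper bound on r x c blocked SpMV performance (Mflop/s) when memory
    bandwidth is the only constraint."""
    stored = nnz * fill                                # values incl. explicit zeros
    blocks = stored / (r * c)
    bytes_moved = (stored * val_bytes                  # block values
                   + blocks * idx_bytes                # one column index per block
                   + (n_rows / r + 1) * idx_bytes      # block-row pointers
                   + (n_rows + n_cols) * val_bytes)    # source and destination vectors
    seconds = bytes_moved / (bandwidth_gbs * 1e9)
    return 2.0 * nnz / seconds / 1e6                   # useful flops = 2 * nnz

# Example: ~1M non-zeros, 3x3 blocking with 10% fill overhead, 6 GB/s sustained bandwidth.
print(spmv_upper_bound_mflops(1_000_000, 100_000, 100_000, 1.10, 3, 3, 6.0))
```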
