
Page 1: BLASFEO - University of Texas at Austin

BLASFEO

Gianluca Frison

University of Freiburg

BLIS retreat, September 19, 2017


Page 2: BLASFEO - University of Texas at Austin

BLASFEO

- Basic Linear Algebra Subroutines For Embedded Optimization


Page 3: BLASFEO - University of Texas at Austin

BLASFEO performance

- Intel Core i7 4800MQ
- BLASFEO HP
- OpenBLAS 0.2.19
- MKL 2017.2.174
- ATLAS 3.10.3
- BLIS 0.1.6

- CAVEAT: BLASFEO is not API compatible with BLAS & LAPACK

[Plots: dgemm_nt and dpotrf_l performance, Gflops vs. matrix size n (up to 300); curves: BLASFEO_HP, OpenBLAS, MKL, ATLAS, BLIS.]


Page 4: BLASFEO - University of Texas at Austin

Framework: embedded optimization

- Framework: embedded optimization and control


Page 5: BLASFEO - University of Texas at Austin

Embedded control

Model Predictive Control (and Moving Horizon Estimation)

- solve an optimization problem on-line
- sampling times in the milli- or micro-second range


Page 6: BLASFEO - University of Texas at Austin

Karush-Kuhn-Tucker optimality conditions

KKT system (for N = 2)

\[
\begin{bmatrix}
Q_0 & S_0^T & A_0^T & & & & \\
S_0 & R_0 & B_0^T & & & & \\
A_0 & B_0 & & -I & & & \\
 & & -I & Q_1 & S_1^T & A_1^T & \\
 & & & S_1 & R_1 & B_1^T & \\
 & & & A_1 & B_1 & & -I \\
 & & & & & -I & Q_2
\end{bmatrix}
\begin{bmatrix}
x_0 \\ u_0 \\ \lambda_0 \\ x_1 \\ u_1 \\ \lambda_1 \\ x_2
\end{bmatrix}
=
\begin{bmatrix}
-q_0 \\ -r_0 \\ -b_0 \\ -q_1 \\ -r_1 \\ -b_1 \\ -q_2
\end{bmatrix}
\]

- Large, structured system of linear equations
- Sub-matrices are assumed dense or diagonal


Page 7: BLASFEO - University of Texas at Austin

Framework: embedded optimization

Assumptions about embedded optimization:

- Computational speed is a key factor: solve optimization problems in real-time on resource-constrained hardware.

- Data matrices are reused several times (e.g. at each optimization algorithm iteration): look for a good data structure.

- Structure-exploiting algorithms can exploit the high-level sparsity pattern: data matrices assumed dense.

- Size of matrices is relatively small (tens or a few hundreds): generally fitting in cache.

- Limitations of embedded optimization hardware and toolchain: no external libraries and no (static/dynamic) memory allocation.


Page 8: BLASFEO - University of Texas at Austin

The origin

- HPMPC: library for High-Performance implementation of solvers for Model Predictive Control


Page 9: BLASFEO - University of Texas at Austin

How to optimize syrk+potrf (for embedded optimization)

- How to optimize dsyrk + dpotrf (for embedded optimization)

- Test operation (a reference implementation with standard BLAS/LAPACK calls is sketched below):
\[
L = \left(Q + A \cdot A^T\right)^{1/2}
\]

- Netlib BLAS
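To make the test operation concrete, here is a minimal sketch of it written with standard CBLAS/LAPACKE calls on column-major data (the function name syrk_potrf_ref and the argument layout are illustrative, not BLASFEO's or Netlib's API):

    // Test operation L = chol(Q + A*A^T) as two standard calls.
    #include <cblas.h>
    #include <lapacke.h>

    // Q: n-by-n symmetric (lower triangle referenced), A: n-by-k, L: n-by-n output.
    // All matrices column-major with leading dimension n.
    // Returns 0 on success (LAPACK info value otherwise).
    int syrk_potrf_ref(int n, int k, const double *Q, const double *A, double *L)
    {
        // L <- Q (copy the lower triangle)
        for (int j = 0; j < n; j++)
            for (int i = j; i < n; i++)
                L[i + j*n] = Q[i + j*n];

        // L <- L + A*A^T (dsyrk, lower triangle, A not transposed)
        cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                    n, k, 1.0, A, n, 1.0, L, n);

        // L <- chol(L), lower Cholesky factor (dpotrf)
        return LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, L, n);
    }

This is the kind of two-call sequence whose per-call and data-movement overheads the following slides try to remove for small n.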

[Plots: test dsyrk + dpotrf, Gflops vs. matrix size n, full range (n up to 300) and zoom on small sizes (n up to 20).]


Page 10: BLASFEO - University of Texas at Austin

How to optimize syrk+potrf (for embedded optimization)

Code Generation

- e.g. fix the size of the loops: the compiler can unroll loops and avoid branches (see the sketch below)

- need to generate the code for each problem size
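A minimal illustration of the idea (not HPMPC's actual generated code), assuming the problem size N is known at code-generation time so the compiler can fully unroll the loops and drop all loop branches:

    // Illustrative code-generation style kernel: all loop bounds are
    // compile-time constants.
    #define N 4  // problem size fixed when the code is generated (illustrative)

    // C += A * B^T for fixed N-by-N column-major matrices.
    static void dgemm_nt_fixed(const double *A, const double *B, double *C)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++) {
                double c = C[i + j*N];
                for (int k = 0; k < N; k++)
                    c += A[i + k*N] * B[j + k*N];
                C[i + j*N] = c;
            }
    }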

[Plots: test dsyrk + dpotrf, Gflops vs. matrix size n (up to 300) and zoom (n up to 20).]


Page 11: BLASFEO - University of Texas at Austin

How to optimize syrk+potrf (for embedded optimization)

OpenBLAS

[Plots: test dsyrk + dpotrf, Gflops vs. matrix size n (up to 300) and zoom (n up to 20).]


Page 12: BLASFEO - University of Texas at Austin

How to optimize syrk+potrf (for embedded optimization)

HPMPC - register blocking

[Plots: test dsyrk + dpotrf, Gflops vs. matrix size n (up to 300) and zoom (n up to 20).]


Page 13: BLASFEO - University of Texas at Austin

How to optimize syrk+potrf (for embedded optimization)

HPMPC - SIMD instructions

- performance drop for n multiple of 32: limited cache associativity

[Plots: test dsyrk + dpotrf, Gflops vs. matrix size n (up to 300) and zoom (n up to 20).]


Page 14: BLASFEO - University of Texas at Austin

How to optimize syrk+potrf (for embedded optimization)

HPMPC - panel-major matrix format

- panel-major matrix format

- smooth performance

[Plots: test dsyrk + dpotrf, Gflops vs. matrix size n (up to 300) and zoom (n up to 20).]


Page 15: BLASFEO - University of Texas at Austin

Access pattern in optimized BLAS

Figure: Access pattern of data in different cache levels for the dgemm routine in GotoBLAS/OpenBLAS/BLIS. Data is packed (on-line) into buffers following the access pattern.


Page 16: BLASFEO - University of Texas at Austin

Panel-major matrix format

[Diagram: panel-major gemm, C += A · B^T, with the same panel width ps for all operands.]

- matrix format: can do any operation, also on sub-matrices

- matrix elements are stored in (almost) the same order as the gemm kernel accesses them (an off-line packing sketch follows below)

- optimal 'NT' gemm variant (A not-transposed, B transposed)

- the panel width ps is the same for the left and the right matrix operand, as well as for the result matrix
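As a rough illustration of off-line packing into such a layout, here is a minimal sketch, assuming a panel height PS of 4 and a padded panel leading dimension sda >= n (names and padding rules are illustrative, not BLASFEO's packing routine):

    // Pack a column-major matrix into a panel-major buffer:
    // horizontal panels of PS rows, each stored column-major with
    // leading dimension PS, panels laid out one after the other.
    #define PS 4  // panel height, e.g. 4 doubles for 256-bit SIMD

    // A: m-by-n, column-major, leading dimension lda.
    // pA: panel-major buffer holding ceil(m/PS) panels of PS x sda doubles.
    static void pack_panel_major(int m, int n, const double *A, int lda,
                                 double *pA, int sda)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++) {
                int panel = i / PS;   // which horizontal panel
                int row   = i % PS;   // row inside the panel
                pA[panel*PS*sda + j*PS + row] = A[i + j*lda];
            }
    }

The element address pA + panel*PS*sda + j*PS + row is the same rule used later on the "Matrix structure" slide.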


Page 17: BLASFEO - University of Texas at Austin

Optimized BLAS vs HPMPC software stack

[Figure: comparison of the software stacks of an optimized-BLAS-based solver and the HPMPC-based solver. Both stacks contain linear algebra kernels, linear algebra routines, a Riccati solver for unconstrained MPC, and an IPM solver for linear MPC; the optimized-BLAS-based stack additionally contains a high-level wrapper and packing of matrices. The layers are grouped into Part I, Part II, and Part III.]

Figure: Structure of a Riccati-based IPM for linear MPC problems when implemented using linear algebra in either optimized BLAS or HPMPC. Routines in the orange boxes use matrices in column-major format, routines in the green boxes use matrices in panel-major format.


Page 18: BLASFEO - University of Texas at Austin

How to optimize syrk+potrf (for embedded optimization)

HPMPC - merging of linear algebra routines

- specialized kernels for complex operations

- improves small-scale performance

- worse large-scale performance

[Plots: test dsyrk + dpotrf, Gflops vs. matrix size n (up to 300) and zoom (n up to 20).]


Page 19: BLASFEO - University of Texas at Austin

Merging of linear algebra routines: syrk + potrf

\[
L = \left(Q + A \cdot A^T\right)^{1/2} =
\begin{bmatrix}
L_{00} & * & * \\
L_{10} & L_{11} & * \\
L_{20} & L_{21} & L_{22}
\end{bmatrix}
=
\left(
\begin{bmatrix}
Q_{00} & * & * \\
Q_{10} & Q_{11} & * \\
Q_{20} & Q_{21} & Q_{22}
\end{bmatrix}
+
\begin{bmatrix}
A_0 \\ A_1 \\ A_2
\end{bmatrix}
\cdot
\begin{bmatrix}
A_0^T & A_1^T & A_2^T
\end{bmatrix}
\right)^{1/2}
\]
\[
=
\begin{bmatrix}
(Q_{00} + A_0 A_0^T)^{1/2} & * & * \\
(Q_{10} + A_1 A_0^T)\, L_{00}^{-T} & (Q_{11} + A_1 A_1^T - L_{10} L_{10}^T)^{1/2} & * \\
(Q_{20} + A_2 A_0^T)\, L_{00}^{-T} & (Q_{21} + A_2 A_1^T - L_{20} L_{10}^T)\, L_{11}^{-T} & (Q_{22} + A_2 A_2^T - L_{20} L_{20}^T - L_{21} L_{21}^T)^{1/2}
\end{bmatrix}
\]

- each sub-matrix computed using a single specialized kernel (the unmerged BLAS call sequence that such a kernel fuses is sketched below)
- reduce the number of function calls
- reduce the number of loads and stores of the same data
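For comparison, a sketch of the unmerged blocked computation with standard CBLAS/LAPACKE calls (assuming column-major storage, n a multiple of the block size nb, and illustrative names): every block of the result is visited by two separate routines (an update and a factorize/solve), so it is loaded from and stored back to memory twice; this is the traffic the merged kernels avoid.

    // Unmerged blocked factorization of M = Q + A*A^T (lower triangle of M
    // holds Q on entry and L on exit).
    #include <cblas.h>
    #include <lapacke.h>

    // M: n-by-n, A: n-by-k, both column-major with leading dimension n.
    int blocked_syrk_potrf(int n, int k, int nb, double *M, const double *A)
    {
        for (int j = 0; j < n; j += nb) {
            // diagonal block update: M[j:j+nb, j:j+nb] += A_j * A_j^T
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        nb, k, 1.0, &A[j], n, 1.0, &M[j + j*n], n);
            // ... minus the already computed part of L: - L_j,0:j * L_j,0:j^T
            if (j > 0)
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                            nb, j, -1.0, &M[j], n, 1.0, &M[j + j*n], n);
            // factorize the diagonal block (second pass over the same block)
            int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, &M[j + j*n], n);
            if (info) return info;

            int m_rest = n - j - nb;
            if (m_rest > 0) {
                // sub-diagonal panel update: M[j+nb:, j:j+nb] += A_rest * A_j^T
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            m_rest, nb, k, 1.0, &A[j+nb], n, &A[j], n,
                            1.0, &M[j+nb + j*n], n);
                if (j > 0)
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                m_rest, nb, j, -1.0, &M[j+nb], n, &M[j], n,
                                1.0, &M[j+nb + j*n], n);
                // triangular solve against the diagonal block (second pass again)
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, m_rest, nb, 1.0, &M[j + j*n], n,
                            &M[j+nb + j*n], n);
            }
        }
        return 0;
    }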


Page 20: BLASFEO - University of Texas at Austin

The present

- BLASFEO


Page 21: BLASFEO - University of Texas at Austin

BLASFEO

- aim: satisfy the need for high-performance linear algebra routines for small dense matrices

- mean: adapt high-performance computing techniques to the embedded optimization framework

- keep:
  - LA kernels (register-blocking, SIMD)
  - optimized data layout

- drop:
  - cache-blocking
  - on-line data packing
  - multi-thread (at least for now)

- add:
  - optimized (panel-major) matrix format
  - off-line data packing
  - assembly subroutines: tailored LA kernels & code reuse


Page 22: BLASFEO - University of Texas at Austin

Linear algebra level

- LA level definition
  - level 1: O(n) storage, O(n) flops
  - level 2: O(n^2) storage, O(n^2) flops
  - level 3: O(n^2) storage, O(n^3) flops

- in level 1 and level 2
  - reuse factor O(1)
  - memory-bounded

- in level 3
  - reuse factor O(n) (see the worked example below)
  - compute-bounded for large n
    - disregard O(n^2) terms
  - typically memory-bounded for small n
    - minimize O(n^2) terms ⇒ BLASFEO playground!
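As a worked example of the level-3 reuse factor (not from the slides): dgemm with n-by-n operands performs about 2n^3 flops on 3n^2 stored elements, so each element is reused roughly

\[
\frac{\text{flops}}{\text{data}} = \frac{2 n^3}{3 n^2} = \frac{2}{3}\, n = O(n)
\]

times, which is why level-3 routines become compute-bounded once n is large enough.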


Page 23: BLASFEO - University of Texas at Austin

BLASFEO RF

BLASFEO ReFerence

- 2x2 blocking for registers (a plain-C illustration follows below)

- column-major matrix format

- small code size

- ANSI C code, no external dependencies

- good performance for tiny matrices
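A plain-C illustration of 2x2 register blocking for a column-major gemm (a sketch in the spirit of the slide, not BLASFEO_RF's source): the four accumulators stay in registers, so every loaded element of A and B is used twice before being discarded.

    // C += A * B, column-major, 2x2 register blocking.
    // Assumes m and n are even; a real implementation needs cleanup code.
    static void dgemm_2x2_blocked(int m, int n, int k,
                                  const double *A, int lda,
                                  const double *B, int ldb,
                                  double *C, int ldc)
    {
        for (int j = 0; j < n; j += 2)
            for (int i = 0; i < m; i += 2) {
                double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
                for (int p = 0; p < k; p++) {
                    double a0 = A[i     + p*lda];
                    double a1 = A[i + 1 + p*lda];
                    double b0 = B[p + (j    )*ldb];
                    double b1 = B[p + (j + 1)*ldb];
                    c00 += a0*b0;  c01 += a0*b1;
                    c10 += a1*b0;  c11 += a1*b1;
                }
                C[i     + (j    )*ldc] += c00;
                C[i + 1 + (j    )*ldc] += c10;
                C[i     + (j + 1)*ldc] += c01;
                C[i + 1 + (j + 1)*ldc] += c11;
            }
    }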


Page 24: BLASFEO - University of Texas at Austin

BLASFEO HP

BLASFEO High-Performance

- optimal blocking for registers + vectorization

- no blocking for cache

- panel-major matrix format

- hand-written assembly kernels

- optimized for highest performance for matrices fitting in cache


Page 25: BLASFEO - University of Texas at Austin

BLASFEO WR

BLASFEO WRapper to BLAS and LAPACK

- provides a performance basis

- column-major matrix format

- optimized for many architectures

- possibly multi-threaded

- good performance for large matrices


Page 26: BLASFEO - University of Texas at Austin

Matrix structure

- problem: how to switch between panel-major and column-major formats?

- solution: use a C struct for the matrix "object"
  (..., struct d_strmat *sA, int ai, int aj, ...)

- if column-major, the first matrix element is at
      int lda = sA->m;
      double *ptr = sA->pA + ai + aj*lda;

- if panel-major, the first matrix element is at
      int ps = sA->ps;
      int sda = sA->cn;
      int air = ai & (ps-1);  // ai % ps, since ps is a power of two
      double *ptr = sA->pA + (ai-air)*sda + air + aj*ps;

- custom matrix struct allows other tricks
  - extra linear storage for inverse of diagonal in factorizations
  (a self-contained sketch of such a struct follows below)
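A self-contained sketch of such a matrix struct and the two addressing rules above (field and type names are illustrative; the actual BLASFEO struct and its padding rules may differ):

    // Matrix "object" carrying enough information to address elements in
    // either column-major or panel-major storage.
    struct d_strmat_sketch {
        int m;       // number of rows
        int n;       // number of columns
        int ps;      // panel height (power of two), panel-major only
        int cn;      // padded number of columns (panel leading dimension)
        double *pA;  // pointer to the matrix data
        double *dA;  // extra linear storage, e.g. inverse of the diagonal
    };

    // address of element (ai, aj) in column-major storage
    static double *el_colmaj(struct d_strmat_sketch *sA, int ai, int aj)
    {
        return sA->pA + ai + aj * sA->m;
    }

    // address of element (ai, aj) in panel-major storage
    static double *el_panelmaj(struct d_strmat_sketch *sA, int ai, int aj)
    {
        int ps  = sA->ps;
        int sda = sA->cn;
        int air = ai & (ps - 1);  // ai % ps, ps being a power of two
        return sA->pA + (ai - air) * sda + air + aj * ps;
    }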


Page 27: BLASFEO - University of Texas at Austin

Some tests on Intel Core i7 4800MQ (Haswell)

[Plots: dgemm_nt, dgemm_nn, sgemm_nt, sgemm_nn, Gflops vs. matrix size n (up to 300); curves: HP, RF, OB, MKL, Eig.]


Page 28: BLASFEO - University of Texas at Austin

Some tests on Intel Core i7 4800MQ (Haswell)

[Plots: dpotrf_l, dgetrf, spotrf_l, dgelqf, Gflops vs. matrix size n (up to 300); curves: HP, RF, OB, MKL (Eig where available).]


Page 29: BLASFEO - University of Texas at Austin

Some tests on ARM Cortex A15 and ARM Cortex A57

[Plots: dgemm_nt and sgemm_nt on each of the two CPUs, Gflops vs. matrix size n (up to 300); curves: HP, RF, OB (Eig where available).]


Page 30: BLASFEO - University of Texas at Austin

BLASFEO + PLASMA on Intel Core i7 4800MQ

[Plots: dpotrf_l, Gflops vs. matrix size n (up to 4000); curves: MKL, PL_64+MKL, PL_64+HP, PL_128+MKL, PL_128+HP.]


Page 31: BLASFEO - University of Texas at Austin

The end (for now)

- thank you for your attention!

- questions?
