stochastic optimization of floating-point programs with tunable … · 2016. 11. 24. · overview...

32
Stochastic Optimization of Floating-point Programs with Tunable Precision eric schkufza, rahul sharma, alex aiken stanford university

Upload: others

Post on 20-Aug-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Stochastic Optimization of Floating-point Programs with Tunable Precision

eric schkufza, rahul sharma, alex aiken stanford university

Page 2: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Overview• Floating-point kernels: Nearly ubiquitous in high-

performance computing !

• Getting the best performance: Requires full exploitation of the dark corners of an instruction set and how it interacts with machine resources !

• But we can do even better: Give up on full precision for applications that don’t need it

2

Page 3: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Overview

• Stochastic Optimization: Automated method for generating high-performance kernels that trade a reduction in precision for improvements in code size and runtime

3

Page 4: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Goals and Related Work• Mixed fixed / floating-point kernels: Build on

previous work [schkufza 13, sharma 13] !

• Improved approximation: Give user finer-grained control [lam 13, rubio-gonzalez 13] !

• Assembly level: Direct optimization, no further trusted optimization kernel [sidiroglou-douskos 11] !

• Optimize kernels: Improve end-to-end performance automatically [chilimbi 10, zou 12]

4

Page 5: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Examplevmovddup %xmm0, %xmm0 vmulpd (%rdi), %xmm0, %xmm2 vroundpd $0, %xmm2, %xmm2 vmulpd 0x10(%rdi), %xmm2, %xmm1 vcvtpd2dqx %xmm2, %xmm3 vmulpd 0x20(%rdi), %xmm2, %xmm2 vaddpd %xmm1, %xmm0, %xmm1 vmovapd 0x30(%rdi), %xmm0 vpaddd 0x40(%rdi), %xmm3, %xmm3 vpslld $20, %xmm3, %xmm3 vpshufd $114, %xmm3, %xmm3 vaddpd %xmm2, %xmm1, %xmm1 vmulpd 0x50(%rdi), %xmm1, %xmm2 vaddpd 0x60(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0x70(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0x80(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0x90(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0xa0(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0xb0(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0xc0(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0xd0(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0xe0(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0xf0(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd %xmm0, %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm1 vaddpd %xmm0, %xmm1, %xmm0 vmulpd %xmm3, %xmm0, %xmm0 retq

5

vmulpd (%rdi), %xmm0, %xmm2 vroundpd $0xfffffffffffffffe, %xmm2, %xmm2 vcvttpd2dq %xmm2, %xmm3 vmulpd 0x10(%rdi), %xmm2, %xmm1 vlddqu 0x90(%rdi), %xmm2 vaddpd %xmm1, %xmm0, %xmm1 vmulpd %xmm1, %xmm2, %xmm2 vpaddw 0x40(%rdi), %xmm3, %xmm3 vmovapd 0x30(%rdi), %xmm0 vpsllq $0x14, %xmm3, %xmm3 vaddpd 0xa0(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0xb0(%rdi), %xmm2, %xmm2 vmulsd %xmm1, %xmm2, %xmm2 vaddpd 0xc0(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0xd0(%rdi), %xmm2, %xmm2 vmulpd %xmm1, %xmm2, %xmm2 vaddpd 0xe0(%rdi), %xmm2, %xmm2 vmulsd %xmm1, %xmm2, %xmm2 vaddpd 0xf0(%rdi), %xmm2, %xmm2 vmulsd %xmm1, %xmm2, %xmm2 vaddsd %xmm0, %xmm2, %xmm2 vpshufd $0x3, %xmm3, %xmm3 vmulpd %xmm1, %xmm2, %xmm1 vaddsd %xmm0, %xmm1, %xmm0 vmulpd %xmm3, %xmm0, %xmm0 retq

STOKEexpert

Page 6: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Optimization Notes

• Smaller kernel: 38 LOC reduced to 28 LOC

• Performance improvement: 57% kernel speedup, produces a 27% overall task speedup

• Highly specialized: Obeys application-specific error bound requirements for all inputs between -3.0 and 0

6

Page 7: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Intuition

7

Abstract space of programs (rewrites): Blue regions contains points that are bit-wise equivalent to the target

STOKE

expert

gcc -O3

Page 8: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Intuition

8

Few opportunities: Floating-point instruction set semantics complicate optimization

STOKE

expert

gcc -…

Page 9: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Intuition

9

Relax correctness: Most applications don’t require full precision results for all possible inputs

STOKE

gcc -…

expert

Page 10: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Intuition

10

New opportunities: Gain access to high performance optimizations that were previous inaccessible

STOKE

gcc -…

expert

Page 11: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Intuition

11

Non-trivial task: Semantics preserving transformations ill-suited to this task; prefer stochastic search [schkufza 13]

STOKE

gcc -…

expert

Page 12: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

What’s Required

1. Search Procedure: Markov Chain Monte Carlo Sampling

2. Cost Function: Formal encoding of rewrite quality to guide search; should balance competing constraints of precision and performance

12

Page 13: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

MCMC Sampling• Widely used: For many domains, the only known

tractable solution method for high dimensional irregular search spaces [andrieu 03][chenney 00]

• Guarantees: Draws samples in proportion to their value; higher value points are sampled more frequently

• No claim of convergence rate: Works well in practice for the benchmarks that we consider

13

Page 14: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

MCMC SamplingAlgorithm:!

1. Select an initial program

2. Repeat (millions to billions of times)

A. Propose a random change and evaluate cost

B. If ( decreased ) { accept }

C. If ( increased ) { maybe accept anyway }

14

Page 15: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Transformations

15

original!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

Page 16: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Transformations

16

original!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

insert!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!imulq rsi, rdx!...

Page 17: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Transformations

17

original!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

insert!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!imulq rsi, rdx!...

delete!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

Page 18: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Transformations

18

original!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

insert!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!imulq rsi, rdx!...

delete!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

instruction!!...!movl ecx, ecx!shrq 32, rsi!salq 16, rcx!movq rcx, rax!movl edx, edx!imulq r9, rax!...

Page 19: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Transformations

19

original!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

insert!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!imulq rsi, rdx!...

delete!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

instruction!!...!movl ecx, ecx!shrq 32, rsi!salq 16, rcx!movq rcx, rax!movl edx, edx!imulq r9, rax!...

opcode!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!subl edx, edx!imulq r9, rax!...

Page 20: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Transformations

20

original!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

insert!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!imulq rsi, rdx!...

delete!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

instruction!!...!movl ecx, ecx!shrq 32, rsi!salq 16, rcx!movq rcx, rax!movl edx, edx!imulq r9, rax!...

opcode!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!subl edx, edx!imulq r9, rax!...

operand!!...!movl ecx, ecx!shrq 32, rcx!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

Page 21: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Transformations

21

original!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

insert!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!imulq rsi, rdx!...

delete!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

instruction!!...!movl ecx, ecx!shrq 32, rsi!salq 16, rcx!movq rcx, rax!movl edx, edx!imulq r9, rax!...

opcode!!...!movl ecx, ecx!shrq 32, rsi!andl ff, r9d!movq rcx, rax!subl edx, edx!imulq r9, rax!...

operand!!...!movl ecx, ecx!shrq 32, rcx!andl ff, r9d!movq rcx, rax!movl edx, edx!imulq r9, rax!...

swap!!...!movl ecx, ecx!movl edx, edx!andl ff, r9d!movq rcx, rax!shrq 32, rsi !imulq r9, rax!...

Page 22: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Precision Function

22

… … … …

… … … …

… … … …

execute

testcase target rewrite

Comparison: Execute target and rewrite on identical state and compare live outputs

Page 23: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Precision Function• Uncertainty in Last Place: Measures the distance

between a real number and the closest representable floating-point value

• Widely Used: Most scientific applications measure precision in terms of ULPs; 0.5 is the gold standard but very expensive to obtain, most settle for 1 to 2

23

4 e

11

Page 24: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Verification

24

• Guarantees: How do we know that an optimization is “precise enough” for all possible inputs?

!• Decision Procedures: Bit-blasting techniques don’t

scale beyond a few lines of code; can’t handle mixed fixed- and floating-point codes [darulova 14]

!• Abstract Interpretation: Can’t prove even bitwise

equality in many common cases; can’t handle mixed fixed- and floating-point codes [haller 12]

!• Bottom Line: No standard techniques can do this

Page 25: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Validation• Relaxed problem statement: Claim with high

confidence that there is no input that will cause an error in excess of maximum user bound

• Error function: Run original and optimized code on identical inputs and measure ULP Error

error(x) = ULP(eval(target, x), eval(rewrite, x))

• Goal: Show max error is below user-defined bound

25

Page 26: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Validation

26

Search: Use MCMC sampling to search for test case inputs that maximize the error function; just one transform

… … … …

… … … …

live-in

test case test case’

Page 27: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Termination

27

• Mixing Tests: Statistical tests [geweke 92] for producing a high-confidence guarantee that search has sampled uniformly across the domain of a function

• Therefore: High confidence that search has discovered all local maxima implies high confidence that it has discovered the global maximum

0

50

100

not converged0

50

100

converged

Page 28: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Evaluation

• Numeric simulation: S3D, 3-dimensional direct numerical solver for HCCI combustion

• C library: libimf, Intel’s hand-written implementation of the C numerics library math.h

• Computer graphics: A ray tracer

28

Page 29: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

libimf

29

ref0 6 12 18

0

20

40

60

80

1

3

5

Minimum ULP (10^)

LO

C

Spe

ed

up (

x1

00%

)

ref0 6 12 18

0

20

40

60

80

1

2

3

Minimum ULP (10^)

LO

C

Spe

ed

up

(x1

00

%)

ref0 6 12 18

0

30

60

90

120

1

4

7

Minimum ULP (10^)

LO

C

Spe

ed

up (

x1

00%

)

-3.14 -1.57 0 1.57 3.141e0

1e4

1e8

1e12

1e16

1e20

Input Argument

UL

P E

rro

r

0 2 4 6 8 101E+0

1E+4

1E+8

1E+12

1E+16

1E+20

Input Argument

UL

P E

rro

r

0 2 4 6 8 101e0

1e4

1e8

1e12

1e16

1e20

Input Argument

UL

P E

rro

r

-1.6 -0.8 0 0.8 1.61E+0

1E+4

1E+8

1E+12

1E+16

1E+20

Input Argument

UL

P E

rro

rsin() log() tan()

Page 30: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

• Bit-wise correct: 30% speedup optimizing vector kernels

• Depth of field blur: Random perturbations made to viewing camera angle; 6% speedup by relaxing precision requirements

!

!

Ray Tracer

30bit-wise correct relaxed precision error pixels (white)

Page 31: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Overfitting

31bit-wise correct over-relaxed precision error pixels (white)

• Over-relaxation: If minimum tolerable error exceeds the variance of random perturbations, they are removed altogether

• Faster still: But depth of field blur has been removed

!

!

Page 32: Stochastic Optimization of Floating-point Programs with Tunable … · 2016. 11. 24. · Overview • Floating-point kernels: Nearly ubiquitous in high- performance computing ! •

Summary• Micro-optimization: For many interesting application

domains, once data movement is orchestrated correctly, even a single instruction can make a difference

• New approach: Use random search to experiment with imprecise intermediate optimizations

• New opportunities for optimization: Identify applications that can tolerate a loss of precision and produce code that is specialized to that domain

• Download: www.github.com/eschkufz/stoke-release

32