Automatically Adapting Programs for Mixed-Precision Floating-Point Computation



TRANSCRIPT

Page 1: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Automatically Adapting Programs

for Mixed-Precision Floating-Point Computation

Mike Lam and Jeff Hollingsworth

University of Maryland, College Park

Bronis de Supinski and Matt LeGendre

Lawrence Livermore National Lab

Page 2: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Background

• Floating point represents real numbers as ± significand × 2^exponent
  o Sign bit
  o Exponent
  o Significand ("mantissa" or "fraction")
• Finite precision
  o Single precision: 24 significand bits (~7 decimal digits)
  o Double precision: 53 significand bits (~16 decimal digits)
  o Introduces rounding error

[Figure: bit layouts. IEEE single precision: 1 sign bit, 8 exponent bits, 23 significand bits. IEEE double precision: 1 sign bit, 11 exponent bits, 52 significand bits.]
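As a quick illustration of the precision gap described above (not part of the original slides), the short C++ program below prints the same value rounded to 24 and to 53 significand bits; the single-precision result is accurate to roughly 7 decimal digits, the double-precision result to roughly 16.

    #include <cstdio>

    int main() {
        // The same real number rounded to 24 significand bits (float)
        // and to 53 significand bits (double).
        float  s = 2.0f / 3.0f;
        double d = 2.0  / 3.0;
        std::printf("single: %.17f\n", s);   // accurate to ~7 decimal digits
        std::printf("double: %.17f\n", d);   // accurate to ~16 decimal digits
        return 0;
    }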

Page 3: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Motivation

• Double precision is ubiquitous
  o Necessary for some computations
  o Lack of easy-to-use techniques for reasoning about precision
• Single precision is preferable
  o Faster computation
    - Tesla K20X: 2.95 TFlops (single) vs. 1.31 TFlops (double)
    - Intel Xeon Phi: 2.15 TFlops (single) vs. 1.07 TFlops (double)
    - Standard CPUs: 2x operations with SSE vector instructions
  o Reduced memory pressure
    - Up to 50% footprint reduction
    - Data movement is a bottleneck for some domains

Desire: balance speed (singles) with accuracy (doubles)

Page 4: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Mixed Precision

Mixed-precision linear solver algorithm:

   1: LU ← PA
   2: solve Ly = Pb
   3: solve Ux_0 = y
   4: for k = 1, 2, ... do
   5:   r_k ← b - Ax_{k-1}
   6:   solve Ly = Pr_k
   7:   solve Uz_k = y
   8:   x_k ← x_{k-1} + z_k
   9:   check for convergence
  10: end for

Red text (in the original slide) indicates steps performed in double precision; all other steps are single precision.

• Use double precision where necessary
• Use single precision where possible
• Nearly 2x speedups [Baboulin2008]
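A minimal, self-contained C++ sketch of this pattern (added here for illustration, not from the slides): it drops the permutation P for brevity and uses a naive LU factorization of a small diagonally dominant system. The factorization and the triangular solves run in single precision, while the residual r_k = b - Ax_{k-1} and the update x_k = x_{k-1} + z_k are computed in double precision, which is the usual split in this kind of mixed-precision refinement.

    #include <cstdio>
    #include <cmath>

    const int N = 3;

    // Factor A = LU in single precision, in place (no pivoting; fine for
    // the diagonally dominant test matrix below).
    void lu_factor(float A[N][N]) {
        for (int k = 0; k < N; ++k)
            for (int i = k + 1; i < N; ++i) {
                A[i][k] /= A[k][k];
                for (int j = k + 1; j < N; ++j)
                    A[i][j] -= A[i][k] * A[k][j];
            }
    }

    // Solve LUx = b in single precision using the packed factors.
    void lu_solve(float A[N][N], const float b[N], float x[N]) {
        float y[N];
        for (int i = 0; i < N; ++i) {              // forward substitution: Ly = b
            y[i] = b[i];
            for (int j = 0; j < i; ++j) y[i] -= A[i][j] * y[j];
        }
        for (int i = N - 1; i >= 0; --i) {         // back substitution: Ux = y
            x[i] = y[i];
            for (int j = i + 1; j < N; ++j) x[i] -= A[i][j] * x[j];
            x[i] /= A[i][i];
        }
    }

    int main() {
        double A[N][N] = {{4, 1, 0}, {1, 5, 2}, {0, 2, 6}};
        double b[N]    = {1, 2, 3};

        // Steps 1-3: factorization and first solve in single precision.
        float Af[N][N], bf[N], xf[N];
        for (int i = 0; i < N; ++i) {
            bf[i] = (float)b[i];
            for (int j = 0; j < N; ++j) Af[i][j] = (float)A[i][j];
        }
        lu_factor(Af);
        lu_solve(Af, bf, xf);

        double x[N];
        for (int i = 0; i < N; ++i) x[i] = xf[i];

        // Steps 4-10: refinement loop; residual and update in double precision.
        for (int iter = 0; iter < 10; ++iter) {
            double r[N], rnorm = 0.0;
            for (int i = 0; i < N; ++i) {                 // r = b - A*x  (double)
                r[i] = b[i];
                for (int j = 0; j < N; ++j) r[i] -= A[i][j] * x[j];
                rnorm = std::fmax(rnorm, std::fabs(r[i]));
            }
            if (rnorm < 1e-14) break;                     // convergence check
            float rf[N], zf[N];
            for (int i = 0; i < N; ++i) rf[i] = (float)r[i];
            lu_solve(Af, rf, zf);                         // correction in single
            for (int i = 0; i < N; ++i) x[i] += zf[i];    // update in double
        }
        for (int i = 0; i < N; ++i) std::printf("x[%d] = %.15f\n", i, x[i]);
        return 0;
    }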

Page 5: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Our Goal

Use automated analysis techniques to prototype mixed-precision variants and provide insight about a program's precision level requirements.

Page 6: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Framework

CRAFT: Configurable Runtime Analysis for Floating-point Tuning

• Static binary instrumentation
  o Parse binary on disk
  o Replace or augment floating-point instructions with new code
  o Rewrite modified binary
• Dynamic analysis
  o Run modified program on representative data set
  o Produce results and recommendations

Page 7: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Previous Work

• Cancellation detection [WHIST'11]
  o Reports loss of precision due to subtraction
  o Provides insight regarding numerical behavior
• Range tracking
  o Reports per-instruction min/max values
  o Provides insight regarding low dynamic ranges
• Mixed-precision variants
  o Replaces double-precision instructions and operands
  o Provides insight regarding precision-level sensitivity

Page 8: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Implementation

• In-place replacement
  o Narrowed focus: doubles → singles
  o In-place downcast conversion
  o Flag in the high bits to indicate replacement

[Figure: in-place downcast conversion. A double occupies a 64-bit slot; a replaced double keeps that slot, with the non-signalling NaN flag 0x7FF4DEAD in the high 32 bits and the down-cast single-precision value in the low 32 bits.]
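The C++ fragment below (added for illustration; not CRAFT's actual code) shows what this encoding looks like at the bit level: the down-cast single lives in the low 32 bits of the original 64-bit slot, and the 0x7FF4DEAD pattern in the high 32 bits makes the slot read as a NaN when interpreted as a double, marking it as replaced. It assumes a little-endian machine such as x86_64.

    #include <cstdint>
    #include <cstring>
    #include <cstdio>
    #include <cmath>

    int main() {
        // Down-cast a double and store it in place: the single-precision bits
        // go in the low 32 bits of the original 64-bit slot, and the 0x7FF4DEAD
        // flag from the slide goes in the high 32 bits.
        double original = 2.718281828459045;
        float narrowed = (float)original;

        uint32_t fbits;
        std::memcpy(&fbits, &narrowed, sizeof fbits);
        uint64_t slot = 0x7FF4DEAD00000000ull | fbits;

        // Reinterpreted as a double, the slot has an all-ones exponent and a
        // non-zero fraction, i.e. it reads as a NaN -- the marker that the
        // value has been replaced.
        double as_double;
        std::memcpy(&as_double, &slot, sizeof as_double);
        std::printf("reads as NaN (replaced)? %s\n", std::isnan(as_double) ? "yes" : "no");

        // Recover the stored single from the low word (low bytes come first
        // on little-endian x86_64).
        float recovered;
        std::memcpy(&recovered, &slot, sizeof recovered);
        std::printf("stored single: %.7f\n", recovered);   // ~2.7182817
        return 0;
    }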

Page 9: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd  0x601e38(%rax,%rbx,8) → %xmm0
  2  mulsd  -0x78(%rsp) * %xmm0 → %xmm0
  3  addsd  -0x4f02(%rip) + %xmm0 → %xmm0
  4  movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

Page 10: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd  0x601e38(%rax,%rbx,8) → %xmm0
  2  mulss  -0x78(%rsp) * %xmm0 → %xmm0
  3  addss  -0x4f02(%rip) + %xmm0 → %xmm0
  4  movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

(mulsd and addsd have been narrowed to their single-precision forms mulss and addss)

Page 11: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd  0x601e38(%rax,%rbx,8) → %xmm0
       check/replace -0x78(%rsp) and %xmm0
  2  mulss  -0x78(%rsp) * %xmm0 → %xmm0
       check/replace -0x4f02(%rip) and %xmm0
  3  addss  -0x4f02(%rip) + %xmm0 → %xmm0
  4  movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

Page 12: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Replacement Code

    push %rax
    push %rbx
    <for each input operand>
      <copy input into %rax>
      mov %rbx, 0xffffffff00000000
      and %rax, %rbx                 # extract high word
      mov %rbx, 0x7ff4dead00000000
      test %rax, %rbx                # check for flag
      je next                        # skip if replaced
      <copy input into %rax>
      cvtsd2ss %rax, %rax            # down-cast value
      or %rax, %rbx                  # set flag
      <copy %rax back into input>
    next:
    <next operand>
    pop %rbx
    pop %rax

    <replaced instruction>           # e.g. addsd => addss
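Below is a C++ rendering (added for illustration; hypothetical helper names, not CRAFT's source) of what the instrumented sequence above accomplishes for a replaced addsd: each 64-bit operand slot is checked, down-cast and tagged if it still holds a real double, and the operation then touches only the low 32 bits, so the flag in the destination's high word survives. Again this assumes a little-endian x86_64 layout.

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    static const uint64_t FLAG_MASK = 0xFFFFFFFF00000000ull;   // high word
    static const uint64_t REPLACED  = 0x7FF4DEAD00000000ull;   // flag from the slides

    // Per-operand check/replace: if the 64-bit slot still holds a real double,
    // down-cast it and tag the high word so later instructions can tell.
    static void check_replace(uint64_t *slot) {
        if ((*slot & FLAG_MASK) == REPLACED) return;    // already replaced: skip
        double d;
        std::memcpy(&d, slot, sizeof d);
        float f = (float)d;                             // the cvtsd2ss step
        uint32_t fbits;
        std::memcpy(&fbits, &f, sizeof fbits);
        *slot = REPLACED | fbits;                       // set flag, keep single
    }

    // "addsd src, dst" rewritten as "addss": both operands are checked and
    // replaced, the add uses only the low 32 bits, and the flag in the
    // destination's high word is preserved.
    static void mixed_add(uint64_t *dst, uint64_t *src) {
        check_replace(dst);
        check_replace(src);
        float a, b;
        std::memcpy(&a, dst, sizeof a);                 // low word of dst (little-endian)
        std::memcpy(&b, src, sizeof b);                 // low word of src
        float sum = a + b;                              // the addss step
        std::memcpy(dst, &sum, sizeof sum);             // write low word only
    }

    int main() {
        uint64_t x, y;
        double dx = 1.0 / 3.0, dy = 2.0 / 3.0;
        std::memcpy(&x, &dx, sizeof x);
        std::memcpy(&y, &dy, sizeof y);
        mixed_add(&x, &y);
        float result;
        std::memcpy(&result, &x, sizeof result);
        std::printf("single-precision sum: %.7f\n", result);   // ~1.0
        return 0;
    }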

Page 13: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Dyninst

• Binary analysis framework
  o Parses executable files (InstructionAPI & ParseAPI)
  o Inserts instrumentation (DyninstAPI)
  o Supports full binary modification (PatchAPI)
  o Rewrites binary executable files (SymtabAPI)

dyninst.org
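For context, here is a rough sketch of the DyninstAPI binary-rewriting flow that CRAFT builds on (added for illustration; it inserts a trivial entry counter rather than CRAFT's floating-point replacement snippets, and exact API details may vary across Dyninst versions).

    #include "BPatch.h"
    #include "BPatch_binaryEdit.h"
    #include "BPatch_image.h"
    #include "BPatch_function.h"
    #include "BPatch_point.h"
    #include <vector>
    #include <cstdio>

    int main(int argc, char **argv) {
        if (argc < 3) {
            std::fprintf(stderr, "usage: %s <input-binary> <output-binary>\n", argv[0]);
            return 1;
        }

        BPatch bpatch;

        // Parse the on-disk binary for rewriting (no running process involved).
        BPatch_binaryEdit *app = bpatch.openBinary(argv[1]);
        if (!app) return 1;
        BPatch_image *image = app->getImage();

        // Instrument the entry of "main" with a simple counter increment --
        // a stand-in for the floating-point replacement snippets CRAFT inserts.
        std::vector<BPatch_function *> funcs;
        image->findFunction("main", funcs);
        if (funcs.empty()) return 1;
        std::vector<BPatch_point *> *entries = funcs[0]->findPoint(BPatch_entry);
        if (!entries) return 1;

        BPatch_variableExpr *counter = app->malloc(*image->findType("int"));
        BPatch_arithExpr increment(BPatch_assign, *counter,
            BPatch_arithExpr(BPatch_plus, *counter, BPatch_constExpr(1)));
        app->insertSnippet(increment, *entries);

        // Write the modified binary back to disk.
        app->writeFile(argv[2]);
        return 0;
    }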

Page 14: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Block Editing

[Figure: block-editing diagram — the original double-precision instruction in a basic block is replaced with a double→single conversion; the block splits so that initialization and check/replace code can be inserted around it.]

Page 15: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Overhead

Benchmark (name.CLASS)   Average Overhead
bt.A                     50.6X
cg.A                      6.1X
ep.A                     13.8X
ft.A                     10.1X
lu.A                     28.5X
mg.A                     14.0X
sp.A                     19.5X

Page 16: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Binary Editing

[Figure: workflow diagram — CRAFT (the "mutator") takes the original double-precision binary (the "mutatee") plus a mixed-precision configuration (built with a parser & GUI) and produces a modified, mixed-precision binary.]

Page 17: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Configuration

Page 18: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Automated Search

• Manual mixed-precision replacement
  o Hard to use without intuition regarding potential replacements
• Automatic mixed-precision analysis
  o Try lots of configurations (empirical auto-tuning)
  o Test with user-defined verification routine and data set
  o Exploit program control structure: replace larger structures (modules, functions) first
  o If coarse-grained replacements fail, try finer-grained subcomponent replacements (see the sketch below)
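The sketch below (added for illustration; the Component tree and the verification hook are hypothetical stand-ins for CRAFT's real configuration and testing machinery) captures the coarse-to-fine idea: attempt to put a whole module or function in single precision, and only descend to its subcomponents when that coarse replacement fails the verification test.

    #include <cstdio>
    #include <string>
    #include <vector>

    // One node of the program's control structure: module, function, or instruction.
    struct Component {
        std::string name;
        std::vector<Component> children;
        bool use_single;
    };

    // Hypothetical stand-in for the real test step: rebuild the binary with this
    // component in single precision, run it on representative data, and apply
    // the user-defined verification routine.  Here we simply pretend that
    // anything named "residual" must stay in double precision.
    bool passes_verification(const Component &candidate) {
        if (candidate.name.find("residual") != std::string::npos) return false;
        for (const Component &c : candidate.children)
            if (!passes_verification(c)) return false;
        return true;
    }

    // Coarse-to-fine search: attempt the whole subtree first, and only try
    // finer-grained subcomponents when the coarse replacement fails.
    void search(Component &node) {
        if (passes_verification(node)) { node.use_single = true; return; }
        for (Component &child : node.children) search(child);
    }

    void report(const Component &node, int depth) {
        std::printf("%*s%s -> %s\n", depth * 2, "", node.name.c_str(),
                    node.use_single ? "single" : "double");
        if (!node.use_single)
            for (const Component &c : node.children) report(c, depth + 1);
    }

    int main() {
        Component prog;
        prog.name = "module:solver";
        prog.use_single = false;
        const char *fns[] = { "func:assemble", "func:residual", "func:update" };
        for (const char *f : fns) {
            Component c;
            c.name = f;
            c.use_single = false;
            prog.children.push_back(c);
        }
        search(prog);
        report(prog, 0);
        return 0;
    }

Running the sketch prints which components ended up single precision and which fell back to double after the coarse attempt failed.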


Page 19: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

System Overview

Page 20: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example Results

Page 21: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example Results

Page 22: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

NAS Results

Benchmark      Candidate      Configurations   Instructions Replaced
(name.CLASS)   Instructions   Tested           % Static    % Dynamic
bt.W           6,647          3,854            76.2        85.7
bt.A           6,682          3,832            75.9        81.6
cg.W             940            270            93.7         6.4
cg.A             934            229            94.7         5.3
ep.W             397            112            93.7        30.7
ep.A             397            113            93.1        23.9
ft.W             422             72            84.4         0.3
ft.A             422             73            93.6         0.2
lu.W           5,957          3,769            73.7        65.5
lu.A           5,929          2,814            80.4        69.4
mg.W           1,351            458            84.4        28.0
mg.A           1,351            456            84.1        24.4
sp.W           4,772          5,729            36.9        45.8
sp.A           4,821          5,044            51.9        43.0

Page 23: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

AMGmk Results

• Algebraic MultiGrid microkernel
  o Multigrid method is iterative and highly adaptive
  o Good candidate for replacement
• Automatic search
  o Complete conversion (100% replacement)
• Manually-rewritten version
  o Speedup: 175 sec to 95 sec (1.8X)
  o Conventional x86_64 hardware

Page 24: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

SuperLU Results

• Package for LU decomposition and linear solves
  o Reports final error residual (useful for thresholding)
  o Both single- and double-precision versions
• Verified manual conversion via automatic search
  o Used error from provided single-precision version as threshold
  o Final config matched single-precision profile (99.9% replacement)

Threshold   Instructions Replaced       Final Error
            % Static     % Dynamic
1.0e-03     99.1         99.9           1.59e-04
1.0e-04     94.1         87.3           4.42e-05
7.5e-05     91.3         52.5           4.40e-05
5.0e-05     87.9         45.2           3.00e-05
2.5e-05     80.3         26.6           1.69e-05
1.0e-05     75.4          1.6           7.15e-07
1.0e-06     72.6          1.6           4.7e-07

Page 25: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Future Work

• Memory-based analysis

• Case studies

• Search optimization


Page 26: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Conclusion

Automated binary modification can build prototype mixed-precision program variants.

Automated search can provide insight to focus mixed-precision implementation efforts.


Page 27: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Thank you!

sf.net/p/crafthpc
