Automatically Adapting Programs for Mixed-Precision Floating-Point Computation



TRANSCRIPT

Page 1: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Automatically Adapting Programs

for Mixed-Precision Floating-Point Computation

Mike Lam and Jeff Hollingsworth

University of Maryland, College Park

Bronis de Supinski and Matt LeGendre

Lawrence Livermore National Lab

Page 2: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Background

• Floating point represents real numbers as ± significand × 2^exponent
  o Sign bit
  o Exponent
  o Significand ("mantissa" or "fraction")
• Finite precision
  o Single precision: 24 significand bits (~7 decimal digits)
  o Double precision: 53 significand bits (~16 decimal digits)
  o Introduces rounding error

[Figure: bit layouts. IEEE single precision: 1 sign bit, 8 exponent bits, 23 significand bits. IEEE double precision: 1 sign bit, 11 exponent bits, 52 significand bits.]
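As a quick illustration of the precision gap described above (not part of the original slides), the short C++ program below prints the same value rounded to 24 and to 53 significand bits; the single-precision result is accurate to roughly 7 decimal digits, the double-precision result to roughly 16.

    #include <cstdio>

    int main() {
        // The same real number rounded to 24 significand bits (float)
        // and to 53 significand bits (double).
        float  s = 2.0f / 3.0f;
        double d = 2.0  / 3.0;
        std::printf("single: %.17f\n", s);   // accurate to ~7 decimal digits
        std::printf("double: %.17f\n", d);   // accurate to ~16 decimal digits
        return 0;
    }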

Page 3: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Motivation

• Double precision is ubiquitous
  o Necessary for some computations
  o Lack of easy-to-use techniques for reasoning about precision
• Single precision is preferable
  o Faster computation
    - Tesla K20X: 2.95 TFlops (single) vs. 1.31 TFlops (double)
    - Intel Xeon Phi: 2.15 TFlops (single) vs. 1.07 TFlops (double)
    - Standard CPUs: 2x operations with SSE vector instructions
  o Reduced memory pressure
    - Up to 50% footprint reduction
    - Data movement is a bottleneck for some domains

Desire: balance speed (singles) with accuracy (doubles)

Page 4: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Mixed Precision

Mixed-precision linear solver algorithm:

   1: LU ← PA
   2: solve Ly = Pb
   3: solve Ux_0 = y
   4: for k = 1, 2, ... do
   5:   r_k ← b - Ax_{k-1}
   6:   solve Ly = Pr_k
   7:   solve Uz_k = y
   8:   x_k ← x_{k-1} + z_k
   9:   check for convergence
  10: end for

Red text (in the original slide) indicates steps performed in double precision; all other steps are single precision.

• Use double precision where necessary
• Use single precision where possible
• Nearly 2x speedups [Baboulin2008]
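A minimal, self-contained C++ sketch of this pattern (added here for illustration, not from the slides): it drops the permutation P for brevity and uses a naive LU factorization of a small diagonally dominant system. The factorization and the triangular solves run in single precision, while the residual r_k = b - Ax_{k-1} and the update x_k = x_{k-1} + z_k are computed in double precision, which is the usual split in this kind of mixed-precision refinement.

    #include <cstdio>
    #include <cmath>

    const int N = 3;

    // Factor A = LU in single precision, in place (no pivoting; fine for
    // the diagonally dominant test matrix below).
    void lu_factor(float A[N][N]) {
        for (int k = 0; k < N; ++k)
            for (int i = k + 1; i < N; ++i) {
                A[i][k] /= A[k][k];
                for (int j = k + 1; j < N; ++j)
                    A[i][j] -= A[i][k] * A[k][j];
            }
    }

    // Solve LUx = b in single precision using the packed factors.
    void lu_solve(float A[N][N], const float b[N], float x[N]) {
        float y[N];
        for (int i = 0; i < N; ++i) {              // forward substitution: Ly = b
            y[i] = b[i];
            for (int j = 0; j < i; ++j) y[i] -= A[i][j] * y[j];
        }
        for (int i = N - 1; i >= 0; --i) {         // back substitution: Ux = y
            x[i] = y[i];
            for (int j = i + 1; j < N; ++j) x[i] -= A[i][j] * x[j];
            x[i] /= A[i][i];
        }
    }

    int main() {
        double A[N][N] = {{4, 1, 0}, {1, 5, 2}, {0, 2, 6}};
        double b[N]    = {1, 2, 3};

        // Steps 1-3: factorization and first solve in single precision.
        float Af[N][N], bf[N], xf[N];
        for (int i = 0; i < N; ++i) {
            bf[i] = (float)b[i];
            for (int j = 0; j < N; ++j) Af[i][j] = (float)A[i][j];
        }
        lu_factor(Af);
        lu_solve(Af, bf, xf);

        double x[N];
        for (int i = 0; i < N; ++i) x[i] = xf[i];

        // Steps 4-10: refinement loop; residual and update in double precision.
        for (int iter = 0; iter < 10; ++iter) {
            double r[N], rnorm = 0.0;
            for (int i = 0; i < N; ++i) {                 // r = b - A*x  (double)
                r[i] = b[i];
                for (int j = 0; j < N; ++j) r[i] -= A[i][j] * x[j];
                rnorm = std::fmax(rnorm, std::fabs(r[i]));
            }
            if (rnorm < 1e-14) break;                     // convergence check
            float rf[N], zf[N];
            for (int i = 0; i < N; ++i) rf[i] = (float)r[i];
            lu_solve(Af, rf, zf);                         // correction in single
            for (int i = 0; i < N; ++i) x[i] += zf[i];    // update in double
        }
        for (int i = 0; i < N; ++i) std::printf("x[%d] = %.15f\n", i, x[i]);
        return 0;
    }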

Page 5: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Our Goal

Use automated analysis techniques to prototype mixed-precision variants and provide insight about a program's precision level requirements.

Page 6: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Framework

CRAFT: Configurable Runtime Analysis for Floating-point Tuning

• Static binary instrumentation
  o Parse binary on disk
  o Replace or augment floating-point instructions with new code
  o Rewrite modified binary
• Dynamic analysis
  o Run modified program on representative data set
  o Produce results and recommendations

Page 7: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Previous Work

• Cancellation detection [WHIST'11]
  o Reports loss of precision due to subtraction
  o Provides insight regarding numerical behavior
• Range tracking
  o Reports per-instruction min/max values
  o Provides insight regarding low dynamic ranges
• Mixed-precision variants
  o Replaces double-precision instructions and operands
  o Provides insight regarding precision-level sensitivity

Page 8: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Implementation

• In-place replacement
  o Narrowed focus: doubles → singles
  o In-place downcast conversion
  o Flag in the high bits to indicate replacement

[Figure: in-place downcast conversion. A double occupies a 64-bit slot; a replaced double keeps that slot, with the non-signalling NaN flag 0x7FF4DEAD in the high 32 bits and the down-cast single-precision value in the low 32 bits.]
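The C++ fragment below (added for illustration; not CRAFT's actual code) shows what this encoding looks like at the bit level: the down-cast single lives in the low 32 bits of the original 64-bit slot, and the 0x7FF4DEAD pattern in the high 32 bits makes the slot read as a NaN when interpreted as a double, marking it as replaced. It assumes a little-endian machine such as x86_64.

    #include <cstdint>
    #include <cstring>
    #include <cstdio>
    #include <cmath>

    int main() {
        // Down-cast a double and store it in place: the single-precision bits
        // go in the low 32 bits of the original 64-bit slot, and the 0x7FF4DEAD
        // flag from the slide goes in the high 32 bits.
        double original = 2.718281828459045;
        float narrowed = (float)original;

        uint32_t fbits;
        std::memcpy(&fbits, &narrowed, sizeof fbits);
        uint64_t slot = 0x7FF4DEAD00000000ull | fbits;

        // Reinterpreted as a double, the slot has an all-ones exponent and a
        // non-zero fraction, i.e. it reads as a NaN -- the marker that the
        // value has been replaced.
        double as_double;
        std::memcpy(&as_double, &slot, sizeof as_double);
        std::printf("reads as NaN (replaced)? %s\n", std::isnan(as_double) ? "yes" : "no");

        // Recover the stored single from the low word (low bytes come first
        // on little-endian x86_64).
        float recovered;
        std::memcpy(&recovered, &slot, sizeof recovered);
        std::printf("stored single: %.7f\n", recovered);   // ~2.7182817
        return 0;
    }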

Page 9: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd  0x601e38(%rax,%rbx,8) → %xmm0
  2  mulsd  -0x78(%rsp) * %xmm0 → %xmm0
  3  addsd  -0x4f02(%rip) + %xmm0 → %xmm0
  4  movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

Page 10: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd  0x601e38(%rax,%rbx,8) → %xmm0
  2  mulss  -0x78(%rsp) * %xmm0 → %xmm0
  3  addss  -0x4f02(%rip) + %xmm0 → %xmm0
  4  movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

(mulsd and addsd have been narrowed to their single-precision forms mulss and addss)

Page 11: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

  1  movsd  0x601e38(%rax,%rbx,8) → %xmm0
       check/replace -0x78(%rsp) and %xmm0
  2  mulss  -0x78(%rsp) * %xmm0 → %xmm0
       check/replace -0x4f02(%rip) and %xmm0
  3  addss  -0x4f02(%rip) + %xmm0 → %xmm0
  4  movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

Page 12: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Replacement Code

    push %rax
    push %rbx
    <for each input operand>
      <copy input into %rax>
      mov %rbx, 0xffffffff00000000
      and %rax, %rbx                 # extract high word
      mov %rbx, 0x7ff4dead00000000
      test %rax, %rbx                # check for flag
      je next                        # skip if replaced
      <copy input into %rax>
      cvtsd2ss %rax, %rax            # down-cast value
      or %rax, %rbx                  # set flag
      <copy %rax back into input>
    next:
    <next operand>
    pop %rbx
    pop %rax

    <replaced instruction>           # e.g. addsd => addss
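Below is a C++ rendering (added for illustration; hypothetical helper names, not CRAFT's source) of what the instrumented sequence above accomplishes for a replaced addsd: each 64-bit operand slot is checked, down-cast and tagged if it still holds a real double, and the operation then touches only the low 32 bits, so the flag in the destination's high word survives. Again this assumes a little-endian x86_64 layout.

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    static const uint64_t FLAG_MASK = 0xFFFFFFFF00000000ull;   // high word
    static const uint64_t REPLACED  = 0x7FF4DEAD00000000ull;   // flag from the slides

    // Per-operand check/replace: if the 64-bit slot still holds a real double,
    // down-cast it and tag the high word so later instructions can tell.
    static void check_replace(uint64_t *slot) {
        if ((*slot & FLAG_MASK) == REPLACED) return;    // already replaced: skip
        double d;
        std::memcpy(&d, slot, sizeof d);
        float f = (float)d;                             // the cvtsd2ss step
        uint32_t fbits;
        std::memcpy(&fbits, &f, sizeof fbits);
        *slot = REPLACED | fbits;                       // set flag, keep single
    }

    // "addsd src, dst" rewritten as "addss": both operands are checked and
    // replaced, the add uses only the low 32 bits, and the flag in the
    // destination's high word is preserved.
    static void mixed_add(uint64_t *dst, uint64_t *src) {
        check_replace(dst);
        check_replace(src);
        float a, b;
        std::memcpy(&a, dst, sizeof a);                 // low word of dst (little-endian)
        std::memcpy(&b, src, sizeof b);                 // low word of src
        float sum = a + b;                              // the addss step
        std::memcpy(dst, &sum, sizeof sum);             // write low word only
    }

    int main() {
        uint64_t x, y;
        double dx = 1.0 / 3.0, dy = 2.0 / 3.0;
        std::memcpy(&x, &dx, sizeof x);
        std::memcpy(&y, &dy, sizeof y);
        mixed_add(&x, &y);
        float result;
        std::memcpy(&result, &x, sizeof result);
        std::printf("single-precision sum: %.7f\n", result);   // ~1.0
        return 0;
    }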

Page 13: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Dyninst

• Binary analysis framework
  o Parses executable files (InstructionAPI & ParseAPI)
  o Inserts instrumentation (DyninstAPI)
  o Supports full binary modification (PatchAPI)
  o Rewrites binary executable files (SymtabAPI)

dyninst.org
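For context, here is a rough sketch of the DyninstAPI binary-rewriting flow that CRAFT builds on (added for illustration; it inserts a trivial entry counter rather than CRAFT's floating-point replacement snippets, and exact API details may vary across Dyninst versions).

    #include "BPatch.h"
    #include "BPatch_binaryEdit.h"
    #include "BPatch_image.h"
    #include "BPatch_function.h"
    #include "BPatch_point.h"
    #include <vector>
    #include <cstdio>

    int main(int argc, char **argv) {
        if (argc < 3) {
            std::fprintf(stderr, "usage: %s <input-binary> <output-binary>\n", argv[0]);
            return 1;
        }

        BPatch bpatch;

        // Parse the on-disk binary for rewriting (no running process involved).
        BPatch_binaryEdit *app = bpatch.openBinary(argv[1]);
        if (!app) return 1;
        BPatch_image *image = app->getImage();

        // Instrument the entry of "main" with a simple counter increment --
        // a stand-in for the floating-point replacement snippets CRAFT inserts.
        std::vector<BPatch_function *> funcs;
        image->findFunction("main", funcs);
        if (funcs.empty()) return 1;
        std::vector<BPatch_point *> *entries = funcs[0]->findPoint(BPatch_entry);
        if (!entries) return 1;

        BPatch_variableExpr *counter = app->malloc(*image->findType("int"));
        BPatch_arithExpr increment(BPatch_assign, *counter,
            BPatch_arithExpr(BPatch_plus, *counter, BPatch_constExpr(1)));
        app->insertSnippet(increment, *entries);

        // Write the modified binary back to disk.
        app->writeFile(argv[2]);
        return 0;
    }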

Page 14: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Block Editing

[Figure: block-editing diagram — the original double-precision instruction in a basic block is replaced with a double→single conversion; the block splits so that initialization and check/replace code can be inserted around it.]

Page 15: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Overhead

Benchmark (name.CLASS)   Average Overhead
bt.A                     50.6X
cg.A                      6.1X
ep.A                     13.8X
ft.A                     10.1X
lu.A                     28.5X
mg.A                     14.0X
sp.A                     19.5X

Page 16: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Binary Editing

[Figure: workflow diagram — CRAFT (the "mutator") takes the original double-precision binary (the "mutatee") plus a mixed-precision configuration (built with a parser & GUI) and produces a modified, mixed-precision binary.]

Page 17: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Configuration

Page 18: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Automated Search

• Manual mixed-precision replacement
  o Hard to use without intuition regarding potential replacements
• Automatic mixed-precision analysis
  o Try lots of configurations (empirical auto-tuning)
  o Test with user-defined verification routine and data set
  o Exploit program control structure: replace larger structures (modules, functions) first
  o If coarse-grained replacements fail, try finer-grained subcomponent replacements (see the sketch below)
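The sketch below (added for illustration; the Component tree and the verification hook are hypothetical stand-ins for CRAFT's real configuration and testing machinery) captures the coarse-to-fine idea: attempt to put a whole module or function in single precision, and only descend to its subcomponents when that coarse replacement fails the verification test.

    #include <cstdio>
    #include <string>
    #include <vector>

    // One node of the program's control structure: module, function, or instruction.
    struct Component {
        std::string name;
        std::vector<Component> children;
        bool use_single;
    };

    // Hypothetical stand-in for the real test step: rebuild the binary with this
    // component in single precision, run it on representative data, and apply
    // the user-defined verification routine.  Here we simply pretend that
    // anything named "residual" must stay in double precision.
    bool passes_verification(const Component &candidate) {
        if (candidate.name.find("residual") != std::string::npos) return false;
        for (const Component &c : candidate.children)
            if (!passes_verification(c)) return false;
        return true;
    }

    // Coarse-to-fine search: attempt the whole subtree first, and only try
    // finer-grained subcomponents when the coarse replacement fails.
    void search(Component &node) {
        if (passes_verification(node)) { node.use_single = true; return; }
        for (Component &child : node.children) search(child);
    }

    void report(const Component &node, int depth) {
        std::printf("%*s%s -> %s\n", depth * 2, "", node.name.c_str(),
                    node.use_single ? "single" : "double");
        if (!node.use_single)
            for (const Component &c : node.children) report(c, depth + 1);
    }

    int main() {
        Component prog;
        prog.name = "module:solver";
        prog.use_single = false;
        const char *fns[] = { "func:assemble", "func:residual", "func:update" };
        for (const char *f : fns) {
            Component c;
            c.name = f;
            c.use_single = false;
            prog.children.push_back(c);
        }
        search(prog);
        report(prog, 0);
        return 0;
    }

Running the sketch prints which components ended up single precision and which fell back to double after the coarse attempt failed.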


Page 19: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

System Overview

Page 20: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example Results

Page 21: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Example Results

Page 22: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

NAS Results

Benchmark      Candidate      Configurations   Instructions Replaced
(name.CLASS)   Instructions   Tested           % Static    % Dynamic
bt.W           6,647          3,854            76.2        85.7
bt.A           6,682          3,832            75.9        81.6
cg.W             940            270            93.7         6.4
cg.A             934            229            94.7         5.3
ep.W             397            112            93.7        30.7
ep.A             397            113            93.1        23.9
ft.W             422             72            84.4         0.3
ft.A             422             73            93.6         0.2
lu.W           5,957          3,769            73.7        65.5
lu.A           5,929          2,814            80.4        69.4
mg.W           1,351            458            84.4        28.0
mg.A           1,351            456            84.1        24.4
sp.W           4,772          5,729            36.9        45.8
sp.A           4,821          5,044            51.9        43.0

Page 23: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

AMGmk Results

• Algebraic MultiGrid microkernel
  o Multigrid method is iterative and highly adaptive
  o Good candidate for replacement
• Automatic search
  o Complete conversion (100% replacement)
• Manually-rewritten version
  o Speedup: 175 sec to 95 sec (1.8X)
  o Conventional x86_64 hardware

Page 24: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

SuperLU Results

• Package for LU decomposition and linear solves
  o Reports final error residual (useful for thresholding)
  o Both single- and double-precision versions
• Verified manual conversion via automatic search
  o Used error from provided single-precision version as threshold
  o Final config matched single-precision profile (99.9% replacement)

Threshold   Instructions Replaced       Final Error
            % Static     % Dynamic
1.0e-03     99.1         99.9           1.59e-04
1.0e-04     94.1         87.3           4.42e-05
7.5e-05     91.3         52.5           4.40e-05
5.0e-05     87.9         45.2           3.00e-05
2.5e-05     80.3         26.6           1.69e-05
1.0e-05     75.4          1.6           7.15e-07
1.0e-06     72.6          1.6           4.7e-07

Page 25: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Future Work

• Memory-based analysis

• Case studies

• Search optimization


Page 26: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Conclusion

Automated binary modification can build prototype mixed-precision program variants.

Automated search can provide insight to focus mixed-precision implementation efforts.


Page 27: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation

Thank you!

sf.net/p/crafthpc
