Using FMA everywhere hurts performance / Cool one: Fused multiply accumulate (FMA)


Upload: trevor-welch

Post on 23-Dec-2015


TRANSCRIPT

Page 1: Using FMA everywhere hurts performance Cool one: Fused multiply accumulate (FMA)
Page 2:

Eric Brumer, Compiler Developer

Native code performance on modern CPUs: A changing landscape

4-587

Page 3:

New CPUs have new instructions: SSE, SSE2, SSE3, SSE4.1, SSE4.2, XOP, AVX, AVX2

Using FMA everywhere hurts performance

Cool one: Fused multiply accumulate (FMA)

Page 4:

// ... stuff ...
x[0] = y[0]; // 128-bit copy
x[1] = y[1]; // 128-bit copy
// ... stuff ...

“optimized”:

// ... stuff ...
x = y; // 256-bit copy
// ... stuff ...

This may cause huge slowdowns on some chips.

Page 5:

What?

Page 6:

Let’s dive deep into performance on modern CPUs.

Page 7:

Intel Pentium 3 (1999), AMD Athlon XP (2001): some 128-bit SIMD instructions; /arch:SSE; Visual C++ ?

Intel Pentium 4 (2001), AMD Athlon 64 (2003): 128-bit SIMD instructions; /arch:SSE2; Visual Studio .NET 2003

Intel Sandy Bridge (2011), AMD Bulldozer (2011): FP 256-bit SIMD instructions; /arch:AVX; Visual Studio 2010

Intel Haswell (2013), future AMD chip (?): 256-bit SIMD instructions (the new hotness!); /arch:AVX2; Visual Studio 2013 Update 2 (optimization support)

Page 8:

Agenda

#1  #2  #3  Recap

Page 9:

This should not be purely educational.

Page 10:

Profiling your code: the surefire way to
1. Measure direct speedups
2. Point to the right area for a slowdown

Page 11:

Visual Studio Performance Analyzer

Page 12:

AMD CodeXL

Page 13:

Intel VTune Amplifier XE

Page 14:

Agenda

#1  #2  #3  Recap

Page 15:

Fused multiply accumulate: FMA, FMAC, FUMAC

New CPU awesome-sauce

Scalar single:  _mm_fmadd_ss, _mm_fmsub_ss, _mm_fnmadd_ss, _mm_fnmsub_ss
Scalar double:  _mm_fmadd_sd, _mm_fmsub_sd, _mm_fnmadd_sd, _mm_fnmsub_sd
Packed single:  _mm_fmadd_ps, _mm_fmsub_ps, _mm_fnmadd_ps, _mm_fnmsub_ps
Packed double:  _mm_fmadd_pd, _mm_fmsub_pd, _mm_fnmadd_pd, _mm_fnmsub_pd
256-bit single: _mm256_fmadd_ps, _mm256_fmsub_ps, _mm256_fnmadd_ps, _mm256_fnmsub_ps
256-bit double: _mm256_fmadd_pd, _mm256_fmsub_pd, _mm256_fnmadd_pd, _mm256_fnmsub_pd

/arch:AVX2

Page 16:

Float mul = 5-cycle latency; float add = 3-cycle latency; float FMA = 5-cycle latency

New CPU awesome-sauce

Page 17:

Quiz: which is faster? (Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles)

No FMA: res = A*B + C        the multiply takes 5 cycles, then the dependent add takes 3: 8 cycles total
FMA:    res = FMADD A, B, C  one fused op: 5 cycles total

Page 18:

Quiz: which is faster? (Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles)

No FMA: res = A*B + C*D        the two multiplies run in parallel (5 cycles), then the add (3 cycles): 8 cycles total

FMA:    tmp = C*D
        res = FMADD A, B, tmp  the FMA depends on the multiply: 5 + 5 = 10 cycles total

Here the FMA version is slower.

Page 19:

Quiz: which is faster? (Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles)

No FMA:
  for (i=0; i<1000; i++)
    dp += A[i] * B[i];
The multiplies pipeline, but each iteration's add depends on the previous one, so the loop is limited by the 3-cycle add chain.

FMA, with two accumulators:
  t1 = t2 = 0;
  for (i=0; i<1000; i+=2) {
    t1 = FMADD A[i],   B[i],   t1
    t2 = FMADD A[i+1], B[i+1], t2
  }
  dp += t1 + t2;
Each FMA chain has 5-cycle latency, but the two chains are independent and overlap, so the loop sustains an accumulation every 2.5 cycles.

Page 20:

Recap:
FMA is a new CPU feature.
It’s hard to know when it is beneficial.
On AMD Steamroller, the cycle counts are different!
The C++ compiler will do it for you.

Page 21:

Agenda

#1  #2  #3  Recap

Page 22:

Highly optimized CPU code isn’t CPU code.

Page 23:

/arch:SSE2 and /arch:AVX provide 128-bit auto-vectorization. /arch:AVX2 provides 256-bit auto-vectorization.

Source loop:
  for (i=0; i<1000; i++)
    A[i] = B[i] + C[i];

128-bit auto-vectorization:
  for (i=0; i<1000; i+=4)
    xmm1 = vmovups B[i]
    xmm2 = vaddps xmm1, C[i]
    A[i] = vmovups xmm2

256-bit auto-vectorization:
  for (i=0; i<1000; i+=8)
    ymm1 = vmovups B[i]
    ymm2 = vaddps ymm1, C[i]
    A[i] = vmovups ymm2

Page 24:

Does 256-bit code run twice as fast as 128-bit code? Hint: no.

32-bit float scalar: Total 100 ms (CPU: 80 ms, Mem: 20 ms)
128-bit SIMD: Total 40 ms (CPU: 20 ms, Mem: 20 ms), a 2.5x speedup
256-bit SIMD: Total 30 ms (CPU: 10 ms, Mem: 20 ms), only a 1.3x further speedup: memory bound

Highly optimized CPU code isn’t CPU code.

Page 25:

Windows task manager won’t help you here

Page 26:
Page 27:

Agenda

#1  #2  #3  Recap

Page 28:

Performance bug: doing performance analysis on Eigen3
(courtesy of http://eigen.tuxfamily.org/)

A performance problem with a microbenchmark: compiled with /arch:AVX2 it runs 60% slower than with /arch:SSE2. Key: it also happens with /arch:AVX.

Page 29:

Performance bug

Compiled /arch:SSE2, run on Sandy Bridge: 8.5 ms
Compiled /arch:AVX,  run on Sandy Bridge: 8.5 ms   (enh)
Compiled /arch:SSE2, run on Haswell:      6.4 ms   (yay)
Compiled /arch:AVX,  run on Haswell:      10 ms    (this sucks)

Page 30:

Performance bug

struct MyData {
  Vector4D v1; // 4 floats
  Vector4D v2; // 4 floats
};

MyData x;
MyData y;

void func2() {
  // ... unrelated stuff ...
  func3();
  // ... unrelated stuff ...

  x.v1 = y.v1; // 128-bit copy
  x.v2 = y.v2; // 128-bit copy
}

The compiler combines the two 128-bit copies into one:

  x = y; // 256-bit copy

This caused the 60% slowdown on Haswell.

Page 31:

Performance bug

Page 32:

Performance bug

Page 33:

Store buffers are awesome: a relatively small “table” containing addresses and data (Haswell: 42 entries). Subsequent loads fetch data from the table.

Performance bug

Page 34:

BUT: some restrictions in store->load forwarding are deathly potholes.

Performance bug

Page 35:

Performance bug

void func1() {
  for (int i = 0; i < 10000; i++)
    func2();
}

void func2() {
  // ... unrelated stuff ...
  func3();
  // ... unrelated stuff ...

  x = y; // 256-bit copy
}

void func3() {
  // ... unrelated stuff ...
  ... = x.v1; // 128-bit load from x
}

vmovups YMMWORD PTR [rbx], ymm0
mov     rcx, QWORD PTR __$ArrayPad$[rsp]
xor     rcx, rsp
call    __security_check_cookie
add     rsp, 80                ; 00000050H
pop     rbx
ret     0

push    rbx
sub     rsp, 80                ; 00000050H
mov     rax, QWORD PTR __security_cookie
xor     rax, rsp
mov     QWORD PTR __$ArrayPad$[rsp], rax
mov     rbx, r8
mov     r8, rdx
mov     rdx, rcx
lea     rcx, QWORD PTR $T1[rsp]

mov     rax, rsp
mov     QWORD PTR [rax+8], rbx
mov     QWORD PTR [rax+16], rsi
push    rdi
sub     rsp, 144               ; 00000090H
vmovaps XMMWORD PTR [rax-24], xmm6
vmovaps XMMWORD PTR [rax-40], xmm7
vmovaps XMMWORD PTR [rax-56], xmm8
mov     rsi, r8
mov     rdi, rdx
mov     rbx, rcx
vmovaps XMMWORD PTR [rax-72], xmm9
vmovaps XMMWORD PTR [rax-88], xmm10
vmovaps XMMWORD PTR [rax-104], xmm11
vmovaps XMMWORD PTR [rax-120], xmm12
vmovdqu xmm12, XMMWORD PTR __xmm@0000000000000000
test    cl, 15
je      SHORT $LN14@run
lea     rdx, OFFSET FLAT:??_C@_1FM@KGHGDLJC@
lea     rcx, OFFSET FLAT:??_C@_1BIM@JPMPBING@
mov     r8d, 78                ; 0000004eH
call    _wassert
$LN14@run:
vmovupd xmm11, XMMWORD PTR [rsi]
vmovupd xmm10, XMMWORD PTR [rsi+16]

Page 36:

A “normal” optimization caused a store buffer pipeline stall. The store was nowhere near the load. It caused a 60% regression in a microbenchmark, and it only reproduces on Intel Haswell.

Performance bug

The performance landscape is changing. Get to know your profiler.

Page 37:

Agenda

#1  #2  #3  Recap

Page 38:

Intel Pentium 3 (1999), AMD Athlon XP (2001): some 128-bit SIMD instructions; /arch:SSE; Visual C++ ?

Intel Pentium 4 (2001), AMD Athlon 64 (2003): 128-bit SIMD instructions; /arch:SSE2; Visual Studio .NET 2003

Intel Sandy Bridge (2011), AMD Bulldozer (2011): FP 256-bit SIMD instructions; /arch:AVX; Visual Studio 2010

Intel Haswell (2013), future AMD chip (?): 256-bit SIMD instructions; /arch:AVX2; Visual Studio 2013 Update 2 (optimization support)

Page 39:

Recap
1. New CPU features
2. Highly optimized CPU code is not CPU code
3. Secondary effects of “normal” optimizations on powerful CPUs

Page 40:

Special Offers: Partner Program, for MSDN Ultimate subscribers
Go to http://msdn.Microsoft.com/specialoffers

Page 41:

Profile your code

Page 42:

Profile your code

Page 43:

Your Feedback is Important

Fill out an evaluation of this session and help shape future events.

Scan the QR code to evaluate this session on your mobile device.

You’ll also be entered into a daily prize drawing!

Page 44:

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.