TRANSCRIPT
Eric Brumer, Compiler Developer
Native Code Performance on Modern CPUs: A Changing Landscape
Session 4-587
New CPUs have new instructions: SSE, SSE2, SSE3, SSE4.1, SSE4.2, XOP, AVX, AVX2
Using FMA everywhere hurts performance
Cool one: Fused multiply accumulate (FMA)
// ... stuff ...
x[0] = y[0]; // 128b copy
x[1] = y[1]; // 128b copy
// ... stuff ...

“optimized” to:

// ... stuff ...
x = y; // 256b copy
// ... stuff ...

This may cause huge slowdowns on some chips. What?
Let’s dive deep into performance on modern CPUs.
Intel Pentium 3 (1999) / AMD Athlon XP (2001): some 128-bit SIMD instructions; /arch:SSE; Visual C++ ?
Intel Pentium 4 (2001) / AMD Athlon 64 (2003): 128-bit SIMD instructions; /arch:SSE2; Visual Studio .NET 2003
Intel Sandy Bridge (2011) / AMD Bulldozer (2011): FP 256-bit SIMD instructions; /arch:AVX; Visual Studio 2010
Intel Haswell (2013) / future AMD chip (?): 256-bit SIMD instructions; /arch:AVX2; Visual Studio 2013 Update 2 (optimization support). New hotness!
Agenda
#1: New CPU features
#2: Highly optimized CPU code is not CPU code
#3: Secondary effects of “normal” optimizations on powerful CPUs
Recap
This should not be purely educational.
Profiling your code
The surefire way to:
1. Measure direct speedups
2. Point to the right area for a slowdown
Visual Studio Performance Analyzer
AMD CodeXL
Intel VTune Amplifier XE
Agenda #1: New CPU features
Fused multiply accumulate: FMA, FMAC, FUMAC
New CPU awesome-sauce
_mm_fmadd_ss, _mm_fmsub_ss, _mm_fnmadd_ss, _mm_fnmsub_ss, _mm_fmadd_sd, _mm_fmsub_sd, _mm_fnmadd_sd, _mm_fnmsub_sd, _mm_fmadd_ps, _mm_fmsub_ps, _mm_fnmadd_ps, _mm_fnmsub_ps, _mm_fmadd_pd, _mm_fmsub_pd, _mm_fnmadd_pd, _mm_fnmsub_pd, _mm256_fmadd_ps, _mm256_fmsub_ps, _mm256_fnmadd_ps, _mm256_fnmsub_ps, _mm256_fmadd_pd, _mm256_fmsub_pd, _mm256_fnmadd_pd, _mm256_fnmsub_pd
/arch:AVX2
Float mul = 5 cycle latency
Float add = 3 cycle latency
Float FMA = 5 cycle latency
Quiz: which is faster? (Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles)

No FMA:
    res = A*B + C;
    (the multiply, 5 cycles, feeds the add, 3 cycles: 8 cycles of latency)

FMA:
    res = FMADD A, B, C
    (one fused instruction: 5 cycles of latency)

The FMA version wins.
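In intrinsics terms, the same contrast looks roughly like this; a minimal sketch assuming the <immintrin.h> intrinsics and an FMA-capable target (the function names are illustrative, not from the talk):

    #include <immintrin.h>

    // No FMA: a 5-cycle vmulps feeding a 3-cycle vaddps,
    // about 8 cycles of latency end to end.
    __m128 madd_separate(__m128 A, __m128 B, __m128 C) {
        return _mm_add_ps(_mm_mul_ps(A, B), C);
    }

    // FMA: one fused instruction, about 5 cycles of latency.
    // Requires an FMA-capable chip (/arch:AVX2 in Visual C++).
    __m128 madd_fused(__m128 A, __m128 B, __m128 C) {
        return _mm_fmadd_ps(A, B, C);
    }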
Quiz: which is faster? (Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles)

No FMA:
    res = A*B + C*D;
    (the two multiplies are independent and overlap, 5 cycles, then the add, 3 cycles: 8 cycles of latency)

FMA:
    tmp = C*D
    res = FMADD A, B, tmp
    (the FMA cannot start until tmp is ready: 5 + 5 = 10 cycles of latency)

This time the non-FMA version wins.
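The dependence chain is the whole story here. A hedged sketch of both versions with intrinsics (again illustrative names; latencies as quoted above):

    #include <immintrin.h>

    // No FMA: the two multiplies overlap (5 cycles), then one add
    // (3 cycles): about 8 cycles on the critical path.
    __m128 two_products_separate(__m128 A, __m128 B, __m128 C, __m128 D) {
        return _mm_add_ps(_mm_mul_ps(A, B), _mm_mul_ps(C, D));
    }

    // Forced FMA: the fused op cannot issue until C*D completes,
    // so the critical path is 5 + 5 = about 10 cycles.
    __m128 two_products_fused(__m128 A, __m128 B, __m128 C, __m128 D) {
        return _mm_fmadd_ps(A, B, _mm_mul_ps(C, D));
    }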
Quiz: which is faster?No FMA FMA
for (i=0; i<1000; i++) dp += A[i] * B[i];
t1 = t2 = 0; for (i=0; i<1000; i+=2) { t1 = FMADD A[i], B[i], t1 t2 = FMADD A[i+1], B[i+1], t2 } dp += t1 + t2
x
+
...
A[5] B[5]
dp
+
x
A[6] B[6] t1
...
A[6] B[6]
FMA
t2A[5]
FMA
B[5]
...
5 cycles3 cycles
5 cycles3 cycles
Mult = 5 cyclesAdd = 3 cyclesFMA = 5 cycles
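Written out with AVX2 intrinsics, the two-accumulator trick looks roughly like the following; a sketch that assumes the length is a multiple of 16 floats (remainder handling omitted) and that unaligned loads are acceptable:

    #include <immintrin.h>

    float dot_product(const float* A, const float* B, int n) {
        // Two independent accumulators hide the 5-cycle FMA latency:
        // each chain sees a dependent FMA only every other step.
        __m256 t1 = _mm256_setzero_ps();
        __m256 t2 = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 16) {
            t1 = _mm256_fmadd_ps(_mm256_loadu_ps(A + i),
                                 _mm256_loadu_ps(B + i), t1);
            t2 = _mm256_fmadd_ps(_mm256_loadu_ps(A + i + 8),
                                 _mm256_loadu_ps(B + i + 8), t2);
        }
        // Combine the chains, then reduce the 8 lanes to one float.
        __m256 t = _mm256_add_ps(t1, t2);
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(t),
                              _mm256_extractf128_ps(t, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }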
Recap
- FMA is a new CPU feature
- It is hard to know when it is beneficial
- On AMD Steamroller, the CPU cycle counts are different!
- The C++ compiler will do it for you
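On "the compiler will do it for you": under /arch:AVX2 the compiler can contract a*b + c into a single FMA on its own, and std::fma from <cmath> asks for one explicitly. A hedged sketch:

    #include <cmath>

    // The compiler may turn this into a vfmadd under /arch:AVX2
    // (whether it does also depends on the floating-point mode).
    float maybe_fused(float a, float b, float c) {
        return a * b + c;
    }

    // Explicit fused multiply-add: the product is not rounded before
    // the add, so the result can differ in the last bit from a*b + c.
    float always_fused(float a, float b, float c) {
        return std::fma(a, b, c);
    }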
Agenda #2: Highly optimized CPU code is not CPU code
Highly optimized CPU code isn’t CPU code.
/arch:SSE2 and /arch:AVX provide 128-bit auto-vectorization
/arch:AVX2 provides 256-bit auto-vectorization
256-bit vectorization

Source loop:
    for (i=0; i<1000; i++)
        A[i] = B[i] + C[i];

Auto-vectorized, 128-bit (/arch:SSE2 or /arch:AVX):
    for (i=0; i<1000; i+=4)
        xmm1 = vmovups B[i]
        xmm2 = vaddps xmm1, C[i]
        A[i] = vmovups xmm2

Auto-vectorized, 256-bit (/arch:AVX2):
    for (i=0; i<1000; i+=8)
        ymm1 = vmovups B[i]
        ymm2 = vaddps ymm1, C[i]
        A[i] = vmovups ymm2
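Hand-written with intrinsics instead of relying on the vectorizer, the 256-bit loop is roughly the following; a sketch assuming n is a multiple of 8 (so no remainder loop) and unaligned accesses:

    #include <immintrin.h>

    // A[i] = B[i] + C[i], eight floats per iteration: the intrinsic
    // equivalent of the vmovups / vaddps / vmovups sequence above.
    void add_arrays(float* A, const float* B, const float* C, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 b = _mm256_loadu_ps(B + i);
            __m256 c = _mm256_loadu_ps(C + i);
            _mm256_storeu_ps(A + i, _mm256_add_ps(b, c));
        }
    }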
Does 256-bit code run twice as fast as 128-bit code? Hint: no.
32-bit float scalar:  Total 100 ms  (CPU: 80 ms, Mem: 20 ms)
128-bit SIMD:         Total  40 ms  (CPU: 20 ms, Mem: 20 ms)   2.5x speedup
256-bit SIMD:         Total  30 ms  (CPU: 10 ms, Mem: 20 ms)   1.3x speedup: memory bound
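Spelled out (a back-of-the-envelope addition, not on the slide): only the CPU portion shrinks as the vectors get wider, so the fixed 20 ms of memory time caps the gain.

    scalar:   80 ms CPU + 20 ms Mem = 100 ms
    128-bit:  20 ms CPU + 20 ms Mem =  40 ms   (100/40 = 2.5x)
    256-bit:  10 ms CPU + 20 ms Mem =  30 ms   (40/30 ≈ 1.3x)
    limit:     0 ms CPU + 20 ms Mem =  20 ms   (even an infinitely fast CPU tops out at 2x over 128-bit)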
Highly optimized CPU code isn't CPU code. Windows Task Manager won't help you here.
Agenda #3: Secondary effects of “normal” optimizations
Doing performance analysis on Eigen3 (courtesy of http://eigen.tuxfamily.org/)

Performance bug
- Performance problem with a microbenchmark
- Compile /arch:AVX2 and it runs 60% slower than /arch:SSE2
- Key: it also happens with /arch:AVX
Performance bug

Compiled /arch:SSE2, run on Sandy Bridge:  8.5 ms  (enh)
Compiled /arch:AVX,  run on Sandy Bridge:  8.5 ms  (enh)
Compiled /arch:SSE2, run on Haswell:       6.4 ms  (yay)
Compiled /arch:AVX,  run on Haswell:       10 ms   (this sucks)
Performance bug

struct MyData {
    Vector4D v1; // 4 floats
    Vector4D v2; // 4 floats
};

MyData x;
MyData y;

void func2() {
    // ... unrelated stuff ...
    func3();
    // ... unrelated stuff ...
    x.v1 = y.v1; // 128-bit copy
    x.v2 = y.v2; // 128-bit copy
}

The compiler merges those two 128-bit copies into a single copy:

    x = y; // 256-bit copy

This caused the 60% slowdown on Haswell.
Performance bug
Store buffers are awesome:
- A relatively small “table” containing addresses & data (Haswell: 42 entries)
- Subsequent loads fetch data from the table
BUT: there are some restrictions in store-to-load forwarding, and those are the deathly potholes.
Performance bug

void func1() {
    for (int i = 0; i < 10000; i++)
        func2();
}

void func2() {
    // ... unrelated stuff ...
    func3();
    // ... unrelated stuff ...
    x = y; // 256-bit copy
}

void func3() {
    // ... unrelated stuff ...
    ... = x.v1; // 128-bit load from x
}
[On-screen disassembly: the 256-bit store vmovups YMMWORD PTR [rbx], ymm0 sits at the end of one function; dozens of unrelated instructions later (prologue, register saves, an assert check), the next function issues 128-bit loads such as vmovupd xmm11, XMMWORD PTR [rsi] from the just-stored memory.]
Performance bug
- A “normal” optimization caused a store buffer pipeline stall
- The store was nowhere near the load
- It caused a 60% regression in a microbenchmark
- It only reproduces on Intel Haswell
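To poke at this kind of stall in isolation, you can reproduce the shape of the pattern with explicit wide stores and narrower reloads. A hedged microbenchmark sketch, not the talk's code; exactly which store/load width and offset combinations fail to forward varies by microarchitecture:

    #include <immintrin.h>

    // Stand-ins for MyData x and y: eight floats = two Vector4Ds.
    alignas(32) static float x[8], y[8];

    // The "optimized" form: one 256-bit store, like x = y.
    void copy_256() {
        _mm256_store_ps(x, _mm256_load_ps(y));
    }

    // The original form: two 128-bit stores,
    // like x.v1 = y.v1; x.v2 = y.v2.
    void copy_2x128() {
        _mm_store_ps(x,     _mm_load_ps(y));
        _mm_store_ps(x + 4, _mm_load_ps(y + 4));
    }

    // The 128-bit reload in func3. Whether the still-pending store can
    // forward its data to this load, or instead stalls the pipeline,
    // depends on the widths and offsets involved: the store-to-load
    // forwarding restriction that bit Eigen3 on Haswell.
    float reload_v1() {
        return _mm_cvtss_f32(_mm_load_ps(x));
    }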
The performance landscape is changing. Get to know your profiler.
Agenda: Recap
Recap
1. New CPU features
2. Highly optimized CPU code is not CPU code
3. Secondary effects of “normal” optimizations on powerful CPUs
Profile your code.