TRANSCRIPT
Eric Brumer, Compiler Developer
Native Code Performance on Modern CPUs: A Changing Landscape
Session 4-587
New CPUs have new instructions: SSE, SSE2, SSE3, SSE4.1, SSE4.2, XOP, AVX, AVX2
Using FMA everywhere hurts performance
Cool one: Fused multiply accumulate (FMA)
// ... stuff ...
x[0] = y[0]; // 128b copy
x[1] = y[1]; // 128b copy
// ... stuff ...

“optimized” to:

// ... stuff ...
x = y; // 256b copy
// ... stuff ...

This may cause huge slowdowns on some chips. What?
Let’s dive deep into performance on modern CPUs.
Intel Pentium 3 (1999) / AMD Athlon XP (2001): some 128-bit SIMD instructions; /arch:SSE; Visual C++ ?
Intel Pentium 4 (2001) / AMD Athlon 64 (2003): 128-bit SIMD instructions; /arch:SSE2; Visual Studio .NET 2003
Intel Sandy Bridge (2011) / AMD Bulldozer (2011): FP 256-bit SIMD instructions; /arch:AVX; Visual Studio 2010
Intel Haswell (2013) / future AMD chip (?): 256-bit SIMD instructions; /arch:AVX2; Visual Studio 2013 Update 2 (optimization support). New hotness!
Agenda
#1: New CPU features
#2: Highly optimized CPU code is not CPU code
#3: Secondary effects of “normal” optimizations on powerful CPUs
Recap
This should not be purely educational.
Profiling your code
The surefire way to:
1. Measure direct speedups
2. Point to the right area for a slowdown
Visual Studio Performance Analyzer
AMD CodeXL
Intel VTune Amplifier XE
Agenda #1: New CPU features
Fused multiply accumulate: FMA, FMAC, FUMAC
New CPU awesome-sauce
_mm_fmadd_ss, _mm_fmsub_ss, _mm_fnmadd_ss, _mm_fnmsub_ss, _mm_fmadd_sd, _mm_fmsub_sd, _mm_fnmadd_sd, _mm_fnmsub_sd, _mm_fmadd_ps, _mm_fmsub_ps, _mm_fnmadd_ps, _mm_fnmsub_ps, _mm_fmadd_pd, _mm_fmsub_pd, _mm_fnmadd_pd, _mm_fnmsub_pd, _mm256_fmadd_ps, _mm256_fmsub_ps, _mm256_fnmadd_ps, _mm256_fnmsub_ps, _mm256_fmadd_pd, _mm256_fmsub_pd, _mm256_fnmadd_pd, _mm256_fnmsub_pd
/arch:AVX2
Float mul = 5 cycle latency
Float add = 3 cycle latency
Float FMA = 5 cycle latency
Quiz: which is faster? (Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles)

No FMA:
    res = A*B + C;
    (the multiply, 5 cycles, feeds the add, 3 cycles: 8 cycles of latency)

FMA:
    res = FMADD A, B, C
    (one fused instruction: 5 cycles of latency)

The FMA version wins.
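In intrinsics terms, the same contrast looks roughly like this; a minimal sketch assuming the <immintrin.h> intrinsics and an FMA-capable target (the function names are illustrative, not from the talk):

    #include <immintrin.h>

    // No FMA: a 5-cycle vmulps feeding a 3-cycle vaddps,
    // about 8 cycles of latency end to end.
    __m128 madd_separate(__m128 A, __m128 B, __m128 C) {
        return _mm_add_ps(_mm_mul_ps(A, B), C);
    }

    // FMA: one fused instruction, about 5 cycles of latency.
    // Requires an FMA-capable chip (/arch:AVX2 in Visual C++).
    __m128 madd_fused(__m128 A, __m128 B, __m128 C) {
        return _mm_fmadd_ps(A, B, C);
    }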
Quiz: which is faster? (Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles)

No FMA:
    res = A*B + C*D;
    (the two multiplies are independent and overlap, 5 cycles, then the add, 3 cycles: 8 cycles of latency)

FMA:
    tmp = C*D
    res = FMADD A, B, tmp
    (the FMA cannot start until tmp is ready: 5 + 5 = 10 cycles of latency)

This time the non-FMA version wins.
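The dependence chain is the whole story here. A hedged sketch of both versions with intrinsics (again illustrative names; latencies as quoted above):

    #include <immintrin.h>

    // No FMA: the two multiplies overlap (5 cycles), then one add
    // (3 cycles): about 8 cycles on the critical path.
    __m128 two_products_separate(__m128 A, __m128 B, __m128 C, __m128 D) {
        return _mm_add_ps(_mm_mul_ps(A, B), _mm_mul_ps(C, D));
    }

    // Forced FMA: the fused op cannot issue until C*D completes,
    // so the critical path is 5 + 5 = about 10 cycles.
    __m128 two_products_fused(__m128 A, __m128 B, __m128 C, __m128 D) {
        return _mm_fmadd_ps(A, B, _mm_mul_ps(C, D));
    }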
Quiz: which is faster?No FMA FMA
for (i=0; i<1000; i++) dp += A[i] * B[i];
t1 = t2 = 0; for (i=0; i<1000; i+=2) { t1 = FMADD A[i], B[i], t1 t2 = FMADD A[i+1], B[i+1], t2 } dp += t1 + t2
x
+
...
A[5] B[5]
dp
+
x
A[6] B[6] t1
...
A[6] B[6]
FMA
t2A[5]
FMA
B[5]
...
5 cycles3 cycles
5 cycles3 cycles
Mult = 5 cyclesAdd = 3 cyclesFMA = 5 cycles
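Written out with AVX2 intrinsics, the two-accumulator trick looks roughly like the following; a sketch that assumes the length is a multiple of 16 floats (remainder handling omitted) and that unaligned loads are acceptable:

    #include <immintrin.h>

    float dot_product(const float* A, const float* B, int n) {
        // Two independent accumulators hide the 5-cycle FMA latency:
        // each chain sees a dependent FMA only every other step.
        __m256 t1 = _mm256_setzero_ps();
        __m256 t2 = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 16) {
            t1 = _mm256_fmadd_ps(_mm256_loadu_ps(A + i),
                                 _mm256_loadu_ps(B + i), t1);
            t2 = _mm256_fmadd_ps(_mm256_loadu_ps(A + i + 8),
                                 _mm256_loadu_ps(B + i + 8), t2);
        }
        // Combine the chains, then reduce the 8 lanes to one float.
        __m256 t = _mm256_add_ps(t1, t2);
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(t),
                              _mm256_extractf128_ps(t, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }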
Recap
- FMA is a new CPU feature
- It is hard to know when it is beneficial
- On AMD Steamroller, the CPU cycle counts are different!
- The C++ compiler will do it for you
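On "the compiler will do it for you": under /arch:AVX2 the compiler can contract a*b + c into a single FMA on its own, and std::fma from <cmath> asks for one explicitly. A hedged sketch:

    #include <cmath>

    // The compiler may turn this into a vfmadd under /arch:AVX2
    // (whether it does also depends on the floating-point mode).
    float maybe_fused(float a, float b, float c) {
        return a * b + c;
    }

    // Explicit fused multiply-add: the product is not rounded before
    // the add, so the result can differ in the last bit from a*b + c.
    float always_fused(float a, float b, float c) {
        return std::fma(a, b, c);
    }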
Agenda #2: Highly optimized CPU code is not CPU code
Highly optimized CPU code isn’t CPU code.
/arch:SSE2 and /arch:AVX provide 128-bit auto-vectorization
/arch:AVX2 provides 256-bit auto-vectorization
256-bit vectorization

Source loop:
    for (i=0; i<1000; i++)
        A[i] = B[i] + C[i];

Auto-vectorized, 128-bit (/arch:SSE2 or /arch:AVX):
    for (i=0; i<1000; i+=4)
        xmm1 = vmovups B[i]
        xmm2 = vaddps xmm1, C[i]
        A[i] = vmovups xmm2

Auto-vectorized, 256-bit (/arch:AVX2):
    for (i=0; i<1000; i+=8)
        ymm1 = vmovups B[i]
        ymm2 = vaddps ymm1, C[i]
        A[i] = vmovups ymm2
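Hand-written with intrinsics instead of relying on the vectorizer, the 256-bit loop is roughly the following; a sketch assuming n is a multiple of 8 (so no remainder loop) and unaligned accesses:

    #include <immintrin.h>

    // A[i] = B[i] + C[i], eight floats per iteration: the intrinsic
    // equivalent of the vmovups / vaddps / vmovups sequence above.
    void add_arrays(float* A, const float* B, const float* C, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 b = _mm256_loadu_ps(B + i);
            __m256 c = _mm256_loadu_ps(C + i);
            _mm256_storeu_ps(A + i, _mm256_add_ps(b, c));
        }
    }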
Does 256-bit code run twice as fast as 128-bit code? Hint: no.
32-bit float scalar:  Total 100 ms  (CPU: 80 ms, Mem: 20 ms)
128-bit SIMD:         Total  40 ms  (CPU: 20 ms, Mem: 20 ms)   2.5x speedup
256-bit SIMD:         Total  30 ms  (CPU: 10 ms, Mem: 20 ms)   1.3x speedup: memory bound
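Spelled out (a back-of-the-envelope addition, not on the slide): only the CPU portion shrinks as the vectors get wider, so the fixed 20 ms of memory time caps the gain.

    scalar:   80 ms CPU + 20 ms Mem = 100 ms
    128-bit:  20 ms CPU + 20 ms Mem =  40 ms   (100/40 = 2.5x)
    256-bit:  10 ms CPU + 20 ms Mem =  30 ms   (40/30 ≈ 1.3x)
    limit:     0 ms CPU + 20 ms Mem =  20 ms   (even an infinitely fast CPU tops out at 2x over 128-bit)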
Highly optimized CPU code isn't CPU code. Windows Task Manager won't help you here.
Agenda #3: Secondary effects of “normal” optimizations
Doing performance analysis on Eigen3 (courtesy of http://eigen.tuxfamily.org/)

Performance bug
- Performance problem with a microbenchmark
- Compile /arch:AVX2 and it runs 60% slower than /arch:SSE2
- Key: it also happens with /arch:AVX
Performance bug

Compiled /arch:SSE2, run on Sandy Bridge:  8.5 ms  (enh)
Compiled /arch:AVX,  run on Sandy Bridge:  8.5 ms  (enh)
Compiled /arch:SSE2, run on Haswell:       6.4 ms  (yay)
Compiled /arch:AVX,  run on Haswell:       10 ms   (this sucks)
Performance bug

struct MyData {
    Vector4D v1; // 4 floats
    Vector4D v2; // 4 floats
};

MyData x;
MyData y;

void func2() {
    // ... unrelated stuff ...
    func3();
    // ... unrelated stuff ...
    x.v1 = y.v1; // 128-bit copy
    x.v2 = y.v2; // 128-bit copy
}

The compiler merges those two 128-bit copies into a single copy:

    x = y; // 256-bit copy

This caused the 60% slowdown on Haswell.
Performance bug
Store buffers are awesome:
- A relatively small “table” containing addresses & data (Haswell: 42 entries)
- Subsequent loads fetch data from the table
BUT: there are some restrictions in store-to-load forwarding, and those are the deathly potholes.
Performance bug

void func1() {
    for (int i = 0; i < 10000; i++)
        func2();
}

void func2() {
    // ... unrelated stuff ...
    func3();
    // ... unrelated stuff ...
    x = y; // 256-bit copy
}

void func3() {
    // ... unrelated stuff ...
    ... = x.v1; // 128-bit load from x
}
[On-screen disassembly: the 256-bit store vmovups YMMWORD PTR [rbx], ymm0 sits at the end of one function; dozens of unrelated instructions later (prologue, register saves, an assert check), the next function issues 128-bit loads such as vmovupd xmm11, XMMWORD PTR [rsi] from the just-stored memory.]
Performance bug
- A “normal” optimization caused a store buffer pipeline stall
- The store was nowhere near the load
- It caused a 60% regression in a microbenchmark
- It only reproduces on Intel Haswell
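To poke at this kind of stall in isolation, you can reproduce the shape of the pattern with explicit wide stores and narrower reloads. A hedged microbenchmark sketch, not the talk's code; exactly which store/load width and offset combinations fail to forward varies by microarchitecture:

    #include <immintrin.h>

    // Stand-ins for MyData x and y: eight floats = two Vector4Ds.
    alignas(32) static float x[8], y[8];

    // The "optimized" form: one 256-bit store, like x = y.
    void copy_256() {
        _mm256_store_ps(x, _mm256_load_ps(y));
    }

    // The original form: two 128-bit stores,
    // like x.v1 = y.v1; x.v2 = y.v2.
    void copy_2x128() {
        _mm_store_ps(x,     _mm_load_ps(y));
        _mm_store_ps(x + 4, _mm_load_ps(y + 4));
    }

    // The 128-bit reload in func3. Whether the still-pending store can
    // forward its data to this load, or instead stalls the pipeline,
    // depends on the widths and offsets involved: the store-to-load
    // forwarding restriction that bit Eigen3 on Haswell.
    float reload_v1() {
        return _mm_cvtss_f32(_mm_load_ps(x));
    }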
The performance landscape is changing. Get to know your profiler.
Agenda: Recap
Recap
1. New CPU features
2. Highly optimized CPU code is not CPU code
3. Secondary effects of “normal” optimizations on powerful CPUs
Profile your code.