1 tips and tricks: visual c++ 2005 optimization best practices kang su gatlin tlnl04 program manager...
TRANSCRIPT
1
Tips and Tricks: Visual C++ 2005 Optimization Best Practices
Kang Su Gatlin
TLNL04
Program Manager
Visual C++
Microsoft Corporation
2
6 Tips/Best Practices To Help Any C++ Dev Write Faster Code
Managed + Unmanaged
  1. Pick the right level of optimization
  2. Add instant parallelism
Unmanaged
  3. Disambiguate memory
  4. Use intrinsics
Managed
  5. Avoid double thunks
  6. Speed app startup time
3
1. Pick the Right Level Of Optimization
Builds from the Lab
If at all possible use Profile-Guided Optimization
  Only available unmanaged
  More on this next slide
If not, use Whole Program Optimization (/GL)
  Available managed and unmanaged
After that we recommend
  /O2 (optimize for speed) for hot functions/files
  /O1 (optimize for size) for the rest
Other switches to use for maximum speed
  /Gy
  /OPT:REF,ICF (good size win on 64bit)
  /fp:fast
  /arch:SSE2 (will not work on downlevel architectures)
Debug Symbols Are NOT Only for Debug Builds
  Executable size and codegen are NOT affected by this
  It’s all in the PDB file
  Always building debug symbols will make life easier
  Make sure you use /OPT:REF,ICF, don’t use /ZI, and use /INCREMENTAL:NO
4
Next-Gen Optimizations Today
Profile Guided Optimization
The next level beyond Whole Program Optimization
Static compilers can’t answer everything
We get 20-50% improvement on large server applications that we ship
Current support is unmanaged only
  if(a < b)
    foo();
  else
    baz();
Should we inline foo()?
  for(i = 0; i < count; ++i)
    bar();
Should we unroll this loop?
5
Profile Guided Optimization
1. Compile with /GL: Source -> Object files
2. Link with /LTCG:PGI: Object files -> Instrumented Image + PGD file
3. Run Scenarios on the Instrumented Image: Output + Profile data
4. Link with /LTCG:PGO: Object files + Profile data -> Optimized Image
There is throughput impact
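The four steps above can be sketched for a single source file. This is only an illustration: the file name hot.cpp and the function count_negatives are hypothetical, and the build lines (shown as comments) use /LTCG:PGINSTRUMENT and /LTCG:PGOPTIMIZE, the spelled-out forms of the /LTCG:PGI and /LTCG:PGO switches named on the slide.

```cpp
// Assumed build sequence for a hypothetical hot.cpp that contains this
// function plus a main() driver:
//
//   cl /c /O2 /GL hot.cpp               (1) compile for whole-program optimization
//   link /LTCG:PGINSTRUMENT hot.obj     (2) instrumented image + PGD file
//   hot.exe                             (3) run representative scenarios
//   link /LTCG:PGOPTIMIZE hot.obj       (4) relink using the collected profile

#include <cstddef>

// Hypothetical workload. How often the branch is taken depends on the input
// data, which a static compiler cannot see; the profile from step 3 records it.
int count_negatives(const int* data, std::size_t n) {
    int negatives = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < 0)
            ++negatives;
    }
    return negatives;
}
```

The scenarios run in step 3 should resemble real usage, since the optimizer tunes inlining and layout for whatever the profile shows.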
6
What PGO Does And Does Not Do
PGO does
  Optimizations galore
    Speed/Size Determination
    Switch expansion
    Better inlining decisions
    Function/basic block layout
    Virtual call speculation
    Partial inlining
  Optimize within a single image
  Merging and weighting of multiple scenarios
PGO does not
  No probing assembly language (inline or otherwise)
  No optimizations across DLLs
  No data layout optimization
7
PGO Compilation in Visual C++ 2005
8
2. Add Instant Parallelism
Just add OpenMP Pragmas!
OpenMP is a popular API for multithreaded programs
  Born from the HPC community
It consists of a set of simple #pragmas and runtime routines
Most of the value comes from parallelizing large loops with no loop dependencies
Visual C++ 2005 implements the full OpenMP 2.0 standard
  Full unmanaged and /clr managed support
See the PDC issue of MSDN magazine for an article on OpenMP
9
OpenMP Parallelization
  void test(int first, int last) {
    #pragma omp parallel for
    for (int i = first; i <= last; ++i) {
      a[i] = b[i] * c[i];
    }
  }
Each iteration is independent; order of execution does not matter

Serial version:
  if(x < 0)
    a = foo(x);
  else
    a = x + 5;
  b = bat(y);
  c = baz(x + y);
  j = a*b+c;

With parallel sections:
  #pragma omp parallel sections
  {
    #pragma omp section
    if(x < 0)
      a = foo(x);
    else
      a = x + 5;
    #pragma omp section
    b = bat(y);
    #pragma omp section
    c = baz(x + y);
  }
  j = a*b+c;
Assignments to ‘a’, ‘b’, and ‘c’ are independent
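A self-contained sketch of both patterns from this slide follows. The helper bodies for foo, bat, and baz are made-up stand-ins (the slide does not define them), and the function names multiply and combine are illustrative. The code assumes compilation with /openmp on Visual C++ (or -fopenmp on GCC/Clang); without the switch the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <vector>

// Element-wise product: each iteration writes a distinct a[i], so the
// iterations can run in any order on any thread.
void multiply(const std::vector<int>& b, const std::vector<int>& c,
              std::vector<int>& a) {
    const int n = static_cast<int>(a.size());  // OpenMP 2.0 requires a signed loop index
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];
    }
}

// Hypothetical stand-ins for the slide's foo, bat, and baz.
static int foo(int x) { return -x; }
static int bat(int y) { return y * 2; }
static int baz(int z) { return z + 1; }

// The three assignments touch different variables, so each one can be its
// own section. The combination step runs after the implicit barrier.
int combine(int x, int y) {
    int a = 0, b = 0, c = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            if (x < 0) a = foo(x); else a = x + 5;
        }
        #pragma omp section
        { b = bat(y); }
        #pragma omp section
        { c = baz(x + y); }
    }
    return a * b + c;
}
```

Sections only pay off when each one does enough work to outweigh the cost of starting and synchronizing threads; the tiny stand-ins here exist only to make the sketch compile.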
10
OpenMP Case Study
Panorama Factory by Smoky City Design
Top-rated image stitching application
Added multithreading with OpenMP in Visual C++ 2005 Beta2
  Used 102 instances of #pragma omp *
Extremely impressive Results…
  Stitching together several large images
  Dual processor, dual core x64 machine
11
Panorama Factory Speed Up Using OpenMP
[Chart. X axis: Number of Threads (1 to 4). Y axis: Speed Up Relative to Single-Threaded Performance (0 to 3.5). Series: Speed Up including I/O; Speed Up not including I/O.]
12
3. Disambiguate Memory
Programmer knows a and b never overlap
  void copy8(int * a, int * b) {
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[0];
    a[3] = b[1];
    a[4] = b[0];
    a[5] = b[1];
    a[6] = b[0];
    a[7] = b[1];
  }
Generated code (ecx = a, eax = b):
  mov edx, DWORD PTR [eax]
  mov DWORD PTR [ecx], edx
  mov edx, DWORD PTR [eax+4]
  mov DWORD PTR [ecx+4], edx
  mov edx, DWORD PTR [eax]
  mov DWORD PTR [ecx+8], edx
  mov edx, DWORD PTR [eax+4]
  mov DWORD PTR [ecx+12], edx
  mov edx, DWORD PTR [eax]
  mov DWORD PTR [ecx+16], edx
  mov edx, DWORD PTR [eax+4]
  mov DWORD PTR [ecx+20], edx
  mov edx, DWORD PTR [eax]
  mov DWORD PTR [ecx+24], edx
  mov eax, DWORD PTR [eax+4]
  mov DWORD PTR [ecx+28], eax
13
Aliasing And Memory Disambiguation
Aliasing is when one object can be used as an alias to another object
If compiler can NOT prove that an object does not alias then it MUST assume it can
How can we address some of these problems?
  1. Avoid taking address of an object.
  2. Avoid taking address of a function.
  3. Avoid using global variables. Statics are preferable.
  4. Use __restrict, __declspec(noalias), and __declspec(restrict) when possible.
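A small sketch of why the compiler has to be conservative. The function add_twice is a made-up example, not from the talk: its result depends on whether the two pointers refer to the same object, so the compiler cannot keep *b in a register across the first store unless it can prove the pointers are distinct.

```cpp
// Adds *b to *a twice. If a and b point to the same int, the first store
// changes *b, so the second statement must reload it from memory.
void add_twice(int* a, const int* b) {
    *a += *b;
    *a += *b;
}
```

With distinct objects holding 1 and 1 the result is 3; when both pointers refer to the same object holding 1 the result is 4. Both outcomes are valid behavior, which is exactly why the reload cannot be optimized away without extra information from the programmer.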
14
__restrict – A compiler hint
Programmer knows a and b don’t overlap
  void copy8(int * __restrict a, int * b) {
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[0];
    a[3] = b[1];
    a[4] = b[0];
    a[5] = b[1];
    a[6] = b[0];
    a[7] = b[1];
  }
Generated code (eax = a, edx = b):
  mov ecx, DWORD PTR [edx]
  mov edx, DWORD PTR [edx+4]
  mov DWORD PTR [eax], ecx
  mov DWORD PTR [eax+4], edx
  mov DWORD PTR [eax+8], ecx
  mov DWORD PTR [eax+12], edx
  mov DWORD PTR [eax+16], ecx
  mov DWORD PTR [eax+20], edx
  mov DWORD PTR [eax+24], ecx
  mov DWORD PTR [eax+28], edx
15
__declspec(restrict)
Tells the compiler that the function returns an unaliased pointer
  Only applicable to functions
This is a promise the programmer makes to the compiler
  If this promise is violated the compiler may generate bad code
The CRT uses this decoration, e.g., malloc, calloc, etc…
  __declspec(restrict) void *malloc(int size);
16
__declspec(noalias)
Tells the compiler that the function is a semi-pure function
  Only references locals, arguments, and first-level indirections of arguments
This is a promise the programmer makes to the compiler
  If this promise is violated the compiler may generate bad code
  __declspec(noalias) void isElement(Tree *t, Element e);
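A sketch of how the two function decorations might be applied in user code. The macro names RETURNS_UNALIASED and NOALIAS and the functions alloc_floats and contains are hypothetical; the macros exist only so the sketch also compiles on compilers that lack __declspec, where they expand to nothing.

```cpp
#include <cstddef>
#include <cstdlib>

// __declspec is Microsoft-specific; elsewhere the decorations are dropped.
#if defined(_MSC_VER)
#define RETURNS_UNALIASED __declspec(restrict)
#define NOALIAS __declspec(noalias)
#else
#define RETURNS_UNALIASED
#define NOALIAS
#endif

// Promise: the returned pointer refers to fresh storage that no other
// pointer in the program aliases.
RETURNS_UNALIASED float* alloc_floats(std::size_t count) {
    return static_cast<float*>(std::malloc(count * sizeof(float)));
}

// Promise: reads only its arguments and memory reached by one level of
// indirection through 'values'; touches no global state.
NOALIAS bool contains(const int* values, int count, int key) {
    for (int i = 0; i < count; ++i) {
        if (values[i] == key)
            return true;
    }
    return false;
}
```

Both decorations are promises, not checks: if alloc_floats handed out a pointer into a shared pool, or contains consulted a global, the optimizer would be entitled to generate code that misbehaves.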
17
4. Use Intrinsics
Simply represented as functions to the programmer
  _mm_load_pd(double const*);
Compilers understand these as primitives
Allows the user to get right at the hardware w/o using asm
Almost anything you can do in assembly
  interlock, memory fences, cache control, SIMD
The key to things such as vectorization and lock-free programming
You can use intrinsics in a file compiled /clr, but the function(s) will be compiled as unmanaged
Intrinsics are consumed by PGO and our optimizer
  Inline asm is not
Documentation for intrinsics is much better in Visual C++ 2005
  [Visual Studio 8]\VC\include\intrin.h
18
Matrix Addition With Intrinsics
  void MatMatAdd(Matrix &a, Matrix &b, Matrix &c) {
    for(int i = 0; i < a.m_rows; ++i)
      for(int j = 0; j < a.m_cols; j++)
        c[i][j] = a[i][j] + b[i][j];
  }

  #include <intrin.h>
  void MatMatAddVect(Matrix &a, Matrix &b, Matrix &c) {
    __m128 aSIMD, bSIMD, cSIMD;
    for(int i = 0; i < a.m_rows; ++i)
      for(int j = 0; j < a.m_cols; j += 4) {
        aSIMD = _mm_load_ps(&a[i][j]);
        bSIMD = _mm_load_ps(&b[i][j]);
        cSIMD = _mm_add_ps(aSIMD, bSIMD);
        _mm_store_ps(&c[i][j], cSIMD);
      }
  }
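The slide's vectorized loop relies on two things it does not state: _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses, and stepping j by 4 assumes the column count is a multiple of 4. The sketch below relaxes both assumptions on plain float arrays. The function name add_arrays and the HAVE_SSE macro are illustrative, and the Matrix class from the slide is replaced by flat buffers because its definition is not shown.

```cpp
#include <cstddef>

// Use SSE where the compiler advertises it; otherwise fall back to scalar code.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// c[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
#if HAVE_SSE
    // Four floats per iteration; the unaligned forms accept any address.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail for the last n % 4 elements (or everything without SSE).
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

When the data is known to be 16-byte aligned, the aligned forms used on the slide are the better choice.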
19
Spin-Lock With Intrinsics
  #include <intrin.h>
  #include <windows.h>

  void EnterSpinLock(volatile long &lock) {
    while(_InterlockedCompareExchange(&lock, 1, 0) != 0)
      Sleep(0);
  }

  void ExitSpinLock(volatile long &lock) {
    lock = 0;
  }
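For readers without the Windows headers, the same compare-and-swap idea can be sketched with std::atomic. This is a swapped-in technique, not what the talk shows: std::atomic arrived in C++11, well after Visual C++ 2005, and the function names below are illustrative.

```cpp
#include <atomic>
#include <thread>

// Spin until the lock word changes from 0 (free) to 1 (held) under our hand.
void enter_spin_lock(std::atomic<long>& lock) {
    long expected = 0;
    // On failure compare_exchange_strong overwrites 'expected' with the
    // current value, so it has to be reset before the next attempt.
    while (!lock.compare_exchange_strong(expected, 1, std::memory_order_acquire)) {
        expected = 0;
        std::this_thread::yield();  // plays the role of Sleep(0) on the slide
    }
}

// Release the lock; the release ordering publishes writes made while holding it.
void exit_spin_lock(std::atomic<long>& lock) {
    lock.store(0, std::memory_order_release);
}
```

The explicit acquire and release orderings spell out what the slide's version leaves to the intrinsic and to the compiler's treatment of volatile.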
20
5. Avoid Double-Thunks
Thunks are functions used to transition from managed to unmanaged (and vice-versa)
  Managed Code: UnmanagedFunc();
    -> Managed To Unmanaged Thunk ->
  Unmanaged Code: UnmanagedFunc() { … }
Thunks are a part of life… but sometimes we can have Double Thunks…
21
Double Thunking
From managed to managed only
Indirect calls
  Function pointers and virtual functions
  Is the callee a managed or unmanaged entry point?
__declspec(dllexport)
  No current mechanism to export functions as managed entry points
  Managed Code: ManagedFunc();
    -> Managed To Unmanaged Thunk -> Unmanaged To Managed Thunk ->
  Managed Code: ManagedFunc() { … }
22
How To Fix Double Thunking
Indirect Functions (including Virtual Funcs)
  Compile with /clr:pure
  Use __clrcall
__declspec(dllexport)
  Wrap functions in a managed class, and then #using the object file
23
Using __clrcall To Improve Performance
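A minimal sketch of the function-pointer case, assuming compilation with /clr. The names CLRCALL, BinaryOp, add, and apply are hypothetical. The macro keys off _MANAGED, which Visual C++ predefines under /clr, so the same file still compiles as ordinary native C++ (where no thunks are involved at all).

```cpp
// Under /clr, mark the entry point as managed-only; otherwise expand to nothing.
#if defined(_MANAGED)
#define CLRCALL __clrcall
#else
#define CLRCALL
#endif

// A function-pointer type whose targets are known to be managed entry
// points, so an indirect call through it needs no thunk pair.
typedef int (CLRCALL *BinaryOp)(int, int);

int CLRCALL add(int a, int b) { return a + b; }

int apply(BinaryOp op, int a, int b) {
    return op(a, b);  // managed-to-managed call when built with /clr
}
```

The trade-off is that a __clrcall function can no longer be called from unmanaged code, so the decoration suits functions that are only ever reached from managed callers.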
24
6. Speed App Startup Time
No one likes to wait for an app to start-up
There is still some time associated with loading CLR
In some apps you may have non-CLR paths
Only load the CLR when you need to
  Use DelayLoading technology in the linker
  If the EXE is compiled /clr then we will always load the CLR
25
Delay Loading The CLR
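One possible project layout for this tip, sketched under stated assumptions: the managed code lives in a separate DLL built with /clr, the EXE itself is built without /clr, and the linker's /DELAYLOAD option (with delayimp.lib) defers loading that DLL until the first call into it. The names ManagedPart.dll, ShowManagedReport, and run are hypothetical.

```cpp
// Assumed layout (illustrative names):
//   ManagedPart.dll   built with /clr, exports ShowManagedReport()
//   app.exe           built WITHOUT /clr, linked with
//                     /DELAYLOAD:ManagedPart.dll delayimp.lib
// With that layout the DLL, and with it the CLR, would be loaded on the
// first call to ShowManagedReport() rather than at process startup.

// Stand-in body so this sketch compiles on its own; in the layout above
// this function would live in ManagedPart.dll.
int ShowManagedReport() { return 42; }

int run(bool wants_report) {
    if (!wants_report)
        return 0;                // non-CLR path: the managed DLL is never touched
    return ShowManagedReport();  // first call triggers the delayed load
}
```

Users who never reach the managed path then pay no CLR startup cost, which is the point of the tip.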
26
Summary Of Best Practices
Managed + Unmanaged
  1. Use PGO for unmanaged and WPO for managed…
  2. OpenMP can ease multithreaded development.
Unmanaged
  3. Make it easier for the compiler to track pointers.
  4. Intrinsics give the ability to get to the metal.
Managed
  5. Know where your double thunks are and fix.
  6. Delay load the CLR to improve startup.
Large and ongoing investment in managed and unmanaged C++ code
27
Resources
Visual C++ Dev Center
  http://msdn.microsoft.com/visualc
  This is the place to go for all our news and whitepapers
[email protected]@microsoft.com
http://blogs.msdn.com/kangsu
Must See Talks
  TLN309 C++: Future Directions in Language Innovation with Herb Sutter (Friday 10:30am)
28
© 2005 Microsoft Corporation. All rights reserved.This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.