where the wild things are - benchmarking and micro-optimisations
Premature Optimization
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
- Donald Knuth
Profiling Tools
• ANTS Performance Profiler - Redgate
• dotTrace & dotMemory - JetBrains
• PerfView - Microsoft (free)
• Visual Studio Profiling Tools (Ultimate, Premium or Professional)
• MiniProfiler - Stack Overflow (free)
static void Profile(int iterations, Action action)
{
    action();     // warm up
    GC.Collect(); // clean up

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < iterations; i++)
    {
        action();
    }
    watch.Stop();

    Console.WriteLine("Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}
Benchmarking small code samples in C#, can this implementation be improved? http://stackoverflow.com/q/1047218/4500
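// Storing the result in a field means the call cannot be discarded as dead code.
// T has to be a type parameter of the enclosing class for this field to compile.
private static T Result;

static void Profile(int iterations, Func<T> func)
{
    func();       // warm up
    GC.Collect(); // clean up

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < iterations; i++)
    {
        Result = func();
    }
    watch.Stop();

    Console.WriteLine("Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}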
BenchmarkDotNet project
Andrey Akinshin (the ‘Boss’)
@andrey_akinshin
http://aakinshin.net/en/blog/
Matt Warren (me)
Adam Sitnik (.NET Core guru)
@SitnikAdam
http://adamsitnik.com/
Benchmarking library that is:
• Accurate
• Easy-to-use
• Helpful
Stopwatch under the hood http://aakinshin.net/en/blog/dotnet/stopwatch/
LegacyJIT-x86 and first method call http://aakinshin.net/en/blog/dotnet/legacyjitx86-and-first-method-call/
Goals of BenchmarkDotNet
What BenchmarkDotNet doesn’t do
• Multi-threaded benchmarks
• Integrate with CI builds
• Unit test runner integration
• Anything else? http://github.com/dotnet/BenchmarkDotNet/issues/
“Other Benchmarking tools are available”
• NBench: https://github.com/petabridge/NBench
• Microsoft Xunit performance: http://github.com/Microsoft/xunit-performance/
• Lambda Micro Benchmarking (“Clash of the Lambdas”): https://github.com/biboudis/LambdaMicrobenchmarking
• Etimo.Benchmarks: http://etimo.se/blog/etimo-benchmarks-lightweight-net-benchmark-tool/
• MeasureIt: https://blogs.msdn.microsoft.com/vancem/2009/02/06/measureit-update-tool-for-doing-microbenchmarks-for-net/
How it works
An invocation of the target method is an operation.
A bunch of operations is an iteration.
Iteration types:
• Pilot: The best operation count will be chosen.
• IdleWarmup, IdleTarget: BenchmarkDotNet overhead will be evaluated.
• MainWarmup: Warmup of the main method.
• MainTarget: Main measurements.
• Result = MainTarget – AverageOverhead
http://benchmarkdotnet.org/HowItWorks.htm
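The relationship between operations, iterations and overhead described above can be sketched in a few lines. This is only an illustration of the idea, not BenchmarkDotNet's actual implementation: the class `OverheadSketch` and its methods `MeasureIteration` and `SubtractOverhead` are hypothetical names invented for this example.

```csharp
using System;
using System.Diagnostics;

// Hypothetical sketch, not BenchmarkDotNet code.
public static class OverheadSketch
{
    // One iteration = a bunch of operations (invocations of the target).
    // Returns the average time per operation in nanoseconds.
    public static double MeasureIteration(Action action, int operations)
    {
        long start = Stopwatch.GetTimestamp();
        for (int i = 0; i < operations; i++)
        {
            action();
        }
        long ticks = Stopwatch.GetTimestamp() - start;
        return ticks * 1e9 / Stopwatch.Frequency / operations;
    }

    // Result = MainTarget - AverageOverhead, applied to every main measurement.
    public static double[] SubtractOverhead(double[] mainTargetNs, double[] idleTargetNs)
    {
        double overheadSum = 0;
        foreach (double ns in idleTargetNs)
        {
            overheadSum += ns;
        }
        double averageOverhead = overheadSum / idleTargetNs.Length;

        var results = new double[mainTargetNs.Length];
        for (int i = 0; i < mainTargetNs.Length; i++)
        {
            results[i] = mainTargetNs[i] - averageOverhead;
        }
        return results;
    }
}
```

In this sketch the idle measurements would come from calling `MeasureIteration` with an empty delegate and the main measurements from calling it with the real benchmark body, using the same operation count for both.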
Scale of benchmarks
• millisecond (ms): one thousandth of one second, e.g. a single webapp request
• microsecond (us or µs): one millionth of one second, e.g. several in-memory operations
• nanosecond (ns): one billionth of one second, e.g. single operations
Who ‘times’ the timers?
[Benchmark]
public long StopwatchLatency()
{
    return Stopwatch.GetTimestamp();
}

[Benchmark]
public long StopwatchGranularity()
{
    // Loop until Stopwatch.GetTimestamp()
    // gives us a different value
    long lastTimestamp = Stopwatch.GetTimestamp();
    while (Stopwatch.GetTimestamp() == lastTimestamp)
    {
    }
    return lastTimestamp;
}

[Benchmark]
public long DateTimeLatency()
{
    return DateTime.Now.Ticks;
}

[Benchmark]
public long DateTimeGranularity()
{
    // Loop until DateTime.Now
    // gives us a different value
    long lastTimestamp = DateTime.Now.Ticks;
    while (DateTime.Now.Ticks == lastTimestamp)
    {
    }
    return lastTimestamp;
}
Who ‘times’ the timers?
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Job-FIDMNL : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Method | Mean | StdDev | Allocated |
--------------------- |---------------- |------------ |---------- |
StopwatchLatency | 12.9960 ns | 0.1609 ns | 0 B |
StopwatchGranularity | 374.3049 ns | 2.4388 ns | 0 B |
DateTimeLatency | 682.2320 ns | 8.9341 ns | 32 B |
DateTimeGranularity | 996,025.6492 ns | 413.9175 ns | 47.34 kB |
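The `Resolution` value in the header follows directly from the reported `Frequency`: one tick lasts 1,000,000,000 / Frequency nanoseconds. The small helper below shows the arithmetic; `TimerMath` and its method names are made up for this example, and on a live machine the frequency would come from `Stopwatch.Frequency`.

```csharp
using System;

// Hypothetical helper for converting timer ticks to nanoseconds.
public static class TimerMath
{
    // Duration of a single tick, in nanoseconds, for a timer running at frequencyHz.
    public static double ResolutionNs(long frequencyHz)
    {
        return 1e9 / frequencyHz;
    }

    // Converts an elapsed tick count to nanoseconds.
    public static double TicksToNs(long ticks, long frequencyHz)
    {
        return ticks * 1e9 / frequencyHz;
    }
}
```

`TimerMath.ResolutionNs(2630673)` gives roughly 380.13 ns, which matches the `Resolution=380.1309 ns` line above and sits close to the 374 ns measured by StopwatchGranularity.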
Loop-the-Loop
“Avoid foreach loop on everything except raw arrays?”
[Benchmark(Baseline = true)]
public int ForLoopArray()
{
    var counter = 0;
    for (int i = 0; i < anArray.Length; i++)
        counter += anArray[i];
    return counter;
}

[Benchmark]
public int ForEachArray()
{
    var counter = 0;
    foreach (var i in anArray)
        counter += i;
    return counter;
}

[Benchmark]
public int ForLoopList()
{
    var counter = 0;
    for (int i = 0; i < aList.Count; i++)
        counter += aList[i];
    return counter;
}

[Benchmark]
public int ForEachList()
{
    var counter = 0;
    foreach (var i in aList)
        counter += i;
    return counter;
}
Loop-the-Loop
“Avoid foreach loop on everything except raw arrays?”
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Method | Mean | StdDev | Scaled | Scaled-StdDev |
--------------- |-------------- |------------ |------- |-------------- |
ForLoopArray | 383.8279 ns | 2.9472 ns | 1.00 | 0.00 |
ForEachArray | 392.5611 ns | 4.1286 ns | 1.02 | 0.01 |
ForLoopList | 2,315.9658 ns | 12.1001 ns | 6.03 | 0.05 |
ForEachList | 2,663.5771 ns | 21.9822 ns | 6.94 | 0.08 |
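One way to read the gap between the array and list rows is to look at what a `foreach` over a `List<int>` expands to. The sketch below writes that expansion out by hand next to the original loop; the class `ForEachLowering` and its method names are invented for this example, and the expansion is an approximation of what the compiler emits, not a dump of its output.

```csharp
using System.Collections.Generic;

// Hypothetical illustration of foreach over List<int>.
public static class ForEachLowering
{
    // The loop as written in ForEachList.
    public static int SumWithForEach(List<int> list)
    {
        var counter = 0;
        foreach (var i in list)
            counter += i;
        return counter;
    }

    // Roughly what that loop turns into: a struct enumerator whose
    // MoveNext() and Current are called on every element.
    public static int SumWithEnumerator(List<int> list)
    {
        var counter = 0;
        List<int>.Enumerator e = list.GetEnumerator();
        try
        {
            while (e.MoveNext())
                counter += e.Current;
        }
        finally
        {
            e.Dispose();
        }
        return counter;
    }
}
```

A `foreach` over a raw array, by contrast, is compiled to a plain indexed loop, which is consistent with ForLoopArray and ForEachArray being almost identical in the table.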
Abstractions - IDictionary v Dictionary
Dictionary<string, string> dictionary = new Dictionary<string, string>();
IDictionary<string, string> iDictionary = (IDictionary<string, string>)dictionary;

[Benchmark]
public Dictionary<string, string> DictionaryEnumeration()
{
    foreach (var item in dictionary) { ; }
    return dictionary;
}

[Benchmark]
public IDictionary<string, string> IDictionaryEnumeration()
{
    foreach (var item in iDictionary) { ; }
    return iDictionary;
}
Abstractions - IDictionary v Dictionary
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
Method | Mean | StdErr | StdDev | Gen 0 | Allocated |
----------------------- |----------- |---------- |---------- |------- |---------- |
DictionaryEnumeration | 24.0353 ns | 0.2403 ns | 0.9307 ns | - | 0 B |
IDictionaryEnumeration | 41.6301 ns | 0.4479 ns | 2.1944 ns | 0.0086 | 32 B |
// * Diagnostic Output - MemoryDiagnoser *
Note: the Gen 0/1/2 Measurements are per 1k Operations
Abstractions - IDictionary v Dictionary
Dictionary<string, string> dictionary = new Dictionary<string, string>();
IDictionary<string, string> iDictionary = (IDictionary<string, string>)dictionary;

// struct – so doesn't allocate
Dictionary<string, string>.Enumerator enumerator =
    dictionary.GetEnumerator();

// interface - allocates 56 B (64-bit) and 32 B (32-bit)
IEnumerator<KeyValuePair<string, string>> enumerator =
    iDictionary.GetEnumerator();
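The allocation comes from boxing: the enumerator is the same struct in both cases, but returning it through an interface type forces it onto the heap. The check below is a small sketch of that; `EnumeratorBoxing` and its method names are hypothetical, and a non-empty dictionary is used because some runtimes may special-case empty collections.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical check showing that the interface call hands back a boxed struct.
public static class EnumeratorBoxing
{
    // Dictionary<TKey, TValue>.Enumerator is a value type.
    public static bool EnumeratorIsStruct()
    {
        return typeof(Dictionary<string, string>.Enumerator).IsValueType;
    }

    // Through IDictionary the runtime type is still that struct,
    // but it is now held via a reference, i.e. it has been boxed.
    public static bool InterfaceEnumeratorIsBoxedStruct()
    {
        IDictionary<string, string> iDictionary =
            new Dictionary<string, string> { { "key", "value" } };

        using (IEnumerator<KeyValuePair<string, string>> e = iDictionary.GetEnumerator())
        {
            return e.GetType() == typeof(Dictionary<string, string>.Enumerator);
        }
    }
}
```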
Low-level increments
[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job]
public class Program
{
    private double a, b, c, d;

    [Benchmark(OperationsPerInvoke = 4)]
    public void MethodA()
    {
        a++; b++; c++; d++;
    }

    [Benchmark(OperationsPerInvoke = 4)]
    public void MethodB()
    {
        a++; a++; a++; a++;
    }
}
Low-level increments
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0
LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0
Runtime=Clr Allocated=0 B
Method | Job | Jit | Platform | Mean | StdErr | StdDev |
----------- |------------- |---------- |--------- |---------- |---------- |---------- |
Parallel | LegacyJitX64 | LegacyJit | X64 | 0.3420 ns | 0.0015 ns | 0.0057 ns |
Sequential | LegacyJitX64 | LegacyJit | X64 | 2.2038 ns | 0.0014 ns | 0.0051 ns |
Parallel | LegacyJitX86 | LegacyJit | X86 | 0.3276 ns | 0.0005 ns | 0.0020 ns |
Sequential | LegacyJitX86 | LegacyJit | X86 | 2.5229 ns | 0.0048 ns | 0.0187 ns |
Parallel | RyuJitX64 | RyuJit | X64 | 0.3686 ns | 0.0037 ns | 0.0144 ns |
Sequential | RyuJitX64 | RyuJit | X64 | 0.8959 ns | 0.0023 ns | 0.0090 ns |
MethodA = Parallel, MethodB = Sequential
http://en.wikipedia.org/wiki/Instruction-level_parallelism
Search - Linear v Binary
private static int LinearSearch(Data[] set, int key)
{
    for (int i = 0; i < set.Length; i++)
    {
        var c = set[i].Key - key;
        if (c == 0)
        {
            return i;
        }
        if (c > 0)
        {
            return ~i;
        }
    }
    return ~set.Length;
}

private static int BinarySearch(Data[] set, int key)
{
    int i = 0;
    int up = set.Length - 1;
    while (i <= up)
    {
        int mid = (up - i) / 2 + i;
        int c = set[mid].Key - key;
        if (c == 0)
        {
            return mid;
        }
        if (c < 0)
            i = mid + 1;
        else
            up = mid - 1;
    }
    return ~i;
}
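Both search methods return `~i` when the key is missing, the same convention as the framework's `Array.BinarySearch`: a negative result is the bitwise complement of the index where the key would be inserted. A caller might decode it as sketched below; `SearchResult.TryDecode` is a hypothetical helper, not part of the slides.

```csharp
using System;

// Hypothetical helper for interpreting the value returned by the searches above.
public static class SearchResult
{
    // Returns true and the match index when the key was found;
    // otherwise returns false and the index at which the key would be inserted.
    public static bool TryDecode(int result, out int index)
    {
        if (result >= 0)
        {
            index = result;
            return true;
        }
        index = ~result; // undo the bitwise complement
        return false;
    }
}
```

For example, `Array.BinarySearch(new[] { 1, 3, 5 }, 4)` returns `~2`, and decoding it yields "not found, insert at index 2".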
Search - Linear v Binary
private readonly Data[][] dataSet;
private Data[] currentSet;
private int currentMid;
private int currentMax;

[Params(1, 2, 3, 4, 5, 7, 10, 12, 15)]
public int Size
{
    set
    {
        currentSet = dataSet[value];
        currentMax = value - 1;
        currentMid = value / 2;
    }
}
readonly fields
public struct Int256
{
    private readonly long bits0, bits1, bits2, bits3;

    public Int256(long bits0, long bits1, long bits2, long bits3)
    {
        this.bits0 = bits0; this.bits1 = bits1;
        this.bits2 = bits2; this.bits3 = bits3;
    }

    public long Bits0 { get { return bits0; } }
    public long Bits1 { get { return bits1; } }
    public long Bits2 { get { return bits2; } }
    public long Bits3 { get { return bits3; } }
}

[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job]
public class Program
{
    // The two fields differ only in the readonly modifier.
    private readonly Int256 readOnlyField = new Int256(1L, 5L, 10L, 100L);
    private Int256 field = new Int256(1L, 5L, 10L, 100L);

    [Benchmark]
    public long GetValue()
    {
        return field.Bits0 + field.Bits1 + field.Bits2 + field.Bits3;
    }

    [Benchmark]
    public long GetReadOnlyValue()
    {
        return readOnlyField.Bits0 + readOnlyField.Bits1 +
               readOnlyField.Bits2 + readOnlyField.Bits3;
    }
}
readonly fields
BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0
LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0
Runtime=Clr Allocated=0 B
Method | Job | Jit | Platform | Mean | StdErr | StdDev |
----------------- |------------- |---------- |--------- |---------- |---------- |---------- |
GetValue | LegacyJitX64 | LegacyJit | X64 | 0.7893 ns | 0.0078 ns | 0.0291 ns |
GetReadOnlyValue | LegacyJitX64 | LegacyJit | X64 | 9.5362 ns | 0.0251 ns | 0.0971 ns |
GetValue | LegacyJitX86 | LegacyJit | X86 | 1.4625 ns | 0.0506 ns | 0.1959 ns |
GetReadOnlyValue | LegacyJitX86 | LegacyJit | X86 | 1.9743 ns | 0.0641 ns | 0.2481 ns |
GetValue | RyuJitX64 | RyuJit | X64 | 0.3852 ns | 0.0183 ns | 0.0710 ns |
GetReadOnlyValue | RyuJitX64 | RyuJit | X64 | 9.6406 ns | 0.0803 ns | 0.3109 ns |
https://codeblog.jonskeet.uk/2014/07/16/micro-optimization-the-surprising-inefficiency-of-readonly-fields/
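The linked post attributes the slowdown to defensive copies: when a member is called on a struct stored in a readonly field, the compiler first copies the struct so the call cannot modify the field. The copy is easiest to see with a mutating method. The sketch below uses hypothetical `Counter` and `Holder` types that are not in the slides.

```csharp
// Hypothetical types illustrating defensive copies of readonly struct fields.
public struct Counter
{
    public int Value;

    public void Increment()
    {
        Value++;
    }
}

public class Holder
{
    private readonly Counter readOnlyCounter = new Counter();
    private Counter counter = new Counter();

    // Increment() runs on a temporary copy, so the field itself never changes.
    public int IncrementReadOnly()
    {
        readOnlyCounter.Increment();
        return readOnlyCounter.Value;
    }

    // Increment() runs directly on the field.
    public int IncrementMutable()
    {
        counter.Increment();
        return counter.Value;
    }
}
```

`IncrementReadOnly()` returns 0 while `IncrementMutable()` returns 1. In the benchmark above, each property getter on `readOnlyField` would likewise be invoked on a fresh copy of the 32-byte `Int256`, which is a plausible source of the extra cost on the 64-bit JITs.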
MOAR Benchmarks!!
• Analysing Optimisations in the Wire Serialiser: http://mattwarren.org/2016/08/23/Analysing-Optimisations-in-the-Wire-Serialiser/
• Optimising LINQ: http://mattwarren.org/2016/09/29/Optimising-LINQ/
• Why is reflection slow?: http://mattwarren.org/2016/12/14/Why-is-Reflection-slow/
• Why Exceptions should be Exceptional: http://mattwarren.org/2016/12/20/Why-Exceptions-should-be-Exceptional/