where the wild things are - benchmarking and micro-optimisations


Upload: matt-warren

Post on 28-Jan-2018


WHERE THE WILD THINGS ARE

Benchmarking and Micro-Optimisations

Matt Warren (@matthewwarren)

http://mattwarren.org/

Premature Optimization

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

- Donald Knuth

Profiling Tools

• ANTS Performance Profiler - Redgate

• dotTrace & dotMemory - JetBrains

• PerfView - Microsoft (free)

• Visual Studio Profiling Tools (Ultimate, Premium or Professional)

• MiniProfiler - Stack Overflow (free)

BenchmarkDotNet

Why do you need a benchmarking library?

static void Profile(int iterations, Action action)
{
    action();     // warm up
    GC.Collect(); // clean up

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < iterations; i++)
    {
        action();
    }
    watch.Stop();

    Console.WriteLine("Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}

Benchmarking small code samples in C#, can this implementation be improved?
http://stackoverflow.com/q/1047218/4500

private static T Result;

static void Profile<T>(int iterations, Func<T> func)
{
    func();       // warm up
    GC.Collect(); // clean up

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < iterations; i++)
    {
        Result = func();
    }
    watch.Stop();

    Console.WriteLine("Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}

Benchmarking small code samples in C#, can this implementation be improved?
http://stackoverflow.com/q/1047218/4500

BenchmarkDotNet project

Andrey Akinshin (the ‘Boss’)

@andrey_akinshin

http://aakinshin.net/en/blog/

Matt Warren (me)

Adam Sitnik (.NET Core guru)

@SitnikAdam

http://adamsitnik.com/

.NET Foundation

Goals of BenchmarkDotNet

Benchmarking library that is:

• Accurate

• Easy-to-use

• Helpful

Stopwatch under the hood http://aakinshin.net/en/blog/dotnet/stopwatch/

LegacyJIT-x86 and first method call http://aakinshin.net/en/blog/dotnet/legacyjitx86-and-first-method-call/

Goals of BenchmarkDotNet

Proper docs! benchmarkdotnet.org/

What BenchmarkDotNet doesn’t do

• Multi-threaded benchmarks

• Integrate with CI builds

• Unit test runner integration

• Anything else? http://github.com/dotnet/BenchmarkDotNet/issues/

“Other Benchmarking tools are available”

• NBench: https://github.com/petabridge/NBench

• Microsoft Xunit performance: http://github.com/Microsoft/xunit-performance/

• Lambda Micro Benchmarking (“Clash of the Lambdas”): https://github.com/biboudis/LambdaMicrobenchmarking

• Etimo.Benchmarks: http://etimo.se/blog/etimo-benchmarks-lightweight-net-benchmark-tool/

• MeasureIt: https://blogs.msdn.microsoft.com/vancem/2009/02/06/measureit-update-tool-for-doing-microbenchmarks-for-net/

How it works

An invocation of the target method is an operation.

A bunch of operations is an iteration.

Iteration types:

• Pilot: The best operation count will be chosen.

• IdleWarmup, IdleTarget: BenchmarkDotNet overhead will be evaluated.

• MainWarmup: Warmup of the main method.

• MainTarget: Main measurements.

• Result = MainTarget – AverageOverhead

http://benchmarkdotnet.org/HowItWorks.htm
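The final step in the list above is plain arithmetic on the measured iterations. A minimal sketch of that step, assuming every iteration has already been reduced to a nanoseconds-per-operation figure. The class and method names are hypothetical and this is not BenchmarkDotNet's own code:

```csharp
using System;
using System.Linq;

// Hypothetical helper: illustrates "Result = MainTarget - AverageOverhead".
public static class OverheadSketch
{
    // idleTargetNs: ns/op measured while running only the harness (overhead).
    // mainTargetNs: ns/op measured while running the real benchmark method.
    public static double[] SubtractOverhead(double[] idleTargetNs, double[] mainTargetNs)
    {
        // Average the overhead measurements once...
        double averageOverhead = idleTargetNs.Average();

        // ...then remove that average from every main measurement.
        return mainTargetNs.Select(ns => ns - averageOverhead).ToArray();
    }
}
```

Calling `OverheadSketch.SubtractOverhead` with the idle and main measurements returns one corrected figure per main iteration, which is the kind of data a Mean and StdDev column would then be computed from.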

What happens under the covers?

Image credit Albert Rodríguez @UncleFirefox

DEMO: ‘Hello World’ Benchmark

Scale of benchmarks

• millisecond (ms): one thousandth of one second, e.g. a single webapp request

• microsecond (us or µs): one millionth of one second, e.g. several in-memory operations

• nanosecond (ns): one billionth of one second, e.g. single operations

Who ‘times’ the timers?

[Benchmark]
public long StopwatchLatency()
{
    return Stopwatch.GetTimestamp();
}

[Benchmark]
public long StopwatchGranularity()
{
    // Loop until Stopwatch.GetTimestamp()
    // gives us a different value
    long lastTimestamp = Stopwatch.GetTimestamp();
    while (Stopwatch.GetTimestamp() == lastTimestamp)
    {
    }
    return lastTimestamp;
}

[Benchmark]
public long DateTimeLatency()
{
    return DateTime.Now.Ticks;
}

[Benchmark]
public long DateTimeGranularity()
{
    // Loop until DateTime.Now
    // gives us a different value
    long lastTimestamp = DateTime.Now.Ticks;
    while (DateTime.Now.Ticks == lastTimestamp)
    {
    }
    return lastTimestamp;
}

BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
  Job-FIDMNL : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0

               Method |            Mean |      StdDev | Allocated |
--------------------- |---------------- |------------ |---------- |
     StopwatchLatency |      12.9960 ns |   0.1609 ns |       0 B |
 StopwatchGranularity |     374.3049 ns |   2.4388 ns |       0 B |
      DateTimeLatency |     682.2320 ns |   8.9341 ns |      32 B |
  DateTimeGranularity | 996,025.6492 ns | 413.9175 ns |  47.34 kB |
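The Resolution figure in the header is the length of one timer tick, which follows from the reported Frequency. A minimal sketch of that arithmetic, using only the numbers printed above. The class and method names are illustrative and not part of BenchmarkDotNet:

```csharp
using System;
using System.Diagnostics;

// Illustrative helper: converts a timer frequency into the length of one tick.
public static class TimerResolutionSketch
{
    // One tick lasts 1 / frequency seconds, i.e. 1e9 / frequency nanoseconds.
    public static double TickLengthNs(long frequencyHz)
    {
        return 1e9 / frequencyHz;
    }

    // The same calculation for the machine this code runs on.
    public static double CurrentTickLengthNs()
    {
        return TickLengthNs(Stopwatch.Frequency);
    }
}
```

For the Frequency of 2630673 Hz shown in the header, `TickLengthNs` gives roughly 380.13 ns. The measured StopwatchGranularity of about 374 ns is close to that tick length.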


Loop-the-Loop
“Avoid foreach loop on everything except raw arrays?”

[Benchmark(Baseline = true)]
public int ForLoopArray()
{
    var counter = 0;
    for (int i = 0; i < anArray.Length; i++)
        counter += anArray[i];
    return counter;
}

[Benchmark]
public int ForEachArray()
{
    var counter = 0;
    foreach (var i in anArray)
        counter += i;
    return counter;
}

[Benchmark]
public int ForLoopList()
{
    var counter = 0;
    for (int i = 0; i < aList.Count; i++)
        counter += aList[i];
    return counter;
}

[Benchmark]
public int ForEachList()
{
    var counter = 0;
    foreach (var i in aList)
        counter += i;
    return counter;
}

BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
  DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0

         Method |          Mean |      StdDev | Scaled | Scaled-StdDev |
--------------- |-------------- |------------ |------- |-------------- |
   ForLoopArray |   383.8279 ns |   2.9472 ns |   1.00 |          0.00 |
   ForEachArray |   392.5611 ns |   4.1286 ns |   1.02 |          0.01 |
    ForLoopList | 2,315.9658 ns |  12.1001 ns |   6.03 |          0.05 |
    ForEachList | 2,663.5771 ns |  21.9822 ns |   6.94 |          0.08 |

Loop-the-Loop – ‘for loop’ - Arrays

Loop-the-Loop – ‘for loop’ - Lists

Abstractions - IDictionary v Dictionary

Dictionary<string, string> dictionary =
    new Dictionary<string, string>();

IDictionary<string, string> iDictionary =
    (IDictionary<string, string>)dictionary;

[Benchmark]
public Dictionary<string, string> DictionaryEnumeration()
{
    foreach (var item in dictionary) { ; }
    return dictionary;
}

[Benchmark]
public IDictionary<string, string> IDictionaryEnumeration()
{
    foreach (var item in iDictionary) { ; }
    return iDictionary;
}

BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
  DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0

                 Method |       Mean |    StdErr |    StdDev |  Gen 0 | Allocated |
----------------------- |----------- |---------- |---------- |------- |---------- |
  DictionaryEnumeration | 24.0353 ns | 0.2403 ns | 0.9307 ns |      - |       0 B |
 IDictionaryEnumeration | 41.6301 ns | 0.4479 ns | 2.1944 ns | 0.0086 |      32 B |

// * Diagnostic Output - MemoryDiagnoser *
Note: the Gen 0/1/2 Measurements are per 1k Operations

Abstractions - IDictionary v Dictionary

Dictionary<string, string> dictionary =
    new Dictionary<string, string>();

IDictionary<string, string> iDictionary =
    (IDictionary<string, string>)dictionary;

// struct - so doesn't allocate
Dictionary<string, string>.Enumerator enumerator =
    dictionary.GetEnumerator();

// interface - allocates 56 B (64-bit) and 32 B (32-bit)
IEnumerator<KeyValuePair<string, string>> enumerator =
    iDictionary.GetEnumerator();
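The difference between the two calls above can be checked from the types involved. A minimal sketch, assuming a non-empty dictionary. The class and method names are hypothetical, and the sketch only inspects types, it does not measure the allocation itself:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: shows that the interface call hands back the same
// struct enumerator, but boxed behind a reference-typed interface.
public static class EnumeratorSketch
{
    // Dictionary<TKey, TValue>.Enumerator is a value type, so the direct
    // GetEnumerator() call needs no heap allocation.
    public static bool ConcreteEnumeratorIsStruct()
    {
        return typeof(Dictionary<string, string>.Enumerator).IsValueType;
    }

    // Through IDictionary the enumerator is returned as IEnumerator<T>,
    // a reference type, so the struct has to be boxed on the heap.
    public static bool InterfaceEnumeratorIsBoxedStruct()
    {
        IDictionary<string, string> iDictionary =
            new Dictionary<string, string> { { "key", "value" } };

        IEnumerator<KeyValuePair<string, string>> enumerator =
            iDictionary.GetEnumerator();

        return enumerator.GetType() == typeof(Dictionary<string, string>.Enumerator);
    }
}
```

Both methods are expected to return true, which is consistent with the 32 B allocated by IDictionaryEnumeration in the 32-bit results.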

Low-level increments

[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job]
public class Program
{
    private double a, b, c, d;

    [Benchmark(OperationsPerInvoke = 4)]
    public void MethodA()
    {
        a++; b++; c++; d++;
    }

    [Benchmark(OperationsPerInvoke = 4)]
    public void MethodB()
    {
        a++; a++; a++; a++;
    }
}

BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
  LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0

Runtime=Clr  Allocated=0 B

     Method |          Job |       Jit | Platform |      Mean |    StdErr |    StdDev |
----------- |------------- |---------- |--------- |---------- |---------- |---------- |
   Parallel | LegacyJitX64 | LegacyJit |      X64 | 0.3420 ns | 0.0015 ns | 0.0057 ns |
 Sequential | LegacyJitX64 | LegacyJit |      X64 | 2.2038 ns | 0.0014 ns | 0.0051 ns |
   Parallel | LegacyJitX86 | LegacyJit |      X86 | 0.3276 ns | 0.0005 ns | 0.0020 ns |
 Sequential | LegacyJitX86 | LegacyJit |      X86 | 2.5229 ns | 0.0048 ns | 0.0187 ns |
   Parallel |    RyuJitX64 |    RyuJit |      X64 | 0.3686 ns | 0.0037 ns | 0.0144 ns |
 Sequential |    RyuJitX64 |    RyuJit |      X64 | 0.8959 ns | 0.0023 ns | 0.0090 ns |

MethodA = Parallel, MethodB = Sequential

http://en.wikipedia.org/wiki/Instruction-level_parallelism

Search - Linear v Binary

private static int LinearSearch(Data[] set, int key)
{
    for (int i = 0; i < set.Length; i++)
    {
        var c = set[i].Key - key;
        if (c == 0)
        {
            return i;
        }
        if (c > 0)
        {
            return ~i;
        }
    }
    return ~set.Length;
}

private static int BinarySearch(Data[] set, int key)
{
    int i = 0;
    int up = set.Length - 1;
    while (i <= up)
    {
        int mid = (up - i) / 2 + i;
        int c = set[mid].Key - key;
        if (c == 0)
        {
            return mid;
        }
        if (c < 0)
            i = mid + 1;
        else
            up = mid - 1;
    }
    return ~i;
}
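Both methods above return `~i` when the key is missing, the same convention `Array.BinarySearch` uses: a negative result whose bitwise complement is the index where the key would be inserted. A minimal sketch of decoding such a result, using a sorted `int[]` in place of the `Data[]` type, which the slides do not define. The class and method names are hypothetical:

```csharp
using System;

// Hypothetical helper: same search shape as the slide, over a sorted int[].
public static class SearchResultSketch
{
    public static int BinarySearch(int[] sorted, int key)
    {
        int lo = 0;
        int hi = sorted.Length - 1;
        while (lo <= hi)
        {
            int mid = (hi - lo) / 2 + lo;
            int c = sorted[mid].CompareTo(key);
            if (c == 0)
                return mid;
            if (c < 0)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        // Not found: encode the insertion point as its bitwise complement,
        // which is always negative.
        return ~lo;
    }

    // Turns the encoded result back into something readable.
    public static string Describe(int result)
    {
        return result >= 0
            ? "found at index " + result
            : "not found, would be inserted at index " + ~result;
    }
}
```

Because `~i` is negative for every valid index, a caller can tell a hit from a miss with a single sign check and still recover the insertion point without a second search.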

Search - Linear v Binary

private readonly Data[][] dataSet;
private Data[] currentSet;
private int currentMid;
private int currentMax;

[Params(1, 2, 3, 4, 5, 7, 10, 12, 15)]
public int Size
{
    set
    {
        currentSet = dataSet[value];
        currentMax = value - 1;
        currentMid = value / 2;
    }
}

LinearSearch v Binary Search


readonly fields

public struct Int256
{
    private readonly long bits0, bits1, bits2, bits3;

    public Int256(long bits0, long bits1, long bits2, long bits3)
    {
        this.bits0 = bits0; this.bits1 = bits1;
        this.bits2 = bits2; this.bits3 = bits3;
    }

    public long Bits0 { get { return bits0; } }
    public long Bits1 { get { return bits1; } }
    public long Bits2 { get { return bits2; } }
    public long Bits3 { get { return bits3; } }
}

[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job]
public class Program
{
    private readonly Int256 readOnlyField = new Int256(1L, 5L, 10L, 100L);

    private Int256 field = new Int256(1L, 5L, 10L, 100L);

    [Benchmark]
    public long GetValue()
    {
        return field.Bits0 + field.Bits1 + field.Bits2 + field.Bits3;
    }

    [Benchmark]
    public long GetReadOnlyValue()
    {
        return readOnlyField.Bits0 + readOnlyField.Bits1 +
               readOnlyField.Bits2 + readOnlyField.Bits3;
    }
}

BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630673 Hz, Resolution=380.1309 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
  LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1590.0;compatjit-v4.6.1590.0
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1590.0
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1590.0

Runtime=Clr  Allocated=0 B

           Method |          Job |       Jit | Platform |      Mean |    StdErr |    StdDev |
----------------- |------------- |---------- |--------- |---------- |---------- |---------- |
         GetValue | LegacyJitX64 | LegacyJit |      X64 | 0.7893 ns | 0.0078 ns | 0.0291 ns |
 GetReadOnlyValue | LegacyJitX64 | LegacyJit |      X64 | 9.5362 ns | 0.0251 ns | 0.0971 ns |
         GetValue | LegacyJitX86 | LegacyJit |      X86 | 1.4625 ns | 0.0506 ns | 0.1959 ns |
 GetReadOnlyValue | LegacyJitX86 | LegacyJit |      X86 | 1.9743 ns | 0.0641 ns | 0.2481 ns |
         GetValue |    RyuJitX64 |    RyuJit |      X64 | 0.3852 ns | 0.0183 ns | 0.0710 ns |
 GetReadOnlyValue |    RyuJitX64 |    RyuJit |      X64 | 9.6406 ns | 0.0803 ns | 0.3109 ns |

https://codeblog.jonskeet.uk/2014/07/16/micro-optimization-the-surprising-inefficiency-of-readonly-fields/
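The explanation given in the post linked above is that the C# compiler copies a readonly struct field before calling a member on it, so that the member cannot modify the field. A minimal sketch that makes the copy visible, assuming a small mutable struct. `Counter` and `DefensiveCopySketch` are hypothetical names, and the sketch demonstrates the copy semantics only, not its cost:

```csharp
using System;

// Hypothetical mutable struct, used only to make the hidden copy observable.
public struct Counter
{
    private int count;

    public int Increment()
    {
        return ++count;
    }
}

public class DefensiveCopySketch
{
    private readonly Counter readOnlyCounter = new Counter();
    private Counter counter = new Counter();

    // Each call runs on a temporary copy of the readonly field,
    // so the field itself never changes and the second call still sees 0.
    public int IncrementReadOnlyTwice()
    {
        readOnlyCounter.Increment();
        return readOnlyCounter.Increment();
    }

    // Calls on the ordinary field mutate it in place.
    public int IncrementMutableTwice()
    {
        counter.Increment();
        return counter.Increment();
    }
}
```

On a new instance, `IncrementReadOnlyTwice()` returns 1 while `IncrementMutableTwice()` returns 2. For a 32-byte struct such as `Int256`, a copy like this on every property access is one plausible source of the slower GetReadOnlyValue figures.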

MOAR Benchmarks!!

Analysing Optimisations in the Wire Serialiser

• http://mattwarren.org/2016/08/23/Analysing-Optimisations-in-the-Wire-Serialiser/

Optimising LINQ

• http://mattwarren.org/2016/09/29/Optimising-LINQ/

Why is reflection slow?

• http://mattwarren.org/2016/12/14/Why-is-Reflection-slow/

Why Exceptions should be Exceptional

• http://mattwarren.org/2016/12/20/Why-Exceptions-should-be-Exceptional/

Resources

QUESTIONS?