(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014
DESCRIPTION
In this session, we explain how to measure the key performance-impacting metrics in a cloud-based application and describe best practices for a reliable benchmarking process. Measuring application performance correctly can be challenging, and there are many tools available to measure and track performance. This session provides specific examples of good and bad tests. We show how to get reliable measurements and how to map benchmark results to your application. We also cover the importance of selecting tests wisely, repeating tests, and measuring variability. In addition, a customer provides real-life examples of how they developed their application testing stack, use it for repeatable testing, and identify bottlenecks.
TRANSCRIPT
• The best benchmark
• Absolute vs. relative measures
• Fixed time or fixed work
• What’s different?
• Use a good AMI
[Figure: average CPU result (axis 0.00 to 30.00) and coefficient of variation (axis 0% to 60%) for five AMIs: Ubuntu 12.4 ami-…, AWS CentOS 5.4 ami-…, and three CentOS 5.4 ami-… images]
• Application runs on premises
• Primary requirement is integer CPU performance
• Application is complex to set up, no benchmark tests exist, limited time
• What instance would work best?
1. Choose a synthetic benchmark
2. Baseline: Build, configure, tune, and run it on premises
3. Run the same test (or tests) on a set of instance types
4. Use results from the instance tests to choose the best match
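Steps 2 to 4 reduce to two numbers per instance type: a ratio against the on-premises baseline and a coefficient of variation (C.O.V.) across repeated runs. The following is a minimal Python sketch of that arithmetic, assuming scores where higher is better. The function names and the sample scores are hypothetical and are not taken from the session.

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean, as a fraction."""
    return stdev(samples) / mean(samples)

def ratio_to_baseline(samples, baseline_samples):
    """Mean score relative to the baseline mean (1.00 means equal to baseline)."""
    return mean(samples) / mean(baseline_samples)

# Hypothetical scores from repeated runs (higher is better).
baseline_runs = [1000.0, 1010.0, 990.0]     # on-premises baseline
candidate_runs = [1120.0, 1130.0, 1140.0]   # one candidate instance type

print(f"ratio  = {ratio_to_baseline(candidate_runs, baseline_runs):.2f}")
print(f"C.O.V. = {coefficient_of_variation(candidate_runs):.2%}")
```

A low C.O.V. suggests the ratio can be trusted; a high one suggests more runs are needed before comparing instance types.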
Integer: AES, Twofish, SHA1, SHA2, BZip2 compress, BZip2 decompress, JPEG compress, JPEG decompress, PNG compress, PNG decompress, Sobel, LUA, Dijkstra
Floating Point: Black-Scholes, Mandelbrot, Sharpen image, Blur image, SGEMM, DGEMM, SFFT, DFFT, N-Body, Ray trace
Memory: STREAM copy, STREAM scale, STREAM add, STREAM triad
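# Record which instance produced the result (EC2 instance metadata service)
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
# Run Geekbench without uploading results; $GBTXT is the output file
./geekbench_x86_64 --no-upload >$GBTXT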
Geekbench
1CPU ratio C.O.V. NCPU ratio C.O.V. RT (min)
m3.xlarge 0.93 1.04% 2.04 2.31% 2.06
m3.2xlarge 0.93 1.40% 3.80 1.46% 2.08
m2.xlarge 0.80 2.84% 1.54 4.06% 1.99
m2.2xlarge 0.80 1.34% 2.82 1.21% 2.04
m2.4xlarge 0.76 2.28% 5.11 1.71% 2.01
c3.large 1.13 0.93% 1.32 0.71% 1.76
c3.xlarge 1.13 0.39% 2.51 1.81% 1.74
c3.2xlarge 1.13 0.19% 4.88 0.25% 1.70
cc2.8xlarge 1.00 0.71% 15.46 1.93% 2.21
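One way to read the two ratio columns together is to divide them, which gives the speedup each instance gets from using all of its CPUs instead of one. The Python sketch below does that with the values copied from the table above. It assumes both columns are normalized to the same baseline (the cc2.8xlarge single-CPU result, which reads 1.00); the slides do not state this explicitly, and the function name is hypothetical.

```python
# (1CPU ratio, NCPU ratio) copied from the Geekbench table above.
geekbench = {
    "m3.xlarge":   (0.93, 2.04),
    "m3.2xlarge":  (0.93, 3.80),
    "m2.xlarge":   (0.80, 1.54),
    "m2.2xlarge":  (0.80, 2.82),
    "m2.4xlarge":  (0.76, 5.11),
    "c3.large":    (1.13, 1.32),
    "c3.xlarge":   (1.13, 2.51),
    "c3.2xlarge":  (1.13, 4.88),
    "cc2.8xlarge": (1.00, 15.46),
}

def multicore_speedup(one_cpu_ratio, n_cpu_ratio):
    """All-CPU result divided by single-CPU result on the same instance."""
    return n_cpu_ratio / one_cpu_ratio

for name, (one_cpu, n_cpu) in geekbench.items():
    print(f"{name:12s} {multicore_speedup(one_cpu, n_cpu):6.2f}x")
```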
geekbench 1CPU ratio C.O.V.
m3.xlarge
instance-1 0.93 0.31%
instance-2 0.97 0.23%
instance-3 0.94 0.17%
instance-4 0.94 0.10%
instance-5 0.94 0.32%
instance-6 0.94 0.10%
instance-7 0.93 0.25%
instance-8 0.93 0.38%
instance-9 0.94 0.11%
instance-10 0.94 0.09%
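The per-instance table supports two kinds of variability: run-to-run within one instance (the C.O.V. column) and instance-to-instance within one type. A small Python sketch, using only the ten rows above, computes the second and compares it with the first. The variable names are illustrative.

```python
from statistics import mean, stdev

# 1CPU ratio and within-instance C.O.V. (percent) for the ten m3.xlarge instances above.
one_cpu_ratio = [0.93, 0.97, 0.94, 0.94, 0.94, 0.94, 0.93, 0.93, 0.94, 0.94]
within_instance_cov = [0.31, 0.23, 0.17, 0.10, 0.32, 0.10, 0.25, 0.38, 0.11, 0.09]

# Variability between instances of the same type, as a percentage.
between_instance_cov = stdev(one_cpu_ratio) / mean(one_cpu_ratio) * 100

print(f"mean ratio across instances : {mean(one_cpu_ratio):.2f}")
print(f"C.O.V. between instances    : {between_instance_cov:.2f}%")
print(f"largest C.O.V. within one   : {max(within_instance_cov):.2f}%")
```

For these rows the spread between instances (about 1.2%) is larger than the run-to-run spread on any single instance, which is one reason to test more than one instance of a type.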
gb-integer 1CPU ratio C.O.V. NCPU ratio C.O.V. RT (min)
c3.large 1.12 0.50% 1.37 0.43% NA
c3.xlarge 1.13 0.38% 2.72 0.41% NA
c3.2xlarge 1.12 0.38% 5.35 0.51% NA
cc2.8xlarge 1.00 0.20% 17.88 3.31% NA
geekbench
c3.large 1.13 0.93% 1.32 0.71% 1.76
c3.xlarge 1.13 0.39% 2.51 1.81% 1.74
c3.2xlarge 1.13 0.19% 4.88 0.25% 1.70
cc2.8xlarge 1.00 0.71% 15.46 1.93% 2.21
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
./Run -c 1 -c $COPIES >$FN
UnixBench 1CPU ratio C.O.V. NCPU ratio C.O.V. RT (min)
m3.xlarge 1.38 1.90% 2.49 1.36% 28.25
m3.2xlarge 1.42 1.85% 4.21 1.99% 28.29
m2.xlarge 0.40 5.82% 0.76 1.28% 28.30
m2.2xlarge 0.42 1.71% 1.23 1.75% 28.32
m2.4xlarge 0.48 3.31% 2.02 1.71% 28.34
c3.large 1.10 1.33% 1.91 1.54% 28.17
c3.xlarge 1.06 1.48% 2.85 1.26% 28.21
c3.2xlarge 1.10 0.54% 4.50 1.02% 28.96
cc2.8xlarge 1.00 2.97% 6.44 2.65% 30.20
UB-Integer 1CPU ratio C.O.V. NCPU ratio C.O.V. RT (min)
c3.large 1.05 0.24% 1.10 0.30% 0.17
c3.xlarge 1.05 0.27% 2.20 0.28% 0.17
c3.2xlarge 1.05 0.07% 4.34 0.23% 0.17
cc2.8xlarge 1.00 0.10% 15.54 0.95% 0.17
UnixBench
c3.large 1.10 1.33% 1.91 1.54% 28.17
c3.xlarge 1.06 1.48% 2.85 1.26% 28.21
c3.2xlarge 1.10 0.54% 4.50 1.02% 28.96
cc2.8xlarge 1.00 2.97% 6.44 2.65% 30.20
www.spec.org
Benchmark Language Category
400.perlbench C Programming language
401.bzip2 C Compression
403.gcc C C compiler
429.mcf C Combinatorial optimization
445.gobmk C Artificial intelligence
456.hmmer C Search gene sequence
458.sjeng C Artificial intelligence
462.libquantum C Physics / quantum computing
464.h264ref C Video compression
471.omnetpp C++ Discrete event simulation
473.astar C++ Path-finding algorithms
483.xalancbmk C++ XML processing
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
runspec --noreportable --tune=base --size=ref --rate=$COPIES --iterations=1 \
400 403 445 456 458 462 464 471 473 483
Est. SPECint 1CPU ratio C.O.V. RT (min) NCPU ratio C.O.V. RT (min)
m3.xlarge 1.01 1.06% 54.39 2.24 1.15% 104.18
m3.2xlarge 1.01 1.67% 54.49 4.25 1.63% 109.22
m2.xlarge 0.76 1.97% 70.83 1.39 2.45% 85.37
m2.2xlarge 0.79 0.94% 68.85 2.76 1.24% 85.42
m2.4xlarge 0.78 0.16% 68.73 5.21 1.26% 89.91
c3.large 1.11 1.95% 50.00 1.25 1.47% 94.22
c3.xlarge 1.10 1.96% 50.29 2.39 1.28% 97.66
c3.2xlarge 1.08 0.87% 50.87 4.67 0.25% 100.22
cc2.8xlarge 1.00 0.29% 54.92 14.92 0.52% 125.74
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
sysbench --num-threads=$TDS --max-requests=30000 --test=cpu \
--cpu-max-prime=100000 run > $FN
sysbench Default C.O.V. RT (min) Tuned ratio C.O.V. RT (min)
m3.xlarge 3.21 1.44% 0.06 1.69 1.29% 3.86
m3.2xlarge 6.41 1.38% 0.03 3.38 1.41% 1.93
m2.xlarge 1.59 0.75% 0.11 0.80 0.23% 8.16
m2.2xlarge 3.19 0.64% 0.06 1.60 0.76% 4.07
m2.4xlarge 8.83 0.62% 0.02 4.71 0.20% 1.38
c3.large 1.78 0.26% 0.10 0.91 0.09% 7.13
c3.xlarge 3.55 0.53% 0.05 1.83 0.02% 3.57
c3.2xlarge 6.55 8.45% 0.03 3.54 3.31% 1.85
cc2.8xlarge 25.34 2.30% 0.01 13.69 1.10% 0.48
             GB     GB Int  UB    UB Int  Est. SPECInt  sysbench default  sysbench tuned
m3.xlarge    2.04   2.01    2.49  1.88    2.24          3.21              1.69
m3.2xlarge   3.80   3.96    4.21  3.77    4.25          6.41              3.38
m2.xlarge    1.54   1.52    0.76  1.59    1.38          1.59              0.80
m2.2xlarge   2.82   3.02    1.23  3.19    2.76          3.19              1.60
m2.4xlarge   5.11   5.54    2.02  6.48    5.21          8.83              4.71
c3.large     1.32   1.37    1.91  1.10    1.25          1.78              0.91
c3.xlarge    2.51   2.72    2.85  2.20    2.39          3.55              1.83
c3.2xlarge   4.88   5.35    4.50  4.34    4.67          6.55              3.54
cc2.8xlarge  15.46  17.88   6.44  15.54   14.92         25.34             13.69
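The summary table shows how much the answer depends on the benchmark chosen. As a rough illustration, the Python sketch below finds, for each instance type, the lowest and highest ratio across the seven benchmark columns and how far apart they are. The values are copied from the table above (they appear to be the N-CPU ratios from the earlier tables); the helper name `spread` is hypothetical.

```python
BENCHMARKS = ["GB", "GB Int", "UB", "UB Int", "Est. SPECInt",
              "sysbench default", "sysbench tuned"]

# Rows copied from the summary table above, in BENCHMARKS column order.
summary = {
    "m3.xlarge":   [2.04, 2.01, 2.49, 1.88, 2.24, 3.21, 1.69],
    "m3.2xlarge":  [3.80, 3.96, 4.21, 3.77, 4.25, 6.41, 3.38],
    "m2.xlarge":   [1.54, 1.52, 0.76, 1.59, 1.38, 1.59, 0.80],
    "m2.2xlarge":  [2.82, 3.02, 1.23, 3.19, 2.76, 3.19, 1.60],
    "m2.4xlarge":  [5.11, 5.54, 2.02, 6.48, 5.21, 8.83, 4.71],
    "c3.large":    [1.32, 1.37, 1.91, 1.10, 1.25, 1.78, 0.91],
    "c3.xlarge":   [2.51, 2.72, 2.85, 2.20, 2.39, 3.55, 1.83],
    "c3.2xlarge":  [4.88, 5.35, 4.50, 4.34, 4.67, 6.55, 3.54],
    "cc2.8xlarge": [15.46, 17.88, 6.44, 15.54, 14.92, 25.34, 13.69],
}

def spread(scores):
    """Return (lowest ratio, highest ratio, highest / lowest)."""
    low, high = min(scores), max(scores)
    return low, high, high / low

for name, scores in summary.items():
    low, high, factor = spread(scores)
    low_name = BENCHMARKS[scores.index(low)]
    high_name = BENCHMARKS[scores.index(high)]
    print(f"{name:12s} {low:6.2f} ({low_name}) to {high:6.2f} ({high_name}), {factor:.2f}x apart")
```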
• Application runs on premises
• Primary requirement: memory throughput of 20K MB/sec
• What instance would work best?
1. Choose a synthetic benchmark
2. Baseline: Build, configure, tune, and run it on premises
3. Run the same test (or tests) on a set of instance types
4. Use results from the instance tests to choose the best match
www.cs.virginia.edu/stream/top20/Bandwidth.html
https://github.com/gregs1104/stream-scaling
name    kernel                 bytes/iter  FLOPS/iter
COPY:   a(i) = b(i)            16          0
SCALE:  a(i) = q*b(i)          16          1
SUM:    a(i) = b(i) + c(i)     24          1
TRIAD:  a(i) = b(i) + q*c(i)   24          2
* McCalpin, John D.: "STREAM: Sustainable Memory Bandwidth in High Performance Computers",
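The bytes/iter column is what turns a timed loop into a bandwidth figure: bytes moved per iteration, times the number of array elements, divided by elapsed time. A minimal Python sketch of that conversion follows. It assumes 8-byte elements (which matches the 16 and 24 byte counts in the table) and 1 MB = 10^6 bytes; the function name, array length, and timing are hypothetical.

```python
# (bytes moved per iteration, FLOPs per iteration) from the STREAM kernel table.
KERNELS = {
    "COPY":  (16, 0),
    "SCALE": (16, 1),
    "SUM":   (24, 1),
    "TRIAD": (24, 2),
}

def bandwidth_mb_per_sec(kernel, array_elements, seconds):
    """Sustained bandwidth for one pass of a kernel over the arrays.

    Assumes 1 MB = 10**6 bytes.
    """
    bytes_per_iter, _flops = KERNELS[kernel]
    return bytes_per_iter * array_elements / seconds / 1e6

# Hypothetical measurement: 10 million elements, TRIAD pass takes 12 ms.
print(f"{bandwidth_mb_per_sec('TRIAD', 10_000_000, 0.012):.0f} MB/sec")
```

In practice the arrays need to be much larger than the CPU caches, otherwise the result reflects cache bandwidth and not memory bandwidth.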
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
./stream | egrep \
"Number of Threads requested|Function|Triad|Failed|Expected|Observed" > $FN
./sysbench --num-threads=$TDS --test=memory run >$FN
             Stream-Triad  Geekbench Memory-Triad  sysbench (default)
m3.xlarge    23640.56      15375.64                302.95
m3.2xlarge   26046.17      14999.27                603.40
m2.xlarge    18766.58      17365.76                528.16
m2.2xlarge   22421.91      17600.00                1019.08
m2.4xlarge   19634.50      14405.82                1576.30
c3.large     11434.83      9967.96                 2116.84
c3.xlarge    21141.30      13972.65                2643.33
c3.2xlarge   30235.78      20657.49                2944.91
cc2.8xlarge  55200.86      37067.32                1195.90
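Applying the 20K MB/sec requirement to the table gives a different shortlist depending on which benchmark is trusted. The Python sketch below filters on the Stream-Triad and Geekbench Memory-Triad columns. It assumes those columns are in MB/sec, which the slides do not state for this table; the sysbench column is left out because its default settings measure something different. The function name is hypothetical.

```python
REQUIRED_MB_PER_SEC = 20000.0

# (Stream-Triad, Geekbench Memory-Triad) copied from the table above.
memory_results = {
    "m3.xlarge":   (23640.56, 15375.64),
    "m3.2xlarge":  (26046.17, 14999.27),
    "m2.xlarge":   (18766.58, 17365.76),
    "m2.2xlarge":  (22421.91, 17600.00),
    "m2.4xlarge":  (19634.50, 14405.82),
    "c3.large":    (11434.83, 9967.96),
    "c3.xlarge":   (21141.30, 13972.65),
    "c3.2xlarge":  (30235.78, 20657.49),
    "cc2.8xlarge": (55200.86, 37067.32),
}

def meets_requirement(results, column, required):
    """Instance types whose value in the given column is at least `required`."""
    return [name for name, values in results.items() if values[column] >= required]

by_stream = meets_requirement(memory_results, 0, REQUIRED_MB_PER_SEC)
by_geekbench = meets_requirement(memory_results, 1, REQUIRED_MB_PER_SEC)
print("Stream-Triad   :", by_stream)
print("Geekbench Triad:", by_geekbench)
```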
sysbench memory defaults
--memory-block-size [1K]
--memory-total-size [100G]
--memory-scope {global,local} [global]
--memory-hugetlb [off]
--memory-oper {read, write, none} [write]
--memory-access-mode {seq,rnd} [seq]
• I/O metrics
– IOPS
– Throughput
– Latency
• Test parameters:
– Read %
– Write %
– Sequential
– Random
– Queue depth
• Storage configuration
– Volume(s)
– RAID
– LVM
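The three I/O metrics and the queue depth parameter are linked by simple arithmetic: throughput is IOPS times I/O size, and, by Little's Law, average latency is queue depth divided by IOPS when the queue is kept full at steady state. The Python sketch below illustrates both relations. The function names and the 16 KiB I/O size are assumptions made for illustration; the 2,000 IOPS figure is only an example value.

```python
def throughput_mb_per_sec(iops, io_size_bytes):
    """Throughput implied by an IOPS rate and a fixed I/O size (1 MB = 10**6 bytes)."""
    return iops * io_size_bytes / 1e6

def mean_latency_usec(queue_depth, iops):
    """Average per-I/O latency from Little's Law: latency = queue depth / IOPS.

    Holds for a steady state in which `queue_depth` I/Os are always outstanding.
    """
    return queue_depth * 1e6 / iops

# Example: a volume sustaining 2,000 IOPS with 16 KiB I/Os (assumed size).
print(throughput_mb_per_sec(2000, 16 * 1024))   # MB/sec
print(mean_latency_usec(1, 2000))               # usec at queue depth 1
print(mean_latency_usec(2, 2000))               # usec at queue depth 2
```

If the IOPS rate is capped, raising the queue depth raises latency without adding throughput, which is why queue depth belongs in the list of test parameters.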
[Figure: PIOPs 2K, queue depth. Latency (usec), axis 0 to 1200, for Seq.Read, Seq.Write, MixedSeqRead, MixedSeqWrite, RandRead, RandWrite, MixedRandRead, MixedRandWrite. Series: 1D PIOPS 2K, 1D PIOPS 2K QD2, 2D PIOPS 2K, 2D PIOPS 2K QD2]
• disk copy
  cp file1 /disk1/file1
• dd
  dd if=/dev/zero of=/data1/testfile1 \
  bs=1048 count=1024000
• fio – flexible I/O tester
  fio simple.cfg
Command                                          Seconds  MB/sec
cp f1 f2                                         17.248   59.37
rm -rf f2; cp f1 f2                              0.853    1200.47
cp f1 f3                                         0.880    1164.96
dd if=/dev/zero bs=1048 count=1024000 of=d1      0.722    1419.01
dd if=/dev/urandom bs=1048 count=1024000 of=d2   79.710   12.84
fio simple.cfg                                   NA       61.55
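The MB/sec column can be checked from the dd parameters: bytes written are bs times count, and throughput is that total divided by elapsed seconds. The Python sketch below works through the two dd rows, assuming 1 MB = 1024 × 1024 bytes (the slides do not say which definition was used). The function names are hypothetical.

```python
def dd_bytes(bs, count):
    """Total bytes written by dd for a given block size and block count."""
    return bs * count

def mib_per_sec(total_bytes, seconds):
    """Throughput in MiB/sec (1 MiB = 1024 * 1024 bytes)."""
    return total_bytes / (1024 * 1024) / seconds

total = dd_bytes(1048, 1024000)   # parameters from the dd rows above
zero_rate = mib_per_sec(total, 0.722)      # source: /dev/zero
urandom_rate = mib_per_sec(total, 79.710)  # source: /dev/urandom

print(f"bytes written   : {total}")
print(f"/dev/zero rate  : {zero_rate:.1f}")
print(f"/dev/urandom    : {urandom_rate:.2f}")
print(f"zero vs urandom : {zero_rate / urandom_rate:.0f}x")
```

The /dev/urandom figure reproduces the 12.84 in the table; the /dev/zero figure comes out near, but not exactly at, the reported 1419.01, presumably because of rounding in the recorded time. Both commands write the same number of bytes to the same place, so a gap of roughly two orders of magnitude suggests the urandom run is limited by the data source and not by the disk.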
Random 1M I/O, PIOPS, 16 disks (MBps)
read 1006.73
write 904.03
r70w30 1005.91
If benchmarking your application is not practical, synthetic
benchmarks can be used if you are careful.
• Choose the best benchmark that represents your application
• Analysis – what does “best” mean?
• Run enough tests to quantify variability
• Baseline – what is a “good result”?
• Samples – keep all of your results – more is better!
tech.just-eat.com @justeat_tech
https://loadtestingtool.com
https://github.com/etsy/statsd
https://graphite.readthedocs.org