(PFC302) Performance Benchmarking on AWS | AWS re:Invent 2014

• The best benchmark

• Absolute vs. relative measures

• Fixed time or fixed work

• What’s different?

• Use a good AMI

[Chart: Average CPU result and Coefficient of Variance (0% to 60%) by AMI: Ubuntu 12.4 ami-…, AWS CentOS 5.4 ami-…, and three CentOS 5.4 ami-… images]

• Application runs on premises

• Primary requirement is integer CPU performance

• Application is complex to set up, no benchmark tests exist, limited time

• What instance would work best?

1. Choose a synthetic benchmark

2. Baseline: Build, configure, tune, and run it on premises

3. Run the same test (or tests) on a set of instance types

4. Use results from the instance tests to choose the best match
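
As a rough illustration of steps 2 through 4, the sketch below summarizes repeated benchmark scores as a mean and a coefficient of variation (C.O.V., standard deviation divided by mean), then expresses an instance's mean as a ratio to a baseline. The helper names `summarize` and `ratio` and all sample scores are hypothetical. The slides do not say whether the sample or population standard deviation was used; the sample form is assumed here.

```shell
#!/bin/sh
# summarize: print "mean cov%" for the scores given as arguments.
# Assumption: sample standard deviation (n-1); needs at least two scores.
summarize() {
    printf '%s\n' "$@" | awk '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            mean = sum / n
            var = (sumsq - n * mean * mean) / (n - 1)
            if (var < 0) var = 0          # guard against rounding noise
            printf "%.2f %.2f%%\n", mean, 100 * sqrt(var) / mean
        }'
}

# ratio: instance mean divided by baseline mean.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Hypothetical scores from three runs on one instance type.
summarize 2010 2050 2030    # prints "2030.00 0.99%"
# Hypothetical instance mean against a hypothetical on-premises baseline.
ratio 2030 1800             # prints "1.13"
```

A ratio above 1.00 means the instance outperformed the baseline on that test; the C.O.V. indicates how much weight a single run deserves.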

Integer: AES, Twofish, SHA1, SHA2, BZip2 compress, BZip2 decompress, JPEG compress, JPEG decompress, PNG compress, PNG decompress, Sobel, Lua, Dijkstra

Floating Point: Black-Scholes, Mandelbrot, Sharpen image, Blur image, SGEMM, DGEMM, SFFT, DFFT, N-Body, Ray trace

Memory: STREAM copy, STREAM scale, STREAM add, STREAM triad

ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"

TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"

./geekbench_x86_64 --no-upload >$GBTXT

Geekbench     1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
m3.xlarge     0.93        1.04%   2.04        2.31%   2.06
m3.2xlarge    0.93        1.40%   3.80        1.46%   2.08
m2.xlarge     0.80        2.84%   1.54        4.06%   1.99
m2.2xlarge    0.80        1.34%   2.82        1.21%   2.04
m2.4xlarge    0.76        2.28%   5.11        1.71%   2.01
c3.large      1.13        0.93%   1.32        0.71%   1.76
c3.xlarge     1.13        0.39%   2.51        1.81%   1.74
c3.2xlarge    1.13        0.19%   4.88        0.25%   1.70
cc2.8xlarge   1.00        0.71%   15.46       1.93%   2.21

Geekbench, m3.xlarge  1CPU ratio  C.O.V.
instance-1            0.93        0.31%
instance-2            0.97        0.23%
instance-3            0.94        0.17%
instance-4            0.94        0.10%
instance-5            0.94        0.32%
instance-6            0.94        0.10%
instance-7            0.93        0.25%
instance-8            0.93        0.38%
instance-9            0.94        0.11%
instance-10           0.94        0.09%

gb-integer    1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
c3.large      1.12        0.50%   1.37        0.43%   NA
c3.xlarge     1.13        0.38%   2.72        0.41%   NA
c3.2xlarge    1.12        0.38%   5.35        0.51%   NA
cc2.8xlarge   1.00        0.20%   17.88       3.31%   NA

geekbench     1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
c3.large      1.13        0.93%   1.32        0.71%   1.76
c3.xlarge     1.13        0.39%   2.51        1.81%   1.74
c3.2xlarge    1.13        0.19%   4.88        0.25%   1.70
cc2.8xlarge   1.00        0.71%   15.46       1.93%   2.21

ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"

TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"

./Run -c 1 -c $COPIES >$FN

UnixBench     1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
m3.xlarge     1.38        1.90%   2.49        1.36%   28.25
m3.2xlarge    1.42        1.85%   4.21        1.99%   28.29
m2.xlarge     0.40        5.82%   0.76        1.28%   28.30
m2.2xlarge    0.42        1.71%   1.23        1.75%   28.32
m2.4xlarge    0.48        3.31%   2.02        1.71%   28.34
c3.large      1.10        1.33%   1.91        1.54%   28.17
c3.xlarge     1.06        1.48%   2.85        1.26%   28.21
c3.2xlarge    1.10        0.54%   4.50        1.02%   28.96
cc2.8xlarge   1.00        2.97%   6.44        2.65%   30.20

UB-Integer    1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
c3.large      1.05        0.24%   1.10        0.30%   0.17
c3.xlarge     1.05        0.27%   2.20        0.28%   0.17
c3.2xlarge    1.05        0.07%   4.34        0.23%   0.17
cc2.8xlarge   1.00        0.10%   15.54       0.95%   0.17

UnixBench     1CPU ratio  C.O.V.  NCPU ratio  C.O.V.  RT (min)
c3.large      1.10        1.33%   1.91        1.54%   28.17
c3.xlarge     1.06        1.48%   2.85        1.26%   28.21
c3.2xlarge    1.10        0.54%   4.50        1.02%   28.96
cc2.8xlarge   1.00        2.97%   6.44        2.65%   30.20
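
One way to read the UB-Integer and UnixBench tables side by side is to divide each NCPU ratio by the matching 1CPU ratio, which gives a rough multi-copy scaling factor. The sketch below does that for rows copied from the tables above; `scaling` is a hypothetical helper, not part of UnixBench.

```shell
#!/bin/sh
# scaling ONE_CPU_RATIO N_CPU_RATIO: NCPU ratio divided by 1CPU ratio.
scaling() {
    awk -v one="$1" -v n="$2" 'BEGIN { printf "%.2f\n", n / one }'
}

# Ratios copied from the cc2.8xlarge rows above.
echo "cc2.8xlarge UB-Integer: $(scaling 1.00 15.54)"
echo "cc2.8xlarge UnixBench:  $(scaling 1.00 6.44)"

# Ratios copied from the c3.2xlarge rows above.
echo "c3.2xlarge UB-Integer:  $(scaling 1.05 4.34)"
echo "c3.2xlarge UnixBench:   $(scaling 1.10 4.50)"
```

A much lower factor for the full suite than for the integer-only subset, as on cc2.8xlarge, suggests that some of the suite's tests are limited by something other than integer throughput.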

http://www.spec.org

Benchmark        Language  Category
400.perlbench    C         Programming language
401.bzip2        C         Compression
403.gcc          C         C compiler
429.mcf          C         Combinatorial optimization
445.gobmk        C         Artificial intelligence
456.hmmer        C         Search gene sequence
458.sjeng        C         Artificial intelligence
462.libquantum   C         Physics / quantum computing
464.h264ref      C         Video compression
471.omnetpp      C++       Discrete event simulation
473.astar        C++       Path-finding algorithms
483.xalancbmk    C++       XML processing

ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"

TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"

runspec --noreportable --tune=base --size=ref --rate=$COPIES --iterations=1 \
    400 403 445 456 458 462 464 471 473 483

Est. SPECint  1CPU ratio  C.O.V.  RT (min)  NCPU ratio  C.O.V.  RT (min)
m3.xlarge     1.01        1.06%   54.39     2.24        1.15%   104.18
m3.2xlarge    1.01        1.67%   54.49     4.25        1.63%   109.22
m2.xlarge     0.76        1.97%   70.83     1.39        2.45%   85.37
m2.2xlarge    0.79        0.94%   68.85     2.76        1.24%   85.42
m2.4xlarge    0.78        0.16%   68.73     5.21        1.26%   89.91
c3.large      1.11        1.95%   50.00     1.25        1.47%   94.22
c3.xlarge     1.10        1.96%   50.29     2.39        1.28%   97.66
c3.2xlarge    1.08        0.87%   50.87     4.67        0.25%   100.22
cc2.8xlarge   1.00        0.29%   54.92     14.92       0.52%   125.74



sysbench --num-threads=$TDS --max-requests=30000 --test=cpu \
    --cpu-max-prime=100000 run > $FN

sysbench      Default  C.O.V.  RT (min)  Tuned ratio  C.O.V.  RT (min)
m3.xlarge     3.21     1.44%   0.06      1.69         1.29%   3.86
m3.2xlarge    6.41     1.38%   0.03      3.38         1.41%   1.93
m2.xlarge     1.59     0.75%   0.11      0.80         0.23%   8.16
m2.2xlarge    3.19     0.64%   0.06      1.60         0.76%   4.07
m2.4xlarge    8.83     0.62%   0.02      4.71         0.20%   1.38
c3.large      1.78     0.26%   0.10      0.91         0.09%   7.13
c3.xlarge     3.55     0.53%   0.05      1.83         0.02%   3.57
c3.2xlarge    6.55     8.45%   0.03      3.54         3.31%   1.85
cc2.8xlarge   25.34    2.30%   0.01      13.69        1.10%   0.48

              GB     GB Int  UB    UB Int  Est. SPECInt  sysbench default  sysbench tuned
m3.xlarge     2.04   2.01    2.49  1.88    2.24          3.21              1.69
m3.2xlarge    3.80   3.96    4.21  3.77    4.25          6.41              3.38
m2.xlarge     1.54   1.52    0.76  1.59    1.38          1.59              0.80
m2.2xlarge    2.82   3.02    1.23  3.19    2.76          3.19              1.60
m2.4xlarge    5.11   5.54    2.02  6.48    5.21          8.83              4.71
c3.large      1.32   1.37    1.91  1.10    1.25          1.78              0.91
c3.xlarge     2.51   2.72    2.85  2.20    2.39          3.55              1.83
c3.2xlarge    4.88   5.35    4.50  4.34    4.67          6.55              3.54
cc2.8xlarge   15.46  17.88   6.44  15.54   14.92         25.34             13.69

• Application runs on premises

• Primary requirement: memory throughput of 20K MB/sec

• What instance would work best?

1. Choose a synthetic benchmark

2. Baseline: Build, configure, tune, and run it on premises

3. Run the same test (or tests) on a set of instance types

4. Use results from the instance tests to choose the best match

www.cs.virginia.edu/stream/top20/Bandwidth.html

https://github.com/gregs1104/stream-scaling

name    kernel                 bytes/iter  FLOPS/iter
COPY:   a(i) = b(i)            16          0
SCALE:  a(i) = q*b(i)          16          1
SUM:    a(i) = b(i) + c(i)     24          1
TRIAD:  a(i) = b(i) + q*c(i)   24          2

* McCalpin, John D.: "STREAM: Sustainable Memory Bandwidth in High Performance Computers",

http://www.cs.virginia.edu/stream/top20/Bandwidth.html
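
The bytes-per-iteration column is what turns a timed loop into a bandwidth figure. A minimal sketch of that arithmetic follows, assuming the STREAM convention of 1 MB = 10^6 bytes. The array length and elapsed time are made-up values, and `stream_mbps` is a hypothetical helper rather than part of STREAM itself.

```shell
#!/bin/sh
# stream_mbps BYTES_PER_ITER ITERATIONS SECONDS
# Bandwidth in MB/s, with 1 MB = 1,000,000 bytes (assumed STREAM convention).
stream_mbps() {
    awk -v b="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", b * n / t / 1000000 }'
}

# Made-up example: TRIAD (24 bytes per iteration) over 10,000,000
# elements in 0.012 seconds.
stream_mbps 24 10000000 0.012    # prints "20000.0"
```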



              Stream-Triad  Geekbench Memory-Triad  sysbench (default)
m3.xlarge     23640.56      15375.64                302.95
m3.2xlarge    26046.17      14999.27                603.40
m2.xlarge     18766.58      17365.76                528.16
m2.2xlarge    22421.91      17600.00                1019.08
m2.4xlarge    19634.50      14405.82                1576.30
c3.large      11434.83      9967.96                 2116.84
c3.xlarge     21141.30      13972.65                2643.33
c3.2xlarge    30235.78      20657.49                2944.91
cc2.8xlarge   55200.86      37067.32                1195.90
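
To connect these results back to the 20K MB/sec requirement, the sketch below filters the Stream-Triad column for instance types that meet a threshold. It assumes the table values are in MB/sec, which the slides imply but do not state; `meets_target` is a hypothetical helper.

```shell
#!/bin/sh
# Stream-Triad column copied from the table above (units assumed MB/sec).
results='m3.xlarge 23640.56
m3.2xlarge 26046.17
m2.xlarge 18766.58
m2.2xlarge 22421.91
m2.4xlarge 19634.50
c3.large 11434.83
c3.xlarge 21141.30
c3.2xlarge 30235.78
cc2.8xlarge 55200.86'

# meets_target THRESHOLD: print instance types at or above THRESHOLD.
meets_target() {
    printf '%s\n' "$results" | awk -v min="$1" '$2 + 0 >= min + 0 { print $1 }'
}

meets_target 20000
```

With the Stream-Triad numbers, six of the nine instance types pass. Applying the same threshold to the Geekbench Memory-Triad column would leave only c3.2xlarge and cc2.8xlarge, so the choice of benchmark changes the answer.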

sysbench memory defaults

--memory-block-size [1K]

--memory-total-size [100G]

--memory-scope {global,local} [global]

--memory-hugetlb [off]

--memory-oper {read, write, none} [write]

--memory-access-mode {seq,rnd} [seq]
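
The defaults listed above (1K blocks, sequential writes) describe only one access pattern. A hedged sketch of how the block size might be varied follows; it only prints candidate command lines so they can be reviewed before anything runs. The option names are those shown above plus `--test=memory` and `--num-threads`, written in the same legacy sysbench syntax as the CPU test earlier. The block sizes, thread count, and the `memory_cmd` helper are illustrative assumptions.

```shell
#!/bin/sh
# memory_cmd BLOCK_SIZE THREADS: print a sysbench memory command line.
memory_cmd() {
    echo "sysbench --test=memory --num-threads=$2" \
         "--memory-block-size=$1 --memory-total-size=100G" \
         "--memory-oper=write --memory-access-mode=seq run"
}

# Illustrative sweep; pipe the output to sh to actually run the tests.
for bs in 1K 64K 1M; do
    memory_cmd "$bs" 4
done
```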

• I/O metrics

– IOPS

– Throughput

– Latency

• Test parameters

– Read %

– Write %

– Sequential

– Random

– Queue depth

• Storage configuration

– Volume(s)

– RAID

– LVM

[Chart: Latency (usec), 0 to 1200, for PIOPS 2K by queue depth. Workloads: Seq. Read, Seq. Write, Mixed Seq. Read, Mixed Seq. Write, Rand. Read, Rand. Write, Mixed Rand. Read, Mixed Rand. Write. Series: 1D PIOPS 2K, 1D PIOPS 2K QD2, 2D PIOPS 2K, 2D PIOPS 2K QD2]

• disk copy

• cp file1 /disk1/file1

• dd

• dd if=/dev/zero of=/data1/testfile1 \
bs=1048 count=1024000

• fio – flexible io tester

• fio simple.cfg
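
The contents of `simple.cfg` are not shown in the slides. As an illustration only, the sketch below writes a small fio job file that touches the test parameters listed earlier (read/write mix, random access, queue depth). Every value, the `/data1/fio.test` path, and the `write_fio_job` helper are assumptions to adapt, not the presenter's settings.

```shell
#!/bin/sh
# write_fio_job FILE: write an illustrative fio job file to FILE.
write_fio_job() {
    cat > "$1" <<'EOF'
[global]
ioengine=libaio
direct=1
runtime=60
time_based

[randrw-70-30]
filename=/data1/fio.test
size=1g
rw=randrw
rwmixread=70
bs=4k
iodepth=16
EOF
}

write_fio_job simple.cfg
# Then run it with: fio simple.cfg
```

Setting `direct=1` asks fio to bypass the page cache, so its numbers are less likely to be inflated by cached data than a plain cp or dd timing.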

Command                                          Seconds  MB/sec
cp f1 f2                                         17.248   59.37
rm -rf f2; cp f1 f2                              0.853    1200.47
cp f1 f3                                         0.880    1164.96
dd if=/dev/zero bs=1048 count=1024000 of=d1      0.722    1419.01
dd if=/dev/urandom bs=1048 count=1024000 of=d2   79.710   12.84
fio simple.cfg                                   NA       61.55
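
The MB/sec column is size divided by elapsed time. The sketch below reproduces the first two rows, assuming a file of about 1024 MB; that size is inferred from 17.248 s × 59.37 MB/sec and is not stated in the slides. `mb_per_sec` is a hypothetical helper.

```shell
#!/bin/sh
# mb_per_sec SIZE_MB SECONDS: throughput in MB/sec.
mb_per_sec() {
    awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.2f\n", mb / s }'
}

mb_per_sec 1024 17.248   # first copy: prints "59.37"
mb_per_sec 1024 0.853    # repeat copy of the same file: prints "1200.47"
```

The roughly twentyfold jump on the repeat copy is consistent with the source file being served from the page cache rather than from disk, which is one reason cp and dd timings need careful interpretation.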

Random 1M I/O, PIOPS 16 disk  MBps
read                          1006.73
write                         904.03
r70w30                        1005.91

If benchmarking your application is not practical, synthetic benchmarks can be used if you are careful.

• Choose the best benchmark that represents your application

• Analysis – what does “best” mean?

• Run enough tests to quantify variability

• Baseline – what is a “good result”?

• Samples – keep all of your results – more is better!
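
The points about running enough tests and keeping every result can be sketched as a small wrapper that repeats a command and writes each run to its own file. `run_samples`, the `RESULTS_DIR` variable, and the file naming are hypothetical choices; the `echo` command is only a stand-in for a real benchmark.

```shell
#!/bin/sh
# run_samples LABEL COUNT COMMAND...: run COMMAND COUNT times and keep
# every run's output in its own file, <dir>/LABEL-<n>.txt.
# The directory defaults to ./results unless RESULTS_DIR is set.
run_samples() {
    label=$1; count=$2; shift 2
    dir="${RESULTS_DIR:-results}"
    mkdir -p "$dir"
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" > "$dir/$label-$i.txt" 2>&1
        i=$((i + 1))
    done
}

# Stand-in command for illustration; substitute the real benchmark, e.g.
#   run_samples "$TYPE-$ID" 10 ./geekbench_x86_64 --no-upload
run_samples demo 3 echo "score 100"
```

Keeping one file per run, rather than overwriting a single output file, leaves the raw samples available for computing the mean and C.O.V. later.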

tech.just-eat.com @justeat_tech

https://loadtestingtool.com

https://github.com/etsy/statsd

https://graphite.readthedocs.org
