Transcript
Page 1: Understanding Android Benchmarks

Understanding Android Benchmarks

“freedom” koan-sin tan [email protected]

OSDC.tw, Taipei Apr 11th, 2014

1

Page 2: Understanding Android Benchmarks

disclaimers

• many of the materials used in this slide deck are from the Internet and textbooks, e.g., many of the following materials are from “Computer Architecture: A Quantitative Approach,” 1st ~ 5th ed

• opinions expressed here are my personal ones and do not reflect my employer's views

2

Page 3: Understanding Android Benchmarks

who am i

• did some networking and security research before

• working for a SoC company, recently on

• big.LITTLE scheduling and related stuff

• parallel construct evaluation

• run benchmarking from time to time

• to improve the performance of our products, and

• to know our colleagues' progress

3

Page 4: Understanding Android Benchmarks

• Focusing on CPU and memory parts of benchmarks

• let’s ignore graphics (2d, 3d), storage I/O, etc.

4

Page 5: Understanding Android Benchmarks

Blackbox!

• google image search for “benchmark” and you will find that many of the results are Android-related benchmarks

• Similar to the recent Cross-Strait Trade in Services Agreement (TiSA), most benchmarks on the Android platform are kind of a black box

5

Page 6: Understanding Android Benchmarks

Is Apple A7 good?

• When Apple released the new iPhone 5s, many technical blogs showed benchmarks in the reviews they came up with

• commonly used ones:

• GeekBench

• JavaScript benchmarks

• Some graphics benchmarks

• Why these? Are they the right ones? etc.

e.g., http://www.anandtech.com/show/7335/the-iphone-5s-review

6

Page 7: Understanding Android Benchmarks

open the blackbox

7

Page 8: Understanding Android Benchmarks

Android Benchmarks

8

Page 9: Understanding Android Benchmarks

No, not improvement in this way

9

http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks

Page 10: Understanding Android Benchmarks

Assuming there is no cheating, what can we do?

Page 11: Understanding Android Benchmarks

Outline

• Performance benchmark review

• Some Android benchmarks

• What we did and what still can be done

• Future

11

Page 12: Understanding Android Benchmarks

To quote what Prof. Raj Jain quoted

• Benchmark v. trans. To subject (a system) to a series of tests in order to obtain prearranged results not available on competitive systems

From: “The Devil’s DP Dictionary” S. Kelly-Bootle

12

Page 13: Understanding Android Benchmarks

Why benchmarking

• We did something good; let's check if we did it right

• comparing with our own previous results to see if we broke anything

• We want to know how good our colleagues in other places are

13

Page 14: Understanding Android Benchmarks

What to report?

• Usually, what we mean by “benchmarking” is to measure performance

• What to report?

• intuitive answer: how many things we can do in a certain period of time

• yes, time. E.g., MIPS, MFLOPS, MiB/s, bps
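
For reference (not on the original slide), the standard textbook definitions behind these two rate metrics are:

    \mathrm{MIPS} = \frac{\text{instruction count}}{\text{execution time} \times 10^{6}}
    \qquad
    \mathrm{MFLOPS} = \frac{\text{number of floating-point operations}}{\text{execution time} \times 10^{6}}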

14

Page 15: Understanding Android Benchmarks

MIPS and MFLOPS

• MIPS (Million Instructions per Second), MFLOPS (Million Floating-Point Operations per Second)

• All instructions are not created equal – CISC machine instructions usually accomplish a lot more than those of RISC machines; comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek

15

Page 16: Understanding Android Benchmarks

MIPS and what's wrong with it

• MIPS is instruction-set dependent, making it difficult to compare the MIPS of computers with different ISAs

• MIPS varies between programs on the same computer; and, most importantly,

• MIPS can vary inversely to performance
  – with hardware FP, the instruction count drops, so MIPS is generally smaller even though the program runs faster

16

Page 17: Understanding Android Benchmarks

MFLOPS and what's wrong with it

• Applies only to programs with floating-point operations

• Operations instead of instructions, but still
  – floating-point instructions are different on machines with different ISAs
  – there are fast and slow floating-point operations

• Possible solution: weight the source-code-level operation counts (a rough formula follows)
  – ADD, SUB, COMPARE: 1
  – DIVIDE, SQRT: 2
  – EXP, SIN: 4
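
Written out, that weighting amounts to a "normalized" operation count (my phrasing of the slide's scheme, not an official formula):

    \text{normalized FP operations} = N_{\text{ADD,SUB,COMPARE}} + 2\,(N_{\text{DIVIDE}} + N_{\text{SQRT}}) + 4\,(N_{\text{EXP}} + N_{\text{SIN}})

    \text{normalized MFLOPS} = \frac{\text{normalized FP operations}}{\text{execution time} \times 10^{6}}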

17

Page 18: Understanding Android Benchmarks

• The best choice of benchmarks to measure performance is real applications

18

Page 19: Understanding Android Benchmarks

Problematic benchmarks

• Kernels: small, key pieces of real applications, e.g., LINPACK

• Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort

• Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone

19

Page 20: Understanding Android Benchmarks

Why are they disreputable?

• Small, fit in cache
• Obsolete instruction mix
• Uncontrolled source code
• Prone to compiler tricks
• Short runtimes on modern machines
• Single-number performance characterization with a single benchmark

• Difficult to reproduce results (short runtime and low-precision UNIX timer)

20

Page 21: Understanding Android Benchmarks

Dhrystone

• Source
  – http://homepages.cwi.nl/~steven/dry.c

• < 1000 LoC
  – size of the CA15 binary compiled with bionic

• Instructions: ~14 KiB

   text   data    bss    dec
  13918    467  10266  24660

21

Page 22: Understanding Android Benchmarks

Whetstone

• Dhrystone is a pun on Whetstone

• Source code: http://www.netlib.org/benchmark/whetstone.c

  Test       MFLOPS    MOPS      ms
  N1 float   119.78              0.16
  N2 float   171.98              0.78
  N3 if                154.25    0.67
  N4 fixpt             397.48    0.79
  N5 cos                19.08    4.36
  N6 float    84.22              6.41
  N7 equal              86.84    2.13
  N8 exp                 5.95    6.26
  MWIPS      463.97             21.55

22

Page 23: Understanding Android Benchmarks

More on Synthetic benchmarks

• The best known examples of synthetic benchmarks are Whetstone and Dhrystone

• Problems:
  – Compiler and hardware optimizations can artificially inflate the performance of these benchmarks but not of real programs

  – The other side of the coin is that because these benchmarks are not natural programs, they don't reward optimizations of behaviors that occur in real programs

• Examples:
  – Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary (a sketch of this effect follows)

  – Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs

  – Some more discussion in the 1st edition of the textbook
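
To make the compiler-trick point concrete, here is a hypothetical fragment (my example, not actual Dhrystone code): the loop's result never escapes the function, so an optimizing compiler (e.g., gcc -O2) can delete the whole loop, inflating the "score" without making any real program faster.

#include <stdio.h>

/* Hypothetical benchmark-like fragment: the loop has no observable
 * effect, so dead-code elimination may remove it entirely. */
static int bench_loop(int n)
{
    int scratch = 0;
    for (int i = 0; i < n; i++)
        scratch += i;   /* value is dead: nothing below uses it */
    return n;           /* compiler is free to drop the whole loop */
}

int main(void)
{
    printf("%d\n", bench_loop(1000000));
    return 0;
}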

23

Page 24: Understanding Android Benchmarks

LINPACK

• LINPACK: a floating-point benchmark from the manual of the LINPACK library

• Source
  – http://www.netlib.org/benchmark/linpackc
  – http://www.netlib.org/benchmark/linpackc.new

• 883 LoC
  – size of the CA15 binary compiled with bionic

• Instructions: ~13 KiB

   text   data   bss    dec
  12670    408     0  13086

24

Page 25: Understanding Android Benchmarks

25

Page 26: Understanding Android Benchmarks

• CoreMark is a benchmark that aims to measure the performance of central processing units (CPUs) used in embedded systems. It was developed in 2009 by Shay Gal-On at EEMBC and is intended to become an industry standard, replacing the antiquated Dhrystone benchmark

• The code is written in C and contains implementations of the following algorithms:
  – linked list processing,
  – matrix (mathematics) manipulation (common matrix operations),
  – state machine (determine if an input stream contains valid numbers), and

  – CRC (a minimal sketch of the idea follows)

• from wikipedia
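
Not CoreMark's actual code, but a minimal sketch of the CRC idea it relies on: checksum the benchmark's results so that wrong, corrupted, or optimized-away work is detected. The function and data names here are mine.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Bitwise CRC-16 (reflected 0x8005 polynomial, i.e. CRC-16/ARC). */
static uint16_t crc16_update(uint16_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001) : (uint16_t)(crc >> 1);
    return crc;
}

int main(void)
{
    const uint8_t results[] = {1, 2, 3, 4};        /* stand-in for benchmark output */
    uint16_t crc = 0;
    for (size_t i = 0; i < sizeof results; i++)
        crc = crc16_update(crc, results[i]);
    printf("crc = 0x%04x\n", (unsigned)crc);       /* compare against a known value */
    return 0;
}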

26

Page 27: Understanding Android Benchmarks

CoreMark  (2/2)

name LoC

core_list_join.c 496

core_matrix.c 308

core_stat.c 277

core_util.c 210

• CoreMark vs. Dhrystone
  – reporting rules
  – use of library calls, e.g., malloc(), is avoided
  – CRCs make sure the data are correct

• However, CoreMark is a kernel + synthetic benchmark and still has quite a small footprint

   text   data   bss    dec
  18632    456    20  19108

27

Page 28: Understanding Android Benchmarks

So?

• To overcome the danger of placing eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of the performance of processors with a variety of applications

• Standard Performance Evaluation Corporation (SPEC)

28

Page 29: Understanding Android Benchmarks

29

Page 30: Understanding Android Benchmarks

Why CPU2000 in the 2010s?

• Why ARM sticks with SPEC CPU2000 instead of CPU2006
  – 1999 Q4 results, the earliest available CPU2000 results (http://www.spec.org/cpu2000/results/res1999q4/)
    • CINT2000 base: 133 – 424
    • CFP2000 base: 126 – 514

  – 2005 Opteron 144, 1.8 GHz
    • 1,440 (a CA15 at 1.9 GHz, as reported by NVIDIA, is 1,168)

  – CPU2006 requires much more DRAM; 1 GiB of DRAM is not enough

                  CA9    CA7    CA15   Krait
  SPECint 2000    356    320    537    326
  SPECfp 2000     298    236    567    350

  All normalized to 1.0 GHz

30

Page 31: Understanding Android Benchmarks

SPEC numbers from Quantitative Approach, 5th Edition

31

Page 32: Understanding Android Benchmarks

How long does SPEC CPU2000 take?

• About 1 hr to compile
• Runtime: sum of the base runtimes multiplied by 3
  – E.g., 1.7 GHz CA15: (2256 + 3229) x 3 = 16,455 s ≈ 4.57 hr

  – For 1.0 GHz: 4.57 x 1.7 = 7.77 hr

  – For CA7, assuming it is twice as slow: 7.77 x 2 = 15.54 hr

  Benchmark           Reference Time  Base Runtime  Base Ratio
  164.gzip            1400            215           652
  175.vpr             1400            198           707
  176.gcc             1100            94.8          1161
  181.mcf             1800            266           677
  186.crafty          1000            118           850
  197.parser          1800            291           619
  252.eon             1300            87.8          1480
  253.perlbmk         1800            172           1045
  254.gap             1100            107           1026
  255.vortex          1900            211           899
  256.bzip2           1500            203           740
  300.twolf           3000            399           752
  SPECint_base2000                    2256          854

  Benchmark           Reference Time  Base Runtime  Base Ratio
  168.wupwise         1600            162           991
  171.swim            3100            389           797
  172.mgrid           1800            339           532
  173.applu           2100            241           870
  177.mesa            1400            112           1254
  178.galgel          2900            201           1444
  179.art             2600            195           1332
  183.equake          1300            157           828
  187.facerec         1900            183           1036
  188.ammp            2200            353           623
  189.lucas           2000            134           1491
  191.fma3d           2100            212           988
  200.sixtrack        1100            241           456
  301.apsi            2600            310           839
  SPECfp_base2000     435             3229          909.6

32

Page 33: Understanding Android Benchmarks

Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.

33

Page 34: Understanding Android Benchmarks

EEMBC

• Embedded Microprocessor Benchmark Consortium (EEMBC): 41 kernels used to predict the performance of different embedded applications:
  – Automotive/industrial
  – Consumer
  – Networking
  – Office automation
  – Telecommunication

• The 3rd edition of the textbook showed some EEMBC results; the 4th edition changed its mind
  • Unmodified performance and “full-fury” performance
  • Kernels, reporting options

  – Not a good predictor of the relative performance of different embedded computers

34

Page 35: Understanding Android Benchmarks

Report benchmark results

• Reproducible
  – Machine configuration (hardware, software (OS, compiler, etc.))

• Summarizing results
  – You should not add up different numbers
  – Some use a weighted average
  – Ratio, compared with a reference machine

• Geometric ratio
  – The geometric mean of the ratios is the same as the ratio of the geometric means

  – The ratio of the geometric means is equal to the geometric mean of the performance ratios

35

Page 36: Understanding Android Benchmarks

Geometric  mean
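
The formula on this slide did not survive the transcript; for reference, the geometric mean of n performance ratios, and the property quoted on the previous slide, are:

    \mathrm{GM}(r_1, \dots, r_n) = \left( \prod_{i=1}^{n} r_i \right)^{1/n} = \exp\!\left( \frac{1}{n} \sum_{i=1}^{n} \ln r_i \right)
    \qquad
    \frac{\mathrm{GM}(a_i)}{\mathrm{GM}(b_i)} = \mathrm{GM}\!\left( \frac{a_i}{b_i} \right)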

36

Page 37: Understanding Android Benchmarks

• Fallacy: Benchmarks remain valid indefinitely
  – Ability to resist “benchmark engineering” or “benchmarketing”

  – gcc is the only survivor from SPEC89
    • Almost 70% of all programs from SPEC2000 or earlier were dropped from the next release

37

Page 38: Understanding Android Benchmarks

Other benchmarks

• Stream
  – to test memory bandwidth
  – it also tests floating-point performance
  – options for the floating-point (double, 8 bytes) array

  • copy, scale, add, triad (see the bandwidth formula after the table below)

• lmbench
  – a micro-benchmark to measure software/hardware overheads from a software perspective

  – lmbench paper (1996), http://www.bitmover.com/lmbench/lmbench-usenix.pdf

  name   kernel                bytes/iter  FLOPS/iter
  COPY   a(i) = b(i)           16          0
  SCALE  a(i) = q*b(i)         16          1
  SUM    a(i) = b(i) + c(i)    24          1
  TRIAD  a(i) = b(i) + q*c(i)  24          2
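
For reference, STREAM converts those per-iteration byte counts into the reported bandwidth roughly like this (it takes the best, i.e. minimum, time over the timed repetitions):

    \text{MB/s} = 10^{-6} \times \frac{\text{bytes/iter} \times \text{STREAM\_ARRAY\_SIZE}}{\min_{k} t_k}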

38

Page 39: Understanding Android Benchmarks

Stream 5.10

for (k=0; k<NTIMES; k++) {
    times[0][k] = mysecond();
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        c[j] = a[j];
    times[0][k] = mysecond() - times[0][k];

    times[1][k] = mysecond();
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        b[j] = scalar*c[j];
    times[1][k] = mysecond() - times[1][k];

    times[2][k] = mysecond();
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        c[j] = a[j]+b[j];
    times[2][k] = mysecond() - times[2][k];

    times[3][k] = mysecond();
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        a[j] = b[j]+scalar*c[j];
    times[3][k] = mysecond() - times[3][k];
}

39

Page 40: Understanding Android Benchmarks

lmbench

• lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking

40

Page 41: Understanding Android Benchmarks

Parallel? Let's look at other SPEC benchmarks

• SPECapc for 3ds Max™ 2011, performance evaluation software for systems running Autodesk 3ds Max 2011.

• SPECapcSM for Lightwave 3D 9.6, performance evaluation software for systems running NewTek LightWave 3D v9.6 software.

• SPECjbb2005, evaluates the performance of server-side Java by emulating a three-tier client/server system (with emphasis on the middle tier).

• SPECjEnterprise2010, a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers.

• SPECjms2007, Java Message Service performance

• SPECjvm2008, measuring basic Java performance of a Java Runtime Environment on a wide variety of both client and server systems.

• SPECapc, performance of several 3D-intensive popular applications on a given system

• SPEC MPI2007, for evaluating performance of parallel systems using MPI (Message Passing Interface) applications.

• SPEC OMP2001 V3.2, for evaluating performance of parallel systems using OpenMP (http://www.openmp.org) applications.

• SPECpower_ssj2008, evaluates the energy efficiency of server systems.

• SPECsfs2008, file server throughput and response time supporting both NFS and CIFS protocol access

• SPECsip_Infrastructure2011, SIP server performance

• SPECviewperf 11, performance of an OpenGL 3D graphics system, tested with various rendering tasks from real applications

• SPECvirt_sc2010 ("SPECvirt"), evaluates the performance of datacenter servers used in virtualized server consolidation

41

Page 42: Understanding Android Benchmarks

PARSEC

• The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip multiprocessors

• Didn't really use it yet
• http://parsec.cs.princeton.edu/

Parallelization model:

Workload       Pthreads   OpenMP   Intel TBB

blackscholes Yes Yes Yes

bodytrack Yes Yes Yes

canneal Yes No No

dedup Yes No No

facesim Yes No No

ferret Yes No No

fluidanimate Yes No Yes

freqmine No Yes No

raytrace Yes No No

streamcluster Yes No Yes

swaptions Yes No Yes

vips Yes No No

x264 Yes No No

42

Page 43: Understanding Android Benchmarks

Are Dhrystone and the like useful?

• Yes, if you know their limitations

• Don't do marketing as if those benchmark scores meant real user-perceived performance

43

Page 44: Understanding Android Benchmarks

A7 Dhrystone

             iPhone 5s   iPhone 5s 32-bit   CA15   CA7    Krait 400
  DMIPS/MHz  7.47        5.70               2.71   1.67   2.46

44

Page 45: Understanding Android Benchmarks

A7 linpack MFLOPS

              iPhone 5s   iPhone 5s 32-bit   CA15   CA7   Krait 400
  MFLOPS/GHz  722         723                449    119   299

45

Page 46: Understanding Android Benchmarks

A7 CoreMark

                iPhone 5s   iPhone 5s 32-bit   CA15   CA7    Krait 400
  CoreMark/MHz  5.72        4.45               3.67   2.46   3.30

46

Page 47: Understanding Android Benchmarks

Different items

• Example: GeekBench 3

• An arithmetic mean with different weights? How are the weights chosen?

• Good properties of the geometric mean

47

Page 48: Understanding Android Benchmarks

Source code

• So far everything we have talked about is software with source code available, either publicly and freely (e.g., Dhrystone) or for a small amount of money (e.g., SPEC CPU)

48

Page 49: Understanding Android Benchmarks

• Benchmark scores/results usually depend on the compiler, compiler flags, processor, and system

49

Page 50: Understanding Android Benchmarks

Outline

• Performance benchmark review

• Some Android benchmarks

• What we did and what still can be done

• Future

50

Page 51: Understanding Android Benchmarks

Back to Android

• What kinds of benchmarks are available or used to compare performance?

• Apps with native benchmarks: Antutu, GeekBench

• Java apps, e.g., Quadrant

• Hybrid: with both native and Java, e.g., AndEBench and CF-Bench

• We also use SPEC CPU2000 and other stuff internally

51

Page 52: Understanding Android Benchmarks

Ars Technica List

arrayOfPackageInfo[0] = new PackageInfo("com.aurorasoftworks.quadrant.ui.standard", false);
arrayOfPackageInfo[1] = new PackageInfo("com.aurorasoftworks.quadrant.ui.advanced", false);
arrayOfPackageInfo[2] = new PackageInfo("com.aurorasoftworks.quadrant.ui.professional", false);
arrayOfPackageInfo[3] = new PackageInfo("com.redlicense.benchmark.sqlite", false);
arrayOfPackageInfo[4] = new PackageInfo("com.antutu.ABenchMark", false);
arrayOfPackageInfo[5] = new PackageInfo("com.greenecomputing.linpack", false);
arrayOfPackageInfo[6] = new PackageInfo("com.greenecomputing.linpackpro", false);
arrayOfPackageInfo[7] = new PackageInfo("com.glbenchmark.glbenchmark27", false);
arrayOfPackageInfo[8] = new PackageInfo("com.glbenchmark.glbenchmark25", false);
arrayOfPackageInfo[9] = new PackageInfo("com.glbenchmark.glbenchmark21", false);
arrayOfPackageInfo[10] = new PackageInfo("ca.primatelabs.geekbench2", false);
arrayOfPackageInfo[11] = new PackageInfo("com.eembc.coremark", false);
arrayOfPackageInfo[12] = new PackageInfo("com.flexycore.caffeinemark", false);
arrayOfPackageInfo[13] = new PackageInfo("eu.chainfire.cfbench", false);
arrayOfPackageInfo[14] = new PackageInfo("gr.androiddev.BenchmarkPi", false);
arrayOfPackageInfo[15] = new PackageInfo("com.smartbench.twelve", false);
arrayOfPackageInfo[16] = new PackageInfo("com.passmark.pt_mobile", false);
arrayOfPackageInfo[17] = new PackageInfo("se.nena.nenamark2", false);
arrayOfPackageInfo[18] = new PackageInfo("com.samsung.benchmarks", false);
arrayOfPackageInfo[19] = new PackageInfo("com.samsung.benchmarks:db", false);
arrayOfPackageInfo[20] = new PackageInfo("com.samsung.benchmarks:es1", false);
arrayOfPackageInfo[21] = new PackageInfo("com.samsung.benchmarks:es2", false);
arrayOfPackageInfo[22] = new PackageInfo("com.samsung.benchmarks:g2d", false);
arrayOfPackageInfo[23] = new PackageInfo("com.samsung.benchmarks:fs", false);
arrayOfPackageInfo[24] = new PackageInfo("com.samsung.benchmarks:ks", false);
arrayOfPackageInfo[25] = new PackageInfo("com.samsung.benchmarks:cpu

CPU and Memory related: Quadrant, Antutu, linpack, GeekBench, AndEBench (coremark), CaffeineMark, BenchmarkPi, PassMark, Samsung's benchmarks

52

Page 53: Understanding Android Benchmarks

Antutu 3.x

• CPU: integer, floating point

• Memory: RAM

• Graphics: 2D, 3D

• I/O: database, SD read, SD write

• What are you benchmarking?

• What's your workload?

• How are the scores calculated?

53

Page 54: Understanding Android Benchmarks

What on earth are they doing?

• Actually, there is no publicly available information

• But, with good enough background knowledge and proper tools (we'll talk about these later), we can figure it out

• It turns out most of them are from BYTE's nbench (http://en.wikipedia.org/wiki/NBench)

54

Page 55: Understanding Android Benchmarks

AnTuTu 3.x CPU and Memory Tests

  nbench item       Used by Antutu  Antutu part  Antutu % on progress bar  Order  nbench category
  NUMERIC SORT      yes             Integer      27%                       4      integer
  STRING SORT       yes             RAM          1%                        1      memory
  BITFIELD          yes             RAM          1%                        2      memory
  FP EMULATION      no
  FOURIER           yes             floating     47%                       7      floating point
  ASSIGNMENT        yes             RAM          8%                        3      memory
  IDEA              yes             Integer      27%                       5      integer
  HUFFMAN           yes             Integer      34%                       6      integer
  NEURAL NET        no
  LU DECOMPOSITION  no

55

Page 56: Understanding Android Benchmarks

A closer look

▪ RAM
  – String sort:
    • string heap sort: StrHeapSort()
    • MoveMemory() → memmove()
  – Bit Field:
    • bit field test: DoBitops()
  – Assignment:
    • task assignment test: DoAssignment()

▪ Integer
  – Numeric sort:
    • numeric heap sort: NumHeapSort()
  – IDEA:
    • IDEA encryption and decryption: cipher_idea()
  – Huffman:
    • Huffman encoding

▪ Floating point:
  – Fourier:
    • Fourier transform: pow(), sin(), cos()

56

Page 57: Understanding Android Benchmarks

String Sort in NBench

• Sorts an array of strings of arbitrary length

• Tests memory movement performance

• Non-sequential performance of cache, with the added burden that moves are byte-wide and can occur on odd address boundaries

for (i = top; i > 0; --i) {
    strsift(optrarray, strarray, numstrings, 0, i);

    /* temp = string[0] */
    tlen = *strarray;
    MoveMemory((farvoid *)&temp[0],        /* Perform exchange */
               (farvoid *)strarray,
               (unsigned long)(tlen + 1));

    /* string[0] = string[i] */
    tlen = *(strarray + *(optrarray + i));
    stradjust(optrarray, strarray, numstrings, 0, tlen);
    MoveMemory((farvoid *)strarray,
               (farvoid *)(strarray + *(optrarray + i)),
               (unsigned long)(tlen + 1));

    /* string[i] = temp */
    tlen = temp[0];
    stradjust(optrarray, strarray, numstrings, i, tlen);
    MoveMemory((farvoid *)(strarray + *(optrarray + i)),
               (farvoid *)&temp[0],
               (unsigned long)(tlen + 1));
}

57

Page 58: Understanding Android Benchmarks

Bit field in NBench

• Executes 3 bit manipulation functions

• Exercises "bit twiddling" performance. Travels through memory bit-by-bit in a sequential fashion; different from the sorts in that data is merely altered in place

• Operations:
  • Set: OR 1
  • Clear: AND 0
  • Toggle: XOR

• Set, clear: ToggleBitRun()
• Toggle: FlipBitRun()

static void ToggleBitRun(farulong *bitmap,  /* Bitmap */
                         ulong bit_addr,    /* Address of bits to set */
                         ulong nbits,       /* # of bits to set/clr */
                         uint val)          /* 1 or 0 */
{
    unsigned long bindex;   /* Index into array */
    unsigned long bitnumb;  /* Bit number */

    while (nbits--) {
#ifdef LONG64
        bindex = bit_addr >> 6;    /* Index is number /64 */
        bitnumb = bit_addr % 64;   /* Bit number in word */
#else
        bindex = bit_addr >> 5;    /* Index is number /32 */
        bitnumb = bit_addr % 32;   /* Bit number in word */
#endif
        if (val)
            bitmap[bindex] |= (1L << bitnumb);
        else
            bitmap[bindex] &= ~(1L << bitnumb);
        bit_addr++;
    }
    return;
}

58

Page 59: Understanding Android Benchmarks

Assignment in NBench

• The test moves through large integer arrays in both row-wise and column-wise fashion. Cache/memory with good sequential performance should see a boost (memory is altered in place -- no moving as in a sort operation)

• Yes, basically, sequential array assignment with some kind of table look-ups

/*
** Step through rows. For each one that is not currently
** assigned, see if the row has only one zero in it. If so,
** mark that as an assigned row/col. Eliminate other zeros
** in the same column.
*/
for (i = 0; i < ASSIGNROWS; i++) {
    numzeros = 0;
    for (j = 0; j < ASSIGNCOLS; j++)
        if (tableau[i][j] == 0L)
            if (assignedtableau[i][j] == 0) {
                numzeros++;
                selected = j;
            }
    if (numzeros == 1) {
        numassigns++;
        totnumassigns++;
        assignedtableau[i][selected] = 1;
        for (k = 0; k < ASSIGNROWS; k++)
            if ((k != i) && (tableau[k][selected] == 0))
                assignedtableau[k][selected] = 2;
    }
}

59

Page 60: Understanding Android Benchmarks

Numeric Sort in NBench

• Sorts an array of long integers with heap sort

• Generic integer performance. Should exercise non-sequential performance of cache (or memory if cache is less than 8K). Moves 32-bit longs at a time, so 16-bit processors will be at a disadvantage

static void NumHeapSort(farlong *array,
                        ulong bottom,   /* Lower bound */
                        ulong top)      /* Upper bound */
{
    ulong temp;  /* Used to exchange elements */
    ulong i;     /* Loop index */

    /*
    ** First, build a heap in the array
    */
    for (i = (top/2L); i > 0; --i)
        NumSift(array, i, top);

    /*
    ** Repeatedly extract maximum from heap and place it at the
    ** end of the array. When we get done, we'll have a sorted
    ** array.
    */
    for (i = top; i > 0; --i) {
        NumSift(array, bottom, i);
        temp = *array;          /* Perform exchange */
        *array = *(array + i);
        *(array + i) = temp;
    }
    return;
}

60

Page 61: Understanding Android Benchmarks

IDEA Encryption in NBench

• IDEA: a new block cipher when nbench was in development

• Moves through data sequentially in 16-bit chunks

static void cipher_idea(u16 in[4],
                        u16 out[4],
                        register IDEAkey Z)
{
    register u16 x1, x2, x3, x4, t1, t2;
    /* register u16 t16;
       register u16 t32; */
    int r = ROUNDS;

    x1 = *in++;
    x2 = *in++;
    x3 = *in++;
    x4 = *in;

    do {
        MUL(x1, *Z++);
        x2 += *Z++;
        x3 += *Z++;
        MUL(x4, *Z++);

        t2 = x1 ^ x3;
        MUL(t2, *Z++);
        t1 = t2 + (x2 ^ x4);
        MUL(t1, *Z++);
        t2 = t1 + t2;

        x1 ^= t1;
        x4 ^= t2;

        t2 ^= x2;
        x2 = x3 ^ t1;
        x3 = t2;
    } while (--r);
    MUL(x1, *Z++);
    *out++ = x1;
    *out++ = x3 + *Z++;
    *out++ = x2 + *Z++;
    MUL(x4, *Z);
    *out = x4;
    return;
}

61

Page 62: Understanding Android Benchmarks

Huffman in NBench

• Everybody knows Huffman code, right?

• A combination of byte operations, bit twiddling, and overall integer manipulation

.....
/*
** Huffman tree built...compress the plaintext
*/
bitoffset = 0L;                 /* Initialize bit offset */
for (i = 0; i < arraysize; i++) {
    c = (int)plaintext[i];      /* Fetch character */
    /*
    ** Build a bit string for byte c
    */
    bitstringlen = 0;
    while (hufftree[c].parent != -2) {
        if (hufftree[hufftree[c].parent].left == c)
            bitstring[bitstringlen] = '0';
        else
            bitstring[bitstringlen] = '1';
        c = hufftree[c].parent;
        bitstringlen++;
    }
.....

62

Page 63: Understanding Android Benchmarks

Fourier in NBench

• No, not FFT,

• Good measure of transcendental and trigonometric performance of FPU. Little array activity, so this test should not be dependent of cache or memory architecture

static double thefunction(double x,       /* Independent variable */
                          double omegan,  /* Omega * term */
                          int select)     /* Choose term */
{
    /*
    ** Use select to pick which function we call.
    */
    switch (select) {
    case 0:
        return (pow(x + (double)1.0, x));
    case 1:
        return (pow(x + (double)1.0, x) * cos(omegan * x));
    case 2:
        return (pow(x + (double)1.0, x) * sin(omegan * x));
    }

63

Page 64: Understanding Android Benchmarks

Neural Net in NBench

• A small back-propagation neural network simulator

• Small-array floating-point test heavily dependent on the exponential function; less dependent on overall FPU performance

64

Page 65: Understanding Android Benchmarks

LU Decomposition in NBench

• LU Decomposition

• Yes, the LU decomposition you learned in linear algebra

• A floating-point test that moves through arrays in both row-wise and column-wise fashion. Exercises only fundamental math operations (+, -, *, /)

65

Page 66: Understanding Android Benchmarks

GeekBench

• A cross-platform one

• The only publicly available one we could use to compare Android, iOS, and other platforms

• Quite clearly described test items

• http://support.primatelabs.com/kb/geekbench/geekbench-3-benchmarks

• Explaining how to interpret results

• http://support.primatelabs.com/kb/geekbench/interpreting-geekbench-3-scores

• Source code available if you pay

66

Page 67: Understanding Android Benchmarks

Vellamo

• HTML5

• Metal: Dhrystone, Linpack, Branch-K, Stream 5.9, RamJam, Storage

• some are well-known; some are written by Quic?

• Anyway, all of them are described at http://www.quicinc.com/vellamo/test-descriptions/

67

Page 68: Understanding Android Benchmarks

CFBench

• Used by some people because

• it tests both Java and native code, and

• its author is quite active in the xda-developers forum

• Some problems

• no good description of the tests

• some code is wrong, e.g.,

• its Native Memory Read test is not testing memory reads, because the malloc()ed array is never initialized (see the sketch below)
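
A hypothetical sketch of that pitfall (my code, not CFBench's source): on Linux/Android, anonymous pages that have never been written can all be backed by the shared zero page, so "reading" an untouched malloc()ed buffer does not exercise DRAM the way a real read test should.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    size_t n = 64 * 1024 * 1024;
    unsigned char *buf = malloc(n);
    if (!buf) return 1;

    /* memset(buf, 1, n);   <-- without this write, the pages are never
                                 committed, so the loop below is not a
                                 fair memory-read test */
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i += 64)   /* one access per cache line */
        sum += buf[i];

    printf("%lu\n", sum);                /* keep the result live */
    free(buf);
    return 0;
}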

68

Page 69: Understanding Android Benchmarks

Outline

• Performance benchmark review

• Some Android benchmarks

• What we did and what still can be done

• Future

69

Page 70: Understanding Android Benchmarks

How do we improve benchmark

performance

70

Page 71: Understanding Android Benchmarks

• In the good old days, we had source code; we compiled and ran the benchmark programs

• In the current Android ecosystem

• usually we don't have the source

• Profiling: oprofile, perf, DS-5

• profiling sometimes doesn't report the real bottleneck function, e.g., static functions are usually inlined and don't have symbols in shipped binaries

• binutils: nm, readelf, objdump, gdb

• Improving the libraries, e.g., libc and libm, and the runtime system, e.g., the JIT of Dalvik, used by those benchmarks (a sketch of such a check follows)
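
As an example of the kind of quick before/after check for a library change (a minimal sketch under my own assumptions, not our internal tooling): time memcpy() over a fixed, pre-touched buffer and report MiB/s.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t n = 8 * 1024 * 1024;
    int iters = 200;
    char *src = malloc(n), *dst = malloc(n);
    if (!src || !dst) return 1;
    memset(src, 0x5a, n);                 /* touch pages before timing */
    memset(dst, 0, n);

    double t0 = now_sec();
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, n);
    double dt = now_sec() - t0;

    printf("memcpy: %.1f MiB/s\n", (double)n * iters / dt / (1024 * 1024));
    free(src); free(dst);
    return 0;
}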

71

Page 72: Understanding Android Benchmarks

Antutu 3.x

• memmove() in bionic --> bcopy() in C

• rewrite it with NEON assembly code

• pow(), sin(), cos() in C

• rewrite them in assembly

72

Page 73: Understanding Android Benchmarks

bcopy() in bionic

• MoveMemory() in nbench -> memmove() in bionic -> bcopy() in bionic

• memcpy() assembly in bionic and there are processor specific ones (CA9, CA15, Krait). NEON (vector load/store) helps

• not for bcopy()

in bionic/libc/bionic/memmove.c:

void *memmove(void *dst, const void *src, size_t n)
{
    const char *p = src;
    char *q = dst;
    /* We can use the optimized memcpy if the source and destination
     * don't overlap.
     */
    if (__builtin_expect(((q < p) && ((size_t)(p - q) >= n)) ||
                         ((p < q) && ((size_t)(q - p) >= n)), 1)) {
        return memcpy(dst, src, n);
    } else {
        bcopy(src, dst, n);
        return dst;
    }
}

in bionic/libc/string/bcopy.c:

/*
 * Copy a block of memory, handling overlap.
 * This is the routine that actually implements
 * (the portable versions of) bcopy, memcpy, and memmove.
 */
#ifdef MEMCOPY
void *
memcpy(void *dst0, const void *src0, size_t length)
#else
#ifdef MEMMOVE
void *
memmove(void *dst0, const void *src0, size_t length)
#else
void
bcopy(const void *src0, void *dst0, size_t length)
#endif
#endif
{
.....

73

Page 74: Understanding Android Benchmarks

Antutu 3.x

• For people with source code

• The selection of toolchain and compiler options can make a huge difference, e.g., for the bit field test

• Some x86 builds of Antutu 3.x were compiled with the Intel compiler; the bit-by-bit operations were turned into word-wide (32-bit) operations, and the speedup is about 70x (a sketch of that kind of transformation follows)
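
A sketch of that kind of rewrite (my illustration, not Antutu's or Intel's actual code): setting a run of bits one at a time versus whole 32-bit words at a time. The word-wide version touches roughly 1/32 as many memory locations.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void set_bits_slow(uint32_t *bitmap, unsigned long first, unsigned long nbits)
{
    for (unsigned long b = first; b < first + nbits; b++)
        bitmap[b >> 5] |= 1u << (b % 32);      /* one read-modify-write per bit */
}

static void set_bits_fast(uint32_t *bitmap, unsigned long first, unsigned long nbits)
{
    unsigned long b = first, end = first + nbits;
    while (b < end && (b % 32) != 0) {         /* leading partial word */
        bitmap[b >> 5] |= 1u << (b % 32);
        b++;
    }
    while (end - b >= 32) {                    /* full 32-bit words at once */
        bitmap[b >> 5] = 0xFFFFFFFFu;
        b += 32;
    }
    while (b < end) {                          /* trailing partial word */
        bitmap[b >> 5] |= 1u << (b % 32);
        b++;
    }
}

int main(void)
{
    uint32_t a[8] = {0}, b[8] = {0};
    set_bits_slow(a, 5, 100);
    set_bits_fast(b, 5, 100);
    printf("%s\n", memcmp(a, b, sizeof a) == 0 ? "same result" : "mismatch");
    return 0;
}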

74

Page 75: Understanding Android Benchmarks

The Stream copy loop is usually turned into memcpy() by the compiler (see the sketch below)
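
The STREAM copy kernel is just the loop below; a compiler may recognize the idiom and emit a call to the libc memcpy() instead (GCC can do this at -O3 via -ftree-loop-distribute-patterns), so the reported "copy" bandwidth often reflects bionic's hand-tuned memcpy rather than compiler-generated loads and stores. Sketch only:

#include <stddef.h>

void stream_copy(double *c, const double *a, size_t n)
{
    for (size_t j = 0; j < n; j++)   /* plain block copy: a memcpy idiom */
        c[j] = a[j];
}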

75

Page 76: Understanding Android Benchmarks

remote gdb

1. get /system/bin/app_process and /system/bin/linker of the target system and the necessary shared libraries, e.g., /data/data/eu.chainfire.cfbench/lib/libCFBench.so

   • adb pull /system/bin/app_process

   • adb pull /system/bin/linker lib/armeabi-v7a/

   • adb pull /data/data/eu.chainfire.cfbench/lib/libCFBench.so lib/armeabi-v7a/

2. arm-linux-gnueabi-gdb ./app_process

3. on the target device, attach gdbserver to the running process you want to debug

   • ./gdbserver --attach :5039 3484

4. set the shared library search path

   • (gdb) set solib-search-path /Users/freedom/tmp/cfbench/lib/armeabi-v7a

5. ‘adb forward tcp:5039 tcp:5039’ and set the remote target

   • (gdb) target remote :5039

6. you can set breakpoints, print backtraces, disassemble, etc.

76

Page 77: Understanding Android Benchmarks

• (gdb) b Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned

• (gdb) disassemble

Dump of assembler code for function Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned:
   0x74b65848 <+0>:   stmdb  sp!, {r4, r5, r6, r7, r8, r9, r10, lr}
=> 0x74b6584c <+4>:   bl     0x74b654ac <loadLib>
   0x74b65850 <+8>:   mov.w  r0, #1048576   ; 0x100000
   0x74b65854 <+12>:  blx    0x74b65358
   0x74b65858 <+16>:  movs   r6, #0
   0x74b6585a <+18>:  movw   r9, #9999      ; 0x270f
   0x74b6585e <+22>:  mov    r8, r0
   0x74b65860 <+24>:  bl     0x74b6547c <getTickCount>
   0x74b65864 <+28>:  add.w  r5, r8, #1048576  ; 0x100000
   0x74b65868 <+32>:  mov    r10, r0
   0x74b6586a <+34>:  mov    r3, r8
   0x74b6586c <+36>:  ldr.w  r2, [r3], #4
   0x74b65870 <+40>:  cmp    r3, r5
   0x74b65872 <+42>:  add    r4, r2
   0x74b65874 <+44>:  bne.n  0x74b6586c <Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned+36>
   0x74b65876 <+46>:  bl     0x74b6547c <getTickCount>
   0x74b6587a <+50>:  adds   r6, #1
   0x74b6587c <+52>:  rsb    r7, r10, r0
   0x74b65880 <+56>:  cmp    r7, r9
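
Reading the disassembly back into C, the inner loop is roughly the following (a reconstruction of mine, not CFBench's actual source): a sequential 32-bit read-and-accumulate pass over a 1 MiB buffer, repeated until about 10 seconds (the 9999 compared against the tick delta) have elapsed.

static unsigned int read_aligned_once(const unsigned int *buf)
{
    const unsigned int *p = buf;
    const unsigned int *end = buf + 0x100000 / sizeof(unsigned int);  /* 1 MiB of data */
    unsigned int sum = 0;
    while (p < end)
        sum += *p++;          /* ldr.w r2, [r3], #4 ; add r4, r2 */
    return sum;
}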

77

Page 78: Understanding Android Benchmarks

Quadrant

• Written in Java

• CPU: not really testing the CPU

• Memory: profiling shows that memcpy() is heavily used

• What can we do?

• optimize the JIT part of the DVM

78

Page 79: Understanding Android Benchmarks

What other possible ways?

• binary translation during

• installation time

• run time

79

Page 80: Understanding Android Benchmarks

Wrap-up

• Popular CPU and Memory benchmarks on Android mostly don’t reflect real CPU performance

• We know CPU performance != System performance != user-perceived performance

• There is always room for improvement

80

Page 81: Understanding Android Benchmarks

So?

81

Page 82: Understanding Android Benchmarks

Recent progress

• EEMBC's AndEBench 2.0 is under development (http://www.eembc.org/press/pressrelease/130128.html)

• Qualcomm asked BDTI to develop a new benchmark (http://www.qualcomm.com/media/blog/2013/08/16/mobile-benchmarking-turning-corner-user-experience)

• Samsung and other vendors launched the MobileBench consortium last year

• Antutu is still growing

82

Page 83: Understanding Android Benchmarks

Thanks!

Page 84: Understanding Android Benchmarks

Advertisement

• MediaTek joined linaro.org last month

• linaro.org is an NPO working on open source Linux/Android related stuff for ARM-based SoCs

• So MTK is getting more open recently

• And it's looking for open source engineers

• Talk to the guys at the MTK booth, or to me

• There are more non-open-source jobs, too

84

Page 85: Understanding Android Benchmarks

backup

85

Page 86: Understanding Android Benchmarks

Some References to Understand Performance Benchmark

• Raj Jain, “The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling”, Wiley, 1991

• Quantitative Approach

• A good SPEC introduction article, http://mrob.com/pub/comp/benchmarks/spec.html

• Kaivalya M. Dixit, “Overview of the SPEC Benchmarks,” http://people.cs.uchicago.edu/~chliu/doc/benchmark/chapter9.pdf

86

Page 87: Understanding Android Benchmarks

Basic system parameters  

------------------------------------------------------------------------------  

Host OS Description Mhz tlb cache mem scal  

pages line par load  

bytes  

--------- ------------- ----------------------- ---- ----- ----- ------ ----  

localhost Linux 3.4.5-g armv7l-linux-gnu 1696 7 64 4.4700 1  

Processor, Processes - times in microseconds - smaller is better  

------------------------------------------------------------------------------  

Host OS Mhz null null open slct sig sig fork exec sh  

call I/O stat clos TCP inst hndl proc proc proc  

--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----  

localhost Linux 3.4.5-g 1696 0.49 0.67 2.54 5.95 8.52 0.67 5.05 876. 1668 4654  

Basic integer operations - times in nanoseconds - smaller is better  

-------------------------------------------------------------------  

Host OS intgr intgr intgr intgr intgr  

bit add mul div mod  

--------- ------------- ------ ------ ------ ------ ------  

localhost Linux 3.4.5-g 1.0700 0.1100 3.4000 90.5 14.8  

Basic float operations - times in nanoseconds - smaller is better  

-----------------------------------------------------------------  

87

Page 88: Understanding Android Benchmarks

Context switching - times in microseconds - smaller is better  

-------------------------------------------------------------------------  

Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K  

ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw  

--------- ------------- ------ ------ ------ ------ ------ ------- -------  

localhost Linux 3.4.5-g 8.9700 4.9000 6.1400 12.3 7.68000 57.6  

*Local* Communication latencies in microseconds - smaller is better  

---------------------------------------------------------------------  

Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP  

ctxsw UNIX UDP TCP conn  

--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----  

localhost Linux 3.4.5-g 8.970 17.6 23.9 47.5 71.3 357.  

File & VM system latencies in microseconds - smaller is better  

-------------------------------------------------------------------------------  

Host OS 0K File 10K File Mmap Prot Page 100fd  

Create Delete Create Delete Latency Fault Fault selct  

--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----  

localhost Linux 3.4.5-g 700.0 1.259 2.55270 3.048  

*Local* Communication bandwidths in MB/s - bigger is better  

-----------------------------------------------------------------------------  

Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem  

88

Page 89: Understanding Android Benchmarks

PARSEC content

• Blackscholes – This application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation and as such it must be computed numerically.

• Bodytrack – This computer vision application is an Intel RMS workload which tracks a human body with multiple cameras through an image sequence. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces.

• Canneal – This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design. Canneal uses fine-grained parallelism with a lock-free algorithm and a very aggressive synchronization strategy that is based on data race recovery instead of avoidance.

• Dedup – This kernel was developed by Princeton University. It compresses a data stream with a combination of global and local compression that is called 'deduplication'. The kernel uses a pipelined programming model to mimic real-world implementations. The reason for the inclusion of this kernel is that deduplication has become a mainstream method for new-generation backup storage systems.

• Facesim – This Intel RMS application was originally developed by Stanford University. It computes a visually realistic animation of the modeled face by simulating the underlying physics. The workload was included in the benchmark suite because an increasing number of animations employ physical simulation to create more realistic effects.

• Ferret – This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed by Princeton University. The reason for the inclusion in the benchmark suite is that it represents emerging next-generation search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model.

89

Page 90: Understanding Android Benchmarks

PARSEC content

• Fluidanimate – This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. It was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for animations.

• Freqmine – This application employs an array-based version of the FP-growth (Frequent Pattern-growth) method for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. Freqmine was included in the PARSEC benchmark suite because of the increasing use of data mining techniques.

• Raytrace – The Intel RMS application uses a version of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene.

• Streamcluster – This RMS kernel was developed by Princeton University and solves the online clustering problem. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics.

• Swaptions – The application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. Swaptions employs Monte Carlo (MC) simulation to compute the prices.

• Vips – This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution.

• X264

90

