CS2504: Computer Organization
Lecture 4: Evaluating Performance

Instructor: Dimitris Nikolopoulos
Guest Lecturer: Matthew Curtis-Maury

CS2504, Spring 2007 ©Dimitris Nikolopoulos



Understanding Performance

• Why do we study performance?
  - Evaluate during design
  - Evaluate before purchasing
  - Key to understanding underlying organizational motivation
• How can we (meaningfully) compare two machines?
  - Performance, cost, value, etc.
• Main issue:
  - Need to understand what factors in the architecture contribute to overall system performance, and the relative importance of these factors
• Effects of ISA on performance
• How will a hardware change affect performance?

Airplane Performance Analogy

Airplane          Passengers   Range   Speed
Boeing 777        375          4630    610
Boeing 747        470          4150    610
Concorde          132          4000    1250
Douglas DC-8-50   146          8720    544
Fighter Jet       4            2000    1500

• What metric do we use?
  - Concorde is 2.05 times faster than the 747
  - 747 has 1.74 times higher throughput
  - What about cost?
• And the winner is: It Depends!
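The two ratios quoted on this slide can be reproduced from the table. The sketch below assumes that "throughput" here means passengers multiplied by speed; the slide does not spell out the formula, so treat that as an assumption.

```python
# Figures copied from the airplane table on this slide.
planes = {
    "Boeing 747": {"passengers": 470, "speed": 610},
    "Concorde": {"passengers": 132, "speed": 1250},
}


def throughput(plane):
    # Assumed metric: passengers carried times cruising speed.
    return plane["passengers"] * plane["speed"]


speed_ratio = planes["Concorde"]["speed"] / planes["Boeing 747"]["speed"]
throughput_ratio = throughput(planes["Boeing 747"]) / throughput(planes["Concorde"])

print(f"Concorde speed advantage:  {speed_ratio:.2f}x")
print(f"747 throughput advantage:  {throughput_ratio:.2f}x")
```

Which plane "wins" depends on which of the two ratios matters for the task at hand.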

Throughput vs. Response Time

• Response Time:
  - Execution time (e.g. seconds or clock ticks)
  - How long does the program take to execute?
  - How long do I have to wait for a result?
• Throughput:
  - Rate of completion (e.g. results per second/tick)
  - What is the average execution time of the program?
  - Measure of total work done
• Upgrading to a newer processor will improve: response time
• Adding processors to the system will improve: throughput

Example: Throughput vs. Response Time

• Suppose we know that an application that uses both a desktop client and a remote server is limited by network performance. For the following changes, which of throughput, response time, both, or neither is improved?
  - An extra network channel is added between the client and the server, increasing the total network throughput and reducing the delay to obtain network access.
    Answer: Throughput is improved directly and response time is improved by reducing delay.
  - The networking software is improved, thereby reducing the network communication delay, but not increasing throughput.
    Answer: Response time is improved directly.
  - More memory is added to the computer.
    Answer: Maybe neither. Maybe response time.

Design Goals

• Performance: maximum speed
• Cost: circuit size
• Value: best price-performance ratio
• Energy: least energy consumption

Definition of Performance

• Performance is inversely proportional to time:
  - To maximize performance, minimize execution time
      performance_X = 1 / execution_time_X
• “X is n times faster than Y”
  - Execution time on Y is n times longer than on X
      performance_X / performance_Y = execution_time_Y / execution_time_X = n

Example: Performance Calculation

• A particular multiprocessor server machine’s performance is 4 times better than a given uniprocessor desktop system. If the desktop system runs an application in 28 seconds, how long will it take on the server?
      performance_server / performance_desktop = time_desktop / time_server = 28 seconds / time_server = 4
      time_server = 28 / 4 = 7 seconds

Example: Relative Performance

• If a particular desktop runs a program in 60 seconds and a laptop runs the same program in 75 seconds, how much faster is the desktop than the laptop?
      Performance_desktop = 1 / execution_time_desktop = 1/60
      Performance_laptop = 1 / execution_time_laptop = 1/75
      Performance_desktop / Performance_laptop = (1/60) / (1/75) = 1.25
  Or simply: execution_time_laptop / execution_time_desktop = 1.25
  So, the desktop is 1.25 times faster than the laptop
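The two worked examples above both rest on the rule that X is n times faster than Y when execution_time_Y / execution_time_X = n. A minimal sketch, with a helper name (`relative_performance`) chosen here purely for illustration:

```python
def relative_performance(time_x, time_y):
    """Return n such that X is n times faster than Y (n = time_Y / time_X)."""
    return time_y / time_x


# Desktop (60 s) versus laptop (75 s), as in the example above.
desktop_vs_laptop = relative_performance(time_x=60.0, time_y=75.0)

# Server is 4 times faster than a desktop that needs 28 s,
# so the same relation is solved for the server's time.
server_time = 28.0 / 4.0

print(desktop_vs_laptop)  # ratio of laptop time to desktop time
print(server_time)        # seconds on the server
```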

Real-time Constraints

• What are real-time systems?
  - Systems with “operational deadlines from event to system response”
• Examples of real-time systems:
  - Antilock brake system
    • Hard real-time
    • Must know immediately if the brakes have locked
  - Video playback
    • Soft real-time
    • Want to make most deadlines to avoid image jitter
• Real-time definition of performance:
  - “Are the deadlines met?”
  - Consequences on design: minimize cost to meet deadlines

Application-Specific Metrics

• Applications depend on different parts of the computer
  - Scientific applications: CPU and memory
  - Server applications: I/O
• Also, need to find the right corresponding metric
  - Wall-clock time
  - Throughput
  - Both (maximum throughput with worst case response time)
• So, need to keep the metric in mind for optimization
  - Identify the bottleneck in terms of the target metric

Measuring Execution Time

• Time is the ultimate measure of performance
  - Same work in less time = better performance
• Most straightforward definition of time
  - Wall-clock time, response time, elapsed time
  - Total time to completion of a task
• CPU time
  - Amount of time the task was actually executing
  - Doesn’t include I/O or running other tasks
• We will generally use CPU time

Clock Cycles

• Clock cycles are a direct measure of time
  - Measures how fast the computer can perform basic functions
  - Discrete time interval in the CPU
• Clock period is the time for one clock cycle (seconds)
• Clock rate is the inverse of clock period (cycles/second)
  - 5 nsec clock cycle => 200 MHz clock rate
  - 500 psec clock cycle => 2 GHz clock rate
  - 200 psec clock cycle => 5 GHz clock rate

[Figure: one clock period]
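The period-to-rate conversions listed on this slide follow directly from rate = 1 / period. A small sketch (the function name `clock_rate_hz` is an illustrative choice, not from the lecture):

```python
def clock_rate_hz(period_seconds):
    """Clock rate (cycles/second) is the inverse of the clock period (seconds)."""
    return 1.0 / period_seconds


# The three clock periods quoted on the slide, expressed in seconds.
periods = {"5 ns": 5e-9, "500 ps": 500e-12, "200 ps": 200e-12}
rates = {label: clock_rate_hz(p) for label, p in periods.items()}

for label, rate in rates.items():
    print(f"{label} period -> {rate / 1e9:.1f} GHz")
```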

Execution Time Formula

• Relating cycles to seconds:
      CPU_time = CPU_cycles * cycle_time
  or
      CPU_time = CPU_cycles / clock_rate
• So to improve performance we have two options
  - Decrease the number of cycles in a program
  - Increase the clock rate (decrease cycle time)
  - However, these are often at odds with each other

Example: Improving Performance

Our favorite program runs in 10 seconds on computer A (4 GHz). We are designing a computer B to run the same program in 6 seconds. If increasing the clock rate will require 1.2 times as many cycles for computer B, what clock rate do we need?

Number of clock cycles executed by A:
      10 seconds = clock_cycles_A / (4*10^9 cycles/second)
      clock_cycles_A = 10 seconds * (4*10^9 cycles/second) = 40*10^9 cycles

Then we find the clock rate needed on computer B:
      6 seconds = 1.2 * (40*10^9 cycles) / clock_rate_B
      clock_rate_B = 1.2 * (40*10^9 cycles) / 6 seconds = 8*10^9 cycles/second = 8 GHz
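The same calculation can be checked in a few lines. This is only a sketch of the arithmetic above; the variable names are illustrative.

```python
# Computer A: 10 seconds at 4 GHz.
time_a = 10.0          # seconds
clock_rate_a = 4e9     # cycles per second
cycles_a = time_a * clock_rate_a  # CPU_cycles = CPU_time * clock_rate

# Computer B needs 1.2 times as many cycles and must finish in 6 seconds.
cycles_b = 1.2 * cycles_a
target_time_b = 6.0
clock_rate_b = cycles_b / target_time_b  # clock_rate = CPU_cycles / CPU_time

print(f"Cycles on A: {cycles_a:.3g}")
print(f"Required clock rate for B: {clock_rate_b / 1e9:.1f} GHz")
```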

Determining Clock Cycles

• So what determines the number of cycles required to execute an application?
  - One possibility: #Cycles = #Instructions
  - However, this is NOT true because different instructions take different amounts of time

[Figure: a program's instructions 1 to 8 laid out along a time axis]

Determining Clock Cycles

• A more realistic picture of what’s happening…
  - Floating point operations can take longer than integer
  - Multiplication takes longer than addition
  - Memory accesses can take many cycles to complete
      Clock cycles = Instructions * Avg Cycles Per Instruction

[Figure: instructions 1 to 5 laid out along a time axis]

Example: Calculating Time

• Suppose we have two implementations of the same ISA. Computer A has a cycle time of 250 ps and a CPI (cycles per instruction) of 2.0 for some program, and computer B has a cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program?

Note: A constant number of instructions will be executed: I
      clock_cycles_A = I * 2.0 and clock_cycles_B = I * 1.2
      time_A = I * 2.0 * 250 ps = 500 * I ps and time_B = I * 1.2 * 500 ps = 600 * I ps
      performance_A / performance_B = time_B / time_A = (600 * I ps) / (500 * I ps) = 1.2

Computer A is 1.2 times faster
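Because the instruction count I cancels in the ratio, any positive value gives the same answer. The sketch below picks a hypothetical I of one million only to make the numbers concrete.

```python
def cpu_time_ps(instruction_count, cpi, cycle_time_ps):
    """CPU time in picoseconds = instructions * cycles/instruction * ps/cycle."""
    return instruction_count * cpi * cycle_time_ps


I = 1_000_000  # hypothetical instruction count; it cancels in the ratio

time_a = cpu_time_ps(I, cpi=2.0, cycle_time_ps=250)
time_b = cpu_time_ps(I, cpi=1.2, cycle_time_ps=500)

# performance_A / performance_B = time_B / time_A
a_over_b = time_b / time_a
print(f"Computer A is {a_over_b:.2f} times faster than B")
```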

Clock Cycles per Instruction

• Then what is average Cycles Per Instruction?
  - The average number of cycles each instruction takes
  - A way to compare two implementations of one ISA
• CPI is dependent on the instruction mix
  - This is the composition of different types of instructions in an application
• Be aware of variation in CPI by instruction type
  - Averages across all instructions executed
  - Specific to a given instruction sequence

Effective CPI

• Minimum CPI: CPI with an instruction mix of exclusively the shortest instruction
• Calculating clock cycles
      CPU Clock Cycles = Σ(i = 1..n) (CPI_i x IC_i)
  - IC_i is the number of total instructions of class i
  - CPI_i is the average CPI for instruction class i
  - n is the number of instruction classes
  - Accounts for the weight and CPI of each instruction type
• Effective CPI
      CPI = Clock cycles / Number of instructions

Example: Calculating CPI

• Given the following CPIs for each instruction class and instruction mixes, which code sequence executes fewer instructions? Which is faster? Which has the lower CPI?

Class        A   B   C
CPI          1   2   3

Sequence     A   B   C
1            2   1   2
2            4   1   1

• Instructions:
      Seq 1: 2+1+2 = 5 ins
      Seq 2: 4+1+1 = 6 ins
• Cycles:
      Seq 1: (2*1)+(1*2)+(2*3) = 10 cycles
      Seq 2: (4*1)+(1*2)+(1*3) = 9 cycles
• CPI:
      Seq 1: 10/5 = 2.0
      Seq 2: 9/6 = 1.5

Sequence 1 has fewer instructions
Sequence 2 is faster
Sequence 2 has the lower CPI
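The per-class sum used above, cycles = Σ (CPI_i x IC_i), is easy to mechanize. The following is a minimal sketch over the two sequences from this example; the helper name `summarize` is illustrative.

```python
# CPI per instruction class and instruction counts per sequence,
# copied from the tables in this example.
CPI = {"A": 1, "B": 2, "C": 3}
sequences = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def summarize(mix):
    """Return (instruction count, clock cycles, effective CPI) for one mix."""
    instructions = sum(mix.values())
    cycles = sum(CPI[cls] * count for cls, count in mix.items())
    return instructions, cycles, cycles / instructions


results = {seq: summarize(mix) for seq, mix in sequences.items()}
for seq, (ins, cyc, cpi) in results.items():
    print(f"Seq {seq}: {ins} instructions, {cyc} cycles, CPI {cpi:.1f}")
```

On the same machine (same clock rate), fewer cycles means less time, which is why sequence 2 is faster despite executing more instructions.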

“THE” Performance Equation

• Combining the formulas we have seen:
      Time = Instruction Count * CPI * Cycle time
  Or
      Seconds / Program = (Instructions / Program) * (Cycles / Instruction) * (Seconds / Cycle)

“THE” Performance Equation

• Separates the three key performance factors
  - Instructions, CPI, Clock rate
• Can help evaluate design decisions
  - Known effects on these terms can be translated into the overall effect on performance
• How can the values of these terms be found?
  - Time: by running the program
  - Clock rate: published by the computer manufacturer
  - Instructions and CPI:
    • Hardware performance counters: CPU logic to record events
    • Simulation of the system

Example: Performance Equation

• A given application written in Java runs for 15 seconds on a desktop processor. A new Java compiler is released that requires only 0.6 times as many instructions as the old compiler. Unfortunately, it increases the CPI by a factor of 1.1. How fast can we expect the application to run using this new compiler?
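One way to work this example is to scale the old time by the change in each term of Time = Instruction Count * CPI * Cycle time. The sketch below assumes that the CPI change is a multiplicative factor of 1.1 and that the clock rate is unchanged; neither is stated outright on the slide.

```python
old_time = 15.0             # seconds with the old compiler
instruction_factor = 0.6    # new compiler needs 0.6x the instructions
cpi_factor = 1.1            # assumed: CPI grows by a factor of 1.1
cycle_time_factor = 1.0     # assumed: same processor, same clock

# Each factor scales one term of the performance equation.
new_time = old_time * instruction_factor * cpi_factor * cycle_time_factor

print(f"Expected time with the new compiler: {new_time:.1f} seconds")
```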

MIPS Performance

• MIPS is an alternative metric for performance
  - Million instructions per second
      MIPS = Instructions / (Time * 10^6) = Clock Rate / (CPI * 10^6)
• Inversely proportional to execution time
  - Bigger numbers indicate better performance
  - Intuitive representation
• 3 significant problems with MIPS usage
  - Doesn’t consider what the instructions actually do
  - Varies by program; no single number for a machine
  - Can vary inversely with performance! (example to follow)

Example: MIPS Performance

• Given the following tables of instruction counts (in billions) and CPI for each instruction class, find the MIPS and execution times on a 4 GHz machine.

Class        A    B   C
CPI          1    2   3

Sequence     A    B   C
1            5    1   1
2            10   1   1

      Cycles_1 = (5*1 + 1*2 + 1*3) * 10^9 = 10*10^9 cycles
      Cycles_2 = (10*1 + 1*2 + 1*3) * 10^9 = 15*10^9 cycles
      Time_1 = 10*10^9 / (4*10^9) = 2.5 sec and Time_2 = 15*10^9 / (4*10^9) = 3.75 sec
      MIPS_1 = (5+1+1)*10^9 / (2.5*10^6) = 2800
      MIPS_2 = (10+1+1)*10^9 / (3.75*10^6) = 3200

Sequence 1 is faster but Sequence 2 has higher MIPS
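The mismatch between MIPS and execution time in this example can be reproduced with a short script. This is a sketch of the arithmetic only; the function name `evaluate` is illustrative.

```python
CLOCK_RATE = 4e9  # 4 GHz machine, as stated in the example
CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class; the tables give them in billions.
counts = {
    1: {"A": 5e9, "B": 1e9, "C": 1e9},
    2: {"A": 10e9, "B": 1e9, "C": 1e9},
}


def evaluate(mix):
    """Return (execution time in seconds, MIPS) for one instruction mix."""
    instructions = sum(mix.values())
    cycles = sum(CPI[cls] * n for cls, n in mix.items())
    time = cycles / CLOCK_RATE
    mips = instructions / (time * 1e6)
    return time, mips


time_1, mips_1 = evaluate(counts[1])
time_2, mips_2 = evaluate(counts[2])
print(f"Sequence 1: {time_1:.2f} s, {mips_1:.0f} MIPS")
print(f"Sequence 2: {time_2:.2f} s, {mips_2:.0f} MIPS")
```

Sequence 2 reports the higher MIPS rating while taking longer to run, which is the pitfall the slide warns about.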

Determinants of Performance

CPU time = Instruction_count x CPI x clock_cycle

                         Instruction_count   CPI   Clock_cycle
Algorithm                X                   X
Programming language     X                   X
Compiler                 X                   X
ISA                      X                   X     X
Processor organization                       X     X

Determinants of Performance

• Algorithm affects:
  - Instruction count
    • Determines the number of source program instructions, which affects the total number of instructions executed
  - CPI
    • By favouring particular classes of instructions, the algorithm can affect whether slower or faster instructions are used
    • For example, it can use more floating point operations and increase the CPI

Determinants of Performance

• Programming language affects:
  - Instruction count
    • Statements in the language are translated to processor instructions, which determine the instruction count
  - CPI
    • The features of a programming language may influence the CPI because they may translate to slower or faster instructions
    • For example, indirect calls in Java are expensive

Determinants of Performance

• Compiler affects:
  - Instruction count
    • The compiler determines how to translate a high level language into machine instructions, so it is directly responsible
  - CPI
    • The compiler can translate the source code into high or low CPI instructions, which affects the overall average CPI

Determinants of Performance

• Instruction Set Architecture affects:
  - Instruction count
    • The ISA determines what instructions are available, which affects how many instructions are required to perform a task
  - CPI
    • The ISA can consist of fast or slow instructions for different operations, which determines the CPI
  - Clock rate
    • The ISA determines the amount of work of each instruction, which affects the clock rate

Determinants of Performance

• Processor Organization affects:
  - CPI
    • The processor organization is the implementation of instructions, so it determines how long each instruction will take to execute
  - Clock rate
    • The clock rate is affected by what work needs to be done between clock ticks, and this is determined by the processor organization

Examples: More CPI Calculations

Op       Freq   CPI_i   Freq x CPI_i
ALU      50%    1       0.5
Load     20%    5       1.0
Store    10%    3       0.3
Branch   20%    2       0.4
                        Σ = 2.2

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
      Freq x CPI_i: 0.5, 0.4, 0.3, 0.4, so Σ = 1.6
      2.2/1.6 means 37.5% faster
• How does this compare with using branch prediction to shave a cycle off the branch time?
      Freq x CPI_i: 0.5, 1.0, 0.3, 0.2, so Σ = 2.0
      2.2/2.0 means 10% faster
• What if two ALU ins could be executed at once?
      Freq x CPI_i: 0.25, 1.0, 0.3, 0.4, so Σ = 1.95
      2.2/1.95 means 12.8% faster
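Each of the three what-if questions above changes one CPI_i and recomputes the weighted sum. A minimal sketch, with helper names (`effective_cpi`, `with_cpi`) chosen only for illustration:

```python
# Instruction mix from the table: operation -> (frequency, CPI).
BASE_MIX = {
    "ALU": (0.5, 1.0),
    "Load": (0.2, 5.0),
    "Store": (0.1, 3.0),
    "Branch": (0.2, 2.0),
}


def effective_cpi(mix):
    """Weighted average CPI: sum of frequency times per-class CPI."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix with one operation's CPI replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed


base = effective_cpi(BASE_MIX)
scenarios = {
    "better data cache (load = 2 cycles)": with_cpi(BASE_MIX, "Load", 2.0),
    "branch prediction (branch = 1 cycle)": with_cpi(BASE_MIX, "Branch", 1.0),
    "two ALU ins at once (ALU = 0.5 cycles)": with_cpi(BASE_MIX, "ALU", 0.5),
}
speedups = {name: base / effective_cpi(mix) for name, mix in scenarios.items()}

for name, s in speedups.items():
    print(f"{name}: {(s - 1) * 100:.1f}% faster")
```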

Now That We Understand Cycles

• A given program will require:
  - Some number of instructions (machine ins)
  - Some number of cycles
  - Some number of seconds
• We now have the vocabulary to discuss how these quantities relate to each other
  - Cycle time (seconds/cycle)
  - Clock rate (cycles/second)
  - CPI (cycles/instruction)
  - MIPS (millions of instructions/second)

Performance Traps

• Performance is determined by the execution time of the program that you care about
• Do any of the other metrics equal performance?
  - # of cycles to execute a program?
  - # of instructions in a program?
  - # of cycles per second?
  - # of cycles per instruction?
  - # of instructions per second (e.g. MIPS)?
• Common pitfall:
  - Thinking that any one metric is representative of performance

Evaluating Performance

• How do we compare two computers, etc?
  - Could rely on specs
    • Difficult to interpret
    • Not reliable between architectures
• Better to actually run applications
  - Execution time of applications can be the metric
  - These applications are called the workload
• Ideal situation is a user with a fixed workload
  - Run workload on both machines, compare times
  - Find the machine best for the user
  - This situation is not very common

Workloads

• Performance is best determined by running real applications
  - Use programs typical of the expected workload
  - Pick a workload containing expected properties
    • Scientists will use scientific applications
    • Software developers use compilers and word processors
• Select a workload that emphasises the same resources
  - On desktop systems
    • CPU performance
    • DVD playback
    • Graphics
  - On server systems
    • CPU performance (scientific servers)
    • File access times (file and web servers)

Benchmarks

• Benchmarks exist to represent many workloads
  - Don’t have to find your own applications
  - Generally agreed upon for reporting numbers
• Many popular benchmark suites to choose from
  - SPEC (Standard Performance Evaluation Corporation)
    • CPU (INT+FP), Web, Mail, Java, Graphics, HPC
    • Soon to publish Power and Virtualization benchmarks
  - NAS (NASA Advanced Supercomputing)
    • Parallel HPC applications
  - EEMBC
    • Embedded computing

Representative Workloads

• Five possible classes of applications:
  - Real applications
  - Modified applications (most benchmarks are here)
  - Kernels
  - Toy benchmarks
  - Synthetic benchmarks
• Each exists for different reasons
• Different reliability as benchmarks

Problems with Small Benchmarks

• Very attractive to designers
  - Due to ease of compilation and simulation (even by hand)
  - Saves a lot of time in evaluation
  - Easy to standardize
• Bad optimization target
  - Can be trivially optimized
  - Tailored to benchmark, not actual end-user workload
• Easily abused by designers
  - Compiler optimizations enabled for specific app
  - Generate possibly incorrect code for real app
• No excuse for their use on working computers

Reporting Results

• Reproducibility is of the utmost importance
  - List everything needed to duplicate the results
  - Operating system, compiler, computer configuration
• Program input is also very important to consider and to report
  - Input can greatly affect program behavior
    • Execute dominant vs. boundary cases
  - Input size affects the performance
    • Larger inputs stress the memory system more
    • Must find a representative input size

Summarizing Performance

• Summarizing multiple results gives less info
  - Preferable to have a single number
  - Can simply compare single numbers between machines rather than complicated sets of numbers
• How do we come up with that single number?
  - Obviously want it to be representative of performance
  - We’ve seen how complicated performance can be

Comparing Performance

Time        Computer A   Computer B
Program 1   1            10
Program 2   1000         100
Total       1001         110

• This example illustrates the difficulty
  - Computer A is 10 times faster than B for Program 1
  - Computer B is 10 times faster than A for Program 2
• Simplest approach: use total execution time
      Performance_B / Performance_A = Time_A / Time_B = 1001 / 110 = 9.1

Arithmetic Mean

• If applications will be run the same # of times
  - Use the arithmetic mean
      AM = (1/n) * Σ(i = 1..n) Time_i
• If applications will be run a different # of times
  - Use the weighted arithmetic mean
      WAM = (1/n) * Σ(i = 1..n) Weight_i * Time_i
  - Use weights corresponding to actual frequency
  - Can choose weights to make times equal on a base machine
• AM is proportional to execution time

Geometric Mean

• The geometric mean of a collection of positive data is defined as the nth root of the product of all the members of the data set, where n is the number of members. (Wikipedia)
• Should be used for averaging normalized numbers
  - Overcomes inconsistencies of AM
  - Will always be less than or equal to AM
  - See: “How not to lie with statistics”

Example: Geometric Mean

Normalized to CPU R:

Bench   CPU: R          CPU: M
A       417 (1.00)      244 (0.59)
B       83 (1.00)       70 (0.84)
C       66 (1.00)       153 (2.32)
D       39449 (1.00)    33527 (0.85)
E       772 (1.00)      368 (0.48)
AM      (1.00)          (1.01)
GM      (1.00)          (0.86)

Normalized to CPU M:

Bench   CPU: R          CPU: M
A       417 (1.71)      244 (1.00)
B       83 (1.19)       70 (1.00)
C       66 (0.43)       153 (1.00)
D       39449 (1.18)    33527 (1.00)
E       772 (2.10)      368 (1.00)
AM      (1.32)          (1.00)
GM      (1.17)          (1.00)

• These tables show the arithmetic mean (AM) and geometric mean (GM) of a set of normalized execution times
  - AM yields numbers indicating that each CPU is better than the other
  - GM shows CPU M being better than CPU R in both cases
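The inconsistency of the arithmetic mean on normalized times can be reproduced from the raw benchmark times in the tables. This sketch uses only the standard library (`math.prod` requires Python 3.8 or later); the helper names are illustrative.

```python
import math

# Raw execution times for benchmarks A to E, copied from the tables.
times_r = [417, 83, 66, 39449, 772]
times_m = [244, 70, 153, 33527, 368]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # nth root of the product of n positive values.
    return math.prod(values) ** (1.0 / len(values))


# CPU M's times normalized to CPU R, and CPU R's times normalized to CPU M.
m_norm_to_r = [m / r for r, m in zip(times_r, times_m)]
r_norm_to_m = [r / m for r, m in zip(times_r, times_m)]

am_m, gm_m = arithmetic_mean(m_norm_to_r), geometric_mean(m_norm_to_r)
am_r, gm_r = arithmetic_mean(r_norm_to_m), geometric_mean(r_norm_to_m)

print(f"M normalized to R: AM {am_m:.2f}, GM {gm_m:.2f}")
print(f"R normalized to M: AM {am_r:.2f}, GM {gm_r:.2f}")
```

Both arithmetic means come out above 1, so each CPU looks slower than its reference, while the two geometric means are reciprocals of each other and tell a consistent story.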

Example: Comparing Performance

• Suppose you are choosing between 4 desktop systems: an Apple Macintosh and 3 PC-compatible computers (Pentium 4, Pentium 5, AMD). Which of the following are true?
  - The fastest computer will be the one with the highest clock rate. (False)
  - The fastest PC will be the one with the highest clock rate. (False)
  - The fastest Intel will be the one with the highest clock rate. (False)
  - Must use benchmarks to ascertain the relative performance for your application workload. (True)

Example: Comparing Performance

• Assume the following measurements were taken:

Program   Computer A   Computer B
1         2 sec        4 sec
2         5 sec        2 sec

• Which of the following statements are true:
  - A is faster than B for program 1.
    2 < 4, so True
  - A is faster than B for program 2.
    5 > 2, so False
  - A is faster than B for a workload where program 1 and program 2 are executed an equal number of times.
    2+5 > 4+2, so False
  - A is faster than B for a workload where program 1 is executed twice as often as program 2.
    2+2+5 < 4+4+2, so True
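The last two answers depend on how often each program runs, which amounts to a weighted total of execution times. A small sketch of that comparison (the function name `workload_time` is illustrative):

```python
# Measured times in seconds: machine -> program -> time.
times = {
    "A": {1: 2, 2: 5},
    "B": {1: 4, 2: 2},
}


def workload_time(machine, runs):
    """Total time for a workload given how many times each program runs."""
    return sum(times[machine][program] * n for program, n in runs.items())


equal_mix = {1: 1, 2: 1}       # both programs run once
program1_heavy = {1: 2, 2: 1}  # program 1 runs twice as often

a_faster_equal = workload_time("A", equal_mix) < workload_time("B", equal_mix)
a_faster_heavy = workload_time("A", program1_heavy) < workload_time("B", program1_heavy)

print("A faster with equal mix:", a_faster_equal)
print("A faster when program 1 runs twice as often:", a_faster_heavy)
```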

Evaluation using SPEC

• SPEC CPU benchmarks
  - Measure CPU performance
  - 12 INT and 14 FP applications (details next slide)
  - Latest release (at book publishing) was CPU2000
• What metric do they report?
  - Execution time of each application
  - Times reported for a 300 MHz Sun machine are divided by the observed times
  - This gives the SPEC ratio
  - Bigger numbers indicate better performance
  - Geometric means are reported for INT and FP

SPEC CPU2000

Integer benchmarks
gzip       Compression
vpr        FPGA place & route
gcc        GNU C compiler
mcf        Combinatorial optimization
crafty     Chess program
parser     Word processing program
eon        Computer visualization
perlbmk    perl application
gap        Group theory interpreter
vortex     Object oriented database
bzip2      Compression
twolf      Circuit place & route

FP benchmarks
wupwise    Quantum chromodynamics
swim       Shallow water model
mgrid      Multigrid solver in 3D fields
applu      Parabolic/elliptic pde
mesa       3D graphics library
galgel     Computational fluid dynamics
art        Image recognition (NN)
equake     Seismic wave propagation simulation
facerec    Facial image recognition
ammp       Computational chemistry
lucas      Primality testing
fma3d      Crash simulation fem
sixtrack   Nuclear physics accel
apsi       Pollutant distribution

SPEC CPU Example Use

• CPU performance improvements come from:
  - Increases in clock rate
  - CPI-lowering processor organization improvements
  - Compiler enhancements (fewer or simpler ins)
• We will now compare the SPEC results from two Intel processors
  - Pentium III
  - Pentium 4
  - Varying the clock rate

Comparing P3 and P4

• Linear performance increases with clock rate
• Note the relative performance of FP and INT

Analysis of FP vs. INT

• So why does the relative performance of FP and INT change?
• We can look at the SPEC ratio over clock rate

Ratio      Pentium III   Pentium 4
CINT2000   0.47          0.36
CFP2000    0.34          0.39

  - P4 has more advanced IC technology and a more aggressive pipeline structure
    • Increases clock rate at some expense to CPI for CINT
    • However, CPI is improved for CFP
    • Happens because new instructions were introduced (SSE)

SPEC Web 99

• Focuses on throughput rather than response time
  - Maximum number of connections supported
  - Must maintain minimum performance guarantee
  - Multiprocessor systems perform well
• SPECweb99 simply generates requests and records the throughput
  - Does not handle the requests
  - This means that the software is measured
• Results depend upon many system specs
  - Disk drives, network, CPUs

SPEC Web 99 Results

• Focuses on throughput rather than response time
  - A larger number of slower processors is better
  - More disks and network connections are important

System   #Disks   #CPUs   GHz    #Nets   Result
1550     2        2       1.0    2       2765
1650     3        2       1.4    1       1810
2500     8        2       1.13   4       3435
2650     5        2       3.06   4       5698
6400     5        4       0.7    4       4200
6600     8        4       2.0    8       6700
8450     7        8       0.7    8       8001

Example: SPEC Web 99

• Which of the following uniprocessor Pentium III configurations is likely to produce the best performance on SPECweb99?
  a) 1.26 GHz processor, 1 disk, 1 network connection
  b) 1.0 GHz processor, 6 disks, 3 network connections
  c) 1.1 GHz processor, 2 disks, 2 network connections

Definitely B, because disk and network are AT LEAST as important as CPU speed

Power Aware Computing

• Power is increasingly becoming a performance limiting factor
  - Passive cooling problems
  - Battery power limitations
  - Electricity costs
  - Overheating in server systems
  - Overheating in THIS laptop
• Popular solution is DVFS
  - Dynamic Voltage and Frequency Scaling
  - Reducing frequency hurts performance
  - Saves power proportional to the square of the performance loss

Evaluating DVFS Performance

• Performance characteristics on 3 processors with DVFS
• Pentium M best for max and adaptive
  - Slower processor
  - Newer technology
• Pentium 4 best for minimum
  - Much higher clock rate

Evaluating DVFS Efficiency

• The energy efficiency of DVFS on 3 processors
  - Performance / Power Consumed
• Pentium M best in all cases
  - Designed for energy efficiency from the start
  - Extremely efficient in lowest DVFS ‘gear’
• Pentium III better than 4
  - P4 has inherently energy inefficient logic


Amdahl’s Law
•Amdahl’s Law dictates the maximum performance improvement that can be seen by optimizing only part of the system
Time_after = Time_noopt + Time_opt / Improvement
•The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
  −Considers the amount of time the optimized region is used
  −Considers the amount by which the region is optimized
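The formula on this slide can be sketched in a few lines of Python. This is an illustration only, not part of the original slides: the function names `time_after` and `max_speedup` and the 30 s / 70 s split are assumptions chosen for the example.

```python
def time_after(time_noopt, time_opt, improvement):
    """Amdahl's Law: execution time after speeding up only the optimized part."""
    return time_noopt + time_opt / improvement


def max_speedup(time_noopt, time_opt):
    """Upper bound on speedup, reached only if the optimized part takes zero time."""
    return (time_noopt + time_opt) / time_noopt


# Hypothetical program: 30 s cannot be optimized, 70 s can.
for improvement in (2, 10, 1000):
    t = time_after(30.0, 70.0, improvement)
    print(f"improvement {improvement:>4}x -> {t:6.2f} s, speedup {100.0 / t:.2f}")
print(f"upper bound on speedup: {max_speedup(30.0, 70.0):.2f}")
```

However large the improvement factor becomes, the overall speedup stays below the bound set by the part that is not optimized.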


Amdahl’s Law
•This diagram illustrates the logic behind Amdahl’s Law
[Diagram: a process with two independent parts, A and B. Cases shown: Original process; Make B 5x faster; Make A 2x faster; Eliminate B entirely]


Speedup
•Speedup is the ratio by which the performance is improved
Speedup = Performance_after / Performance_orig = Time_orig / Time_after
•E.g. improvement from 6 seconds to 5 seconds would be a speedup of 6/5 = 1.2
•Amdahl’s Law limits speedup by finding the minimum Time_after


Example: Using Amdahl’s Law
•Suppose an enhancement to a processor used for Web serving makes it 10 times faster on computation than the original processor. If the original CPU is busy with computation 40% of the time, what is the overall speedup to be gained with the new processor?
Time_opt = 0.4 because 40% of the time is CPU
Time_noopt = 0.6 remaining time
Improvement = 10 because new CPU is 10x faster
Time_after = 0.6 + 0.4/10 = 0.64
Speedup_overall = 1/0.64 = 1.56


Example: Using Amdahl’s Law
•Suppose a program runs in 100 seconds, and multiply operations account for 80% of this time. How much faster do we have to make multiplication to make the program 5 times faster?
Time_after = Time_noopt + Time_opt / Improvement
20 seconds = (100 − 80) seconds + 80 seconds / n
0 seconds = 80 seconds / n
There is no possible amount of optimization to accomplish this.
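The same algebra can be checked numerically. The sketch below solves Time_after = Time_noopt + Time_opt / n for n; the helper name `required_improvement` and the extra 4x target are assumptions added for illustration.

```python
def required_improvement(total_time, optimizable_time, target_speedup):
    """Return the factor n by which the optimizable part must be sped up,
    or None when no finite factor reaches the target (Amdahl's Law)."""
    time_noopt = total_time - optimizable_time
    target_time = total_time / target_speedup
    remaining = target_time - time_noopt  # time budget left for the optimized part
    if remaining <= 0:
        return None
    return optimizable_time / remaining


print(required_improvement(100.0, 80.0, 5))  # the slide's case: no finite n exists
print(required_improvement(100.0, 80.0, 4))  # a 4x target is still reachable
```

A 5x target leaves no time at all for the multiplications, which is why the slide concludes that it cannot be met.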


Law of Diminishing Returns
•What is the “law” of diminishing returns?
  −The incremental improvement in speed gained by each additional improvement in the performance of just a portion of the computation diminishes as improvements are added
•When does it apply?
  −If you continue to optimize the same portion of execution, each step yields minimal additional gains
•E.g. continuing to cut execution time in half:
  −1 second -> 0.5 -> 0.25 -> 0.125 -> …
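A small sketch makes the shrinking gains concrete. It assumes, purely for illustration, that the halved region sits next to 1 second of work that is never optimized; `halving_steps` is a hypothetical helper name.

```python
def halving_steps(fixed_time, optimized_time, steps):
    """Yield (region_time, total_time, time_saved_by_this_step) per halving."""
    for _ in range(steps):
        new_time = optimized_time / 2
        saved = optimized_time - new_time
        optimized_time = new_time
        yield optimized_time, fixed_time + optimized_time, saved


# Assumed: 1 s of untouched work alongside the slide's 1 s optimized region.
for region, total, saved in halving_steps(1.0, 1.0, 4):
    print(f"region {region:.4f} s  total {total:.4f} s  saved {saved:.4f} s")
```

Each halving costs another round of optimization effort but saves only half as much time as the previous one.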


Amdahl’s Law for Parallelization
•Maximum possible speedup: 1 / (F + (1 − F)/N)
  −F is the fraction of execution that is not parallelizable
  −N is the number of processors
•Demonstrates the importance of minimizing F
•Also shows the diminishing returns
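To see both effects, the formula can be evaluated for a growing number of processors. The value F = 0.1 below is an assumed example, not a figure from the slides.

```python
def parallel_speedup(serial_fraction, processors):
    """Amdahl's Law for parallelization: 1 / (F + (1 - F) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# Assumed example: 10% of the execution cannot be parallelized (F = 0.1).
for n in (1, 2, 4, 16, 256):
    print(f"N = {n:>3}: speedup {parallel_speedup(0.1, n):.2f}")
print(f"limit as N grows: {1 / 0.1:.0f}")
```

Going from 16 to 256 processors adds far less speedup than going from 1 to 16, and no processor count can exceed 1/F.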


Example: Amdahl’s Law
•Assume FP square root represents 20% of the time of a graphics benchmark. One proposal might be to improve FPSQR by a factor of 10. Alternatively, we might improve FP (which accounts for 50% of the time) by a factor of 1.6. If each of these optimizations requires the same cost, which one is better?
Speedup_FPSQR = 1 / ((1 − 0.2) + (0.2/10)) = 1 / 0.82 = 1.22
Speedup_FP = 1 / ((1 − 0.5) + (0.5/1.6)) = 1 / 0.8125 = 1.23
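The two candidates can be compared directly in code. This is a sketch using the slide's fractions and factors; the function name `overall_speedup` is an assumption.

```python
def overall_speedup(fraction, improvement):
    """Overall speedup when `fraction` of the time is sped up by `improvement`."""
    return 1.0 / ((1.0 - fraction) + fraction / improvement)


speedup_fpsqr = overall_speedup(0.2, 10)   # FPSQR: 20% of the time, 10x faster
speedup_fp = overall_speedup(0.5, 1.6)     # all FP: 50% of the time, 1.6x faster
print(f"FPSQR: {speedup_fpsqr:.2f}  FP: {speedup_fp:.2f}")
```

The modest 1.6x improvement wins narrowly because it applies to a much larger share of the execution time.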


Make the Common Case Fast
•Very important concept in optimization
•Different events occur with varying frequencies
  −Optimization opportunity limited by the frequencies
  −Optimizing the rare cases is likely to have little effect
  −Don’t waste time optimizing something that rarely happens
  −Better return on investment to optimize frequent case
•Often the common case is simpler anyway


Example: Complex Comparison
•Compare an optimization that decreases the CPI of FPSQR to 2.0 to one that decreases the CPI of all FP operations to 2.5 given the following statistics:
FP operation frequency: 25%
FPSQR frequency: 2%
Average CPI of FP operations: 4.0
Average CPI of FPSQR: 20.0
Average CPI of all other instructions: 1.33
CPI_orig = (4 * 0.25) + (1.33 * 0.75) = 2.0
CPI_newFPSQR = CPI_orig − 0.02 * (20 − 2) = 1.64
CPI_newFP = (2.5 * 0.25) + (1.33 * 0.75) = 1.625
Speedup of 1.64/1.625 ≈ 1.01 (about 1%) for CPI_newFP over CPI_newFPSQR, because all other factors are held constant
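The arithmetic can be reproduced as a weighted average of per-class CPIs. The variable names below are assumptions; the numbers are the slide's. Unrounded results differ slightly from the slide, which rounds CPI_orig to 2.0 before continuing.

```python
# Instruction mix and CPIs from the slide.
FP_FREQ, FPSQR_FREQ = 0.25, 0.02
CPI_FP, CPI_FPSQR, CPI_OTHER = 4.0, 20.0, 1.33

cpi_orig = CPI_FP * FP_FREQ + CPI_OTHER * (1 - FP_FREQ)
# Option 1: only FPSQR improves, from CPI 20 to CPI 2.
cpi_new_fpsqr = cpi_orig - FPSQR_FREQ * (CPI_FPSQR - 2.0)
# Option 2: every FP operation improves, to an average CPI of 2.5.
cpi_new_fp = 2.5 * FP_FREQ + CPI_OTHER * (1 - FP_FREQ)

print(f"CPI orig      = {cpi_orig:.4f}")
print(f"CPI new FPSQR = {cpi_new_fpsqr:.4f}")
print(f"CPI new FP    = {cpi_new_fp:.4f}")
print(f"new FP vs new FPSQR: {cpi_new_fpsqr / cpi_new_fp:.4f}")
```

With instruction count and clock rate unchanged, the ratio of the two CPIs is the speedup of one option over the other.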


Top500 List
•What is the Top500 list?
  −Ranking of the 500 fastest computers in the world
  −Published every 6 months
•Fastest at what?
  −LINPACK benchmark
  −Linear algebra package
  −Designed to indicate peak speed
•Speed measured in FLOPS
  −FLoating point Operations Per Second


Top500 List
Rank  Site           Computer     Processors  Max (GFLOPS)
1     LLNL           BlueGene/L   131,072     280600
2     Sandia         Red Storm    26,544      101400
3     IBM TJ Watson  BlueGeneW    40,960      91290
4     LLNL           ASC Purple   12,208      75760
5     BSC            MareNostrum  10,240      62630
6     Sandia         Thunderbird  9,024       53000
7     CEA (France)   Tera-10      9,968       52840
8     NASA/Ames      Columbia     10,160      51870
9     GSIC (Tokyo)   TSUBAME      11,088      47380
10    ORNL           Jaguar       14,024      43480


Green500 List
•What is the Green500 list?
  −Green as in environmental
  −Counterpart to Top500 list that considers energy
  −Still in development
  −Founded by Dr. Wu-chun Feng here at VT
•How is it measured?
  −FLOPS/Watt
  −Quantifies energy efficiency


Green500 List
Rank  Top500 Rank  MFlops/W  Power (kW)  Computer
1     1            112.24    2,500       BlueGene/L
2     5            58.23     1,071       MareNostrum
3     10           32.67     1,311       Jaguar
4     4            15.26     3,400       Columbia
5     3            9.97      7,600       ASC Purple
6     90           3.58      2,040       ASC White
7     14           3.01      11,900      Earth Sim
8     40           1.36      10,200      ASC Q
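The FLOPS/Watt metric is a unit conversion followed by a division. The sketch below assumes the Max column on the Top500 slide is in GFLOPS; the function name `mflops_per_watt` is hypothetical. With the BlueGene/L figures from the two slides it reproduces the listed 112.24 MFlops/W; other rows may not match exactly if the two lists were built from different measurements.

```python
def mflops_per_watt(rmax_gflops, power_kw):
    """Energy efficiency in MFLOPS per watt, from GFLOPS and kilowatts."""
    mflops = rmax_gflops * 1_000.0   # 1 GFLOPS = 1,000 MFLOPS
    watts = power_kw * 1_000.0       # 1 kW = 1,000 W
    return mflops / watts


# BlueGene/L as listed on the slides: Max 280600 (assumed GFLOPS), 2,500 kW.
print(f"{mflops_per_watt(280_600, 2_500):.2f} MFlops/W")
```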


Price-Performance Evaluation
•If both performance and cost are known, we can then calculate price-performance
  −The “value” of the system
  −How much performance you are getting for the price
•Ideally, we want the fastest system
  −Realistically, we want the fastest for the price
•Price-performance is simply:
  −Performance / Price
  −Each metric measured in any units
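As a sketch of the ratio, the snippet below ranks two systems by performance per dollar. The system names, scores, and prices are invented for illustration and do not come from the slides.

```python
def price_performance(performance, price):
    """Performance delivered per unit of price (higher is better)."""
    return performance / price


# Hypothetical systems: (name, benchmark score, price in dollars).
systems = [("System A", 600.0, 2_000.0), ("System B", 700.0, 4_000.0)]
ranked = sorted(systems, key=lambda s: price_performance(s[1], s[2]), reverse=True)
for name, score, price in ranked:
    # Dividing the price by 1,000 expresses the result per $1K.
    print(f"{name}: {price_performance(score, price / 1_000):.0f} points per $1K")
```

System B is faster in absolute terms, yet System A ranks first because it delivers more performance for each dollar spent.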


Desktop Price-Performance
Vendor  Model          Processor        Clock Rate (MHz)  Price
Compaq  Presario 7000  Athlon           1,400             $2,091
Dell    Precision 420  PIII             1,000             $3,834
Dell    Precision 530  P4               1,700             $4,175
IBM     RS6000         IBM III-2        450               $13,889
Sun     Sunblade 100   UltraSPARC II-e  500               $2,950
•Price variations due to several things
  −Components (CPU, memory, hard drive)
  −Expandability
  −Commoditization


Desktop Price-Performance (INT)
[Chart: Price-Performance. Axes: SPEC CPU Performance (0 to 600) and SPEC/$1K (0 to 300). Systems, with approximate prices: Compaq Presario 7000 ($2K), Dell Precision 530 ($4K), Dell Precision 420 ($4K), IBM RS6000 ($14K), Sun Sunblade 100 ($3K)]
•Presario has best Price, Performance, and Price-Performance
•IBM RS6000 has highest Price and worst Price-Performance
  −Must be something we’re not seeing about the IBM RS6000


Desktop Price-Performance (FP)
[Chart: Price-Performance (FP). Axes: SPEC CPU Performance (0 to 700) and SPEC/$1K (0 to 250). Systems, with approximate prices: Compaq Presario 7000 ($2K), Dell Precision 530 ($4K), Dell Precision 420 ($4K), IBM RS6000 ($14K), Sun Sunblade 100 ($3K)]
•Improved FP performance on P4 compared to PIII is clear
  −Results in better value at the same price
•P4 outperforms the AMD in FP also
  −AMD still the better price-performance system


Price-Performance for OLTP
•One of the largest server markets is OLTP
  −On-Line Transaction Processing
  −Represented by the TPC-C benchmark
  −Performance measured in TPM (transactions per minute)
•Many interesting things about TPC-C
  −Very realistic
  −Measures total system performance
  −Specific running rules
  −Vendors must report both performance and price


Price-Performance for TPC
[Chart: Price-Performance (TPC-C). Axes: TPC-C (0 to 800) and TPM / $1K (0 to 50). Systems, with approximate prices: IBM xSeries ($15M), Compaq GS320 ($10M), Fujitsu 20000 ($10M), IBM pSeries ($8M), HP 9000 ($9M), IBM iSeries ($8M)]
•Best performance and price-performance by the IBM xSeries
  −This machine has 280 processors, second most is 48 processors
•Other than the xSeries, they are pretty comparable
  −pSeries has slightly better price-performance


Price-Performance for TPC
[Chart: Price-Performance (TPC-C). Axes: TPM / $1K (0 to 180) and TPC-C (0 to 60). Systems, with approximate prices: Dell PowerEdge ($131K), IBM xSeries ($300K), Compaq Proliant ($375K), HP NetServer ($375K), NEC Express ($680K), HP 9000 ($370K)]
•Smallest and cheapest system has the best price-performance
  −Definitely NOT the best performance though
•Compared to previous slide, much higher value
  −Almost 4 times better price-performance
  −Much worse performance than on previous slide (~10X)


Performance Counters
•What are performance counters?
  −Logic inside the processor to record events
  −Also known as hardware performance monitors, hardware event counters, and others
•Then what are events?
  −Events are anything that happens inside the CPU
  −E.g. stall cycles, bus accesses, cache misses
•What can they be used for?
  −Provide insight into the interaction between hardware and software
  −Show where an application likely has bottlenecks
  −Allow for optimizing a given application


Using Performance Counters
•There are many existing hardware event counter (HEC) libraries
  −PAPI, Perfctr, Perfmon, PACMAN(!), VTune, others…
  −All have specific niches
•Provide “high-level” access to counters
  −Avoids having to deal with bitmasks and writing them to registers
•Published lists of reference values
  −Can compare observed values for a given event against them to see if the event is likely to be a problem
  −Then comes the hard part…


Using Performance Counters
•Here is what the code often looks like:
start_recording(stall cycles, bus accesses);
// Perform whatever work you want to monitor…
work();
stop_recording(values);
for each entry in values: print entry;
•Real examples from PACMAN…


Simulation
•Model all of the components in software
  −Rather than actually implementing the system in hardware
  −Or at least the important components
  −Write code to model the registers, ALU, etc.
  −Different levels of detail possible
    •Full system vs. processor only
    •Highly accurate vs. statistical approximation
    •More detail is very computationally expensive
•Many popular simulators available
  −SIMICS, SESC, Turandot, HotSpot, others


Simulation
•Tweak the parameters to match your system
  −Change the size of the caches, the branch predictor, the number of registers, etc.
  −Cheaper than doing it in hardware
  −Faster than doing it in hardware
•Simulator reports anything you want to know
  −Execution time of the application
  −Amount of time spent using various resources
  −Number of accesses to each component
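A toy example shows what tweaking a parameter in a simulator looks like. The sketch below models only the tags of a direct-mapped cache and counts hits and misses for several cache sizes. It is far simpler than the simulators named on the previous slide, and the function name, the 16-byte line size, and the access trace are all assumptions made for illustration.

```python
def simulate_direct_mapped(addresses, num_lines, line_size=16):
    """Count hits and misses for a direct-mapped cache (tags only, no data)."""
    tags = [None] * num_lines
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size       # which memory block the address is in
        index = block % num_lines       # the one cache line that block may use
        tag = block // num_lines        # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag           # replace whatever was there
    return hits, misses


# Assumed workload: sweep a 1 KiB array four times, 4 bytes per access.
trace = [a for _ in range(4) for a in range(0, 1024, 4)]
for num_lines in (16, 32, 64, 128):
    hits, misses = simulate_direct_mapped(trace, num_lines)
    print(f"{num_lines * 16:>5} B cache: {hits} hits, {misses} misses")
```

Once the cache is large enough to hold the whole array, only the first sweep misses. Finding that kind of threshold would be costly to explore by building hardware.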


Summary
•You learned:
  −How to quantify the performance of a system
    •Formula for performance
  −How different metrics relate to overall performance
    •CPI vs. instruction count vs. clock frequency
    •MIPS and FLOPS
  −How to compare two systems using benchmarks
    •Finding benchmarks that represent your workload