copyright 2004 david j. lilja1 measurement tools and techniques fundamental strategies interval...
TRANSCRIPT
Copyright 2004 David J. Lilja 1
Measurement tools and techniques
Fundamental strategies Interval timers Program profiling Tracing Indirect measurement
Copyright 2004 David J. Lilja 2
Events
Most measurement tools based on events Some predefined change to system state
Definition depends on metric being measured Memory reference Disk access Change in a register’s state Network message Processor interrupt
Copyright 2004 David J. Lilja 3
Event Classification
Count metrics The number of times event X occurs Number of cache misses Number of I/O operations
Copyright 2004 David J. Lilja 4
Event Classification
Secondary-event metrics Record a value when triggered
by some event Record block size for each I/O
operation Count number of operations Find average I/O transfer size
Copyright 2004 David J. Lilja 5
Event Classification
Profiles Characterization of overall
behavior Aggregate/big picture view of an
application program Time spent in each function
Copyright 2004 David J. Lilja 6
Event-Driven Strategies
Record necessary information only when selected event occurs
Modify system to record event Dump data when program terminates
May need intermediate dumps also E.g. simple counter in page fault routine
Copyright 2004 David J. Lilja 7
Event-Driven Strategies
System overhead Only when the event of interest actually occurs Infrequent events → little perturbation Frequent events → high perturbation
No longer “typical” behavior? Perturbation changes system being measured
Copyright 2004 David J. Lilja 8
Event-Driven Strategies
Inter-event time is unpredictable Depends on when events actually occur Makes it hard to estimate perturbation How long to measure?
Event-driven measurement tools → Good for low-frequency events
Copyright 2004 David J. Lilja 9
Event-Driven Strategies
Counts 8 events exactly
+1 +1 +1 +1 +1 +1 +1 +1
Copyright 2004 David J. Lilja 10
Tracing
Similar to event-driven But record additional system state
Event has occurred – count Additional information to uniquely identify event E.g. addresses that cause page faults
Overhead Additional memory or disk storage Time to save state
Relatively large system perturbation
Copyright 2004 David J. Lilja 11
Tracing
Counts 8 events plus extra data
+1;Addr
+1;Addr
+1;Addr
+1;Addr
+1;Addr
+1;Addr
+1;Addr
+1;Addr
Copyright 2004 David J. Lilja 12
Sampling
Record necessary state at fixed time intervals Overhead
Independent of specific event frequency Depends on sampling frequency
Misses some events Produces statistical summary
May miss infrequent events Each replication will produce different results
Copyright 2004 David J. Lilja 13
Sampling
Counts 3 events out of 5 samples
+1 +1 +1
Copyright 2004 David J. Lilja 14
Comparisons
Event
countTracing Sampling
ResolutionExact count
Detailed info
Statistical
summary
Overhead Low High Constant
Perturbation ~ #events High Fixed
Copyright 2004 David J. Lilja 15
Comparison
Event counting Best for low frequency events Required if exact counts needed
Sampling Best for high frequency events If statistical summary is adequate
Tracing When additional detail is required
Copyright 2004 David J. Lilja 16
Indirect Measurements
Used when desired metric is not directly accessible
Measure one thing directly Derive or deduce desired metric
Highly dependent on creativity of performance analyst
Copyright 2004 David J. Lilja 17
Interval Timers
Fundamental tool of performance measurement
Measure execution time of any portion of a program
Provide time basis for sampling
Copyright 2004 David J. Lilja 18
Interval Timers
Actually count clock pulses between two events
Event 1 Event 2
x1= Counter x2 = Counter
Tc
Te=(x2 – x1)Tc
Copyright 2004 David J. Lilja 19
Using an Interval Timer
Within an application programStart_count = read_timer();
Portion of program to be measuredStop_count = read_timer();Elapsed_time = (stop_count – start_count) * clock_period;
Copyright 2004 David J. Lilja 20
Hardware Timer
Clock
n-bit counter
To CPU input port
Tc
Te=(x2 – x1)Tc
Copyright 2004 David J. Lilja 21
Software Timer
Te=(x2 – x1)Tc
Clock
T’c
Prescalar(divide-by-n)
Software counter
CPU interrupt inputTc
Copyright 2004 David J. Lilja 22
Quantization Errors
Copyright 2004 David J. Lilja 23
Quantization Error
Timer resolution
→ quantization error Repeated measurements
nTc < Te < (n+1)Tc
Te rounded to ± one clock tick
Completely unpredictable rounding → Want Tc to be as small as possible
Copyright 2004 David J. Lilja 24
Timer Rollover
n-bit counter count = [0, 2n-1]
Rollover = transition from (2n – 1) → 0 If rollover occurs between start/stop events
Then count = (x2 – x1) < 0
Check for count < 0 Measure again Add 2n to count
Copyright 2004 David J. Lilja 25
Timer Rollover
Counter width, n
Resolution
(Tc)
16 32 64
10 ns 655 us 43 s 58.5 cent
1 us 65.5 ms 1.2 h 5,580 cent
100 us 6.55 s 5 days 585,000 cent
1 ms 1.1 min 50 days 5,850,000 cent
Copyright 2004 David J. Lilja 26
Timer Overhead
Start_count = read_timer();Portion of program to be measured
Stop_count = read_timer();Elapsed_time = (stop_count – start_count) * clock_period;
To access timer Min of 1 memory read → subroutine call Min of 1 memory write → subroutine call
Once at start, again at stop
Copyright 2004 David J. Lilja 27
Timer Overhead
T1 T2 T3 T4
Eve
nt b
egi
ns;
Initi
ate
re
ad_
time
r()
Cu
rre
nt ti
me
act
ua
lly r
ea
d
Eve
nt e
nds
;In
itia
te r
ea
d_tim
e()
Cu
rre
nt ti
me
act
ua
lly r
ea
d
Eve
nt b
ein
g m
eas
ure
db
egin
s
Copyright 2004 David J. Lilja 28
Timer Overhead
T1 T2 T3 T4
T1 = time to read counter value
T2 = time to store counter value
T3 = time of the event we are measuring
T4 = time to read counter value T4 = T1
Copyright 2004 David J. Lilja 29
Timer Overhead
T1 T2 T3 T4
Te = event time = T3
But actually measured Tm = T2 + T3 + T4
Te = Tm – (T2 + T4) = Tm – (T1 + T2)
Timer overhead = Tovhd = (T1 + T2)
Copyright 2004 David J. Lilja 30
Timer Overhead
If Te >> Tovhd Ignore the timer overhead
If Te ≈ Tovhd
Measurements will be highly suspect
Potentially large variations in Tovhd Good rule of thumb
Te should be 100-1000x > Tovhd
Copyright 2004 David J. Lilja 31
Approximate Measures of Short Intervals
How to measure an event that is shorter than the resolution of the clock?
Cannot directly measure events with
Te < Tc
Overhead makes it hard to measure even when Te > nTc, n is small integer
Copyright 2004 David J. Lilja 32
Approximate Measures of Short Intervals
Tc
Te
Te
Case 1:Count+1
Case 2:Count+0
Copyright 2004 David J. Lilja 33
Approximate Measures of Short Intervals
Bernoulli experiment Outcome = +1 with probability p Outcome = +0 with probability (1-p) Equivalent to flipping a biased coin
Repeat n times Approximates a binomial distribution Only approximate since each measurement
cannot be guaranteed to be independent Usually close enough in practice
Copyright 2004 David J. Lilja 34
Approximate Measures of Short Intervals
m = number of times Case 1 occurs Count+1
n = total number of measurements Average duration is ratio of m/n Use confidence interval for proportions
ce Tn
mT
Copyright 2004 David J. Lilja 35
Example Clock resolution = 10 us n = 8764 measurements m = 467 clock ticks counted 95% confidence interval
10 us
?
?
Case 1:467
Case 2:8297
Copyright 2004 David J. Lilja 36
Example
)0580.0,0486.0(8764
8764467
18764467
96.18764
467),( 21
cc
Scale by clock period = 10 us 95% chance that measured event is
(0.49, 0.58) us
Copyright 2004 David J. Lilja 37
Profiling
Overall view of program’s execution-time behavior
Fraction of total time spent in specific states Fraction of time in each subroutine Fraction of time in OS kernel Fraction of time doing I/O
Find bottlenecks, code hot-spots Optimize those sections first
Copyright 2004 David J. Lilja 38
Statistical Sampling
Select a random subset of a population
Gather information on only this subset
Extrapolate this information to overall population
Results are a statistical summary with corresponding error probabilities
Copyright 2004 David J. Lilja 39
PC Sampling
Periodically interrupt program at fixed intervals Record appropriate state information in interrupt
service routine Post-process to obtain overall profile
+1 +1 +1
Copyright 2004 David J. Lilja 40
PC Sampling
At each interrupt Examine PC on return address stack Use address map to translate this PC to
subroutine i Increment array element H[i]
Addr map0-1298: Subr 11299-3455: Subr 23456-5567: Subr 35568-9943: Subr 4
PC: 4582 Histogramcounters:H[3]=H[3]+1
Copyright 2004 David J. Lilja 41
PC Sampling
0
20
40
60
80
100
120
140
H[1]
H[2]
H[3]
H[4]
H[5]
H[6]
H[7]
H[8]
H[9]
H[10]
H[11]
H[12]
Copyright 2004 David J. Lilja 42
PC Sampling
n total interrupts Post-processing step
H[i]/n = fraction of time executing in subroutine i (H[i]/n) * (interrupt period) = time in each
subroutine
Copyright 2004 David J. Lilja 43
PC Sampling
This is a statistical process Different counts each time the experiment is
performed Infer behavior of entire program from small
sample Apply confidence intervals to quantify
precision of results
Copyright 2004 David J. Lilja 44
Example 40 us interrupt 36,128 interrupts in subroutine A Program runs for 10 seconds Time in this subroutine?
90% confidence interval
m = 36,128 n = 10 sec / 40 us = 250,000 p = m/n = 0.144
Copyright 2004 David J. Lilja 45
Example
)146.0,144.0(250000
)855488.0(144512.0645.1144512.0),( 21
cc
90% chance that the program spent 14.4-14.6% of its time in subroutine A
Copyright 2004 David J. Lilja 46
Example 10 ms interrupt 12 interrupts in subroutine A n = 800 samples
8 seconds total execution time
Time in this subroutine? 99% confidence interval
p = m/n = 0.015
Copyright 2004 David J. Lilja 47
Example
)0261.0,0039.0(800
)015.01(015.0576.2015.0),( 21
cc
99% chance that the program spent 31-210 ms in subroutine A
A pretty wide range! But only <3% of total execution time Start optimizing somewhere else first
Copyright 2004 David J. Lilja 48
Reducing the Interval Size
Use a lower confidence level Obtain more samples
Run program longer May not be possible
Increase sample rate May be fixed by system Will increase overhead and perturbation
Run multiple times and add samples from each run
Copyright 2004 David J. Lilja 49
PC Sampling
Interrupts must occur asynchronously w.r.t. any program events Samples must be independent of each other Else over/under-sample events synchronous with interrupt
Periodic versus random sampling
+1 +1 +1
Copyright 2004 David J. Lilja 50
Basic Block Counting
Basic block Sequence of instructions with no branches into or
out of the block When first instruction is executed, guaranteed
that all instructions in block will be executed Single entry, single exit
Copyright 2004 David J. Lilja 51
Basic Block Counting
Generate a program profile by inserting additional instructions in each block Increment a unique counter each time a block is
entered Produces a histogram of program execution Can post-process to find instruction execution
frequencies
Copyright 2004 David J. Lilja 52
Comparison
PC samplingBasic block
counting
OutputStatistical estimate
Exact count
OverheadInterrupt service
routineExtra instructions
per block
PerturbationRandomly distributed
High
RepeatabilityWithin statistical
variancePerfect
Copyright 2004 David J. Lilja 53
Event Tracing
Profile shows overall frequency-of-execution behavior Ignores time-ordering of events
Program trace Dynamic list of events generated by program Events = anything you want to instrument
Sequence of memory addresses I/O blocks accessed
Typically used to drive a simulator
Copyright 2004 David J. Lilja 54
Trace Generation
Applicationprogram
Compress
Uncompress
Traceconsumer
Modify to generate trace
Copyright 2004 David J. Lilja 55
Trace Generation
Applicationprogram
Compress
Uncompress
Traceconsumer
Online traceconsumption
Modify to generate trace
Copyright 2004 David J. Lilja 56
Trace Generation
Source-code modification Allows precise control of what events are traced
and what data is recorded Typically a manual process
Sourcecode
Objectcode Proc TraceCompiler
Copyright 2004 David J. Lilja 57
Trace Generation
Software exceptions HW forces an exception before each instruction Exception routine decodes instruction
Store instr type, PC, operand addresses, etc. “Trace” bit in many processors Tremendous slowdown
Sourcecode
Objectcode Proc TraceCompiler
Copyright 2004 David J. Lilja 58
Trace Generation
Emulation Make a system appear to
be something else Modify emulator to
generate trace E.g. Java Virtual Machine
Sourcecode
Objectcode Proc TraceCompiler
Copyright 2004 David J. Lilja 59
Microcode modification Modify instruction execution directly Allows tracing of all instructions
Including operating system Depends on access to lower levels of the
processor E.g. Transmeta Crusoe processor
Trace Generation
Sourcecode
Objectcode Proc TraceCompiler
Copyright 2004 David J. Lilja 60
Trace Generation
Compiler modification Insert trace code directly in object file Requires access to the compiler itself
Sourcecode
Objectcode Proc TraceCompiler
Copyright 2004 David J. Lilja 61
Trace Generation
Compiler modification Insert trace code directly in object file Requires access to the compiler itself Write post-compilation binary editor/rewrite tool
Sourcecode
Objectcode Proc TraceCompiler
Copyright 2004 David J. Lilja 62
Trace Data
Tracing generates a tremendous volume of data
Trace 100,000,000 instrs/sec 16 bits of data per event 190 Mbytes of data per second
11 Gbytes per minute Huge perturbations
Due to tracing code Time to store trace data
Copyright 2004 David J. Lilja 63
Trace Data Compression
Standard compression algorithms as trace is written to disk
Uncompress when reading
Typical reduction 20-70%
Tradeoff is compress-uncompress time
Applicationprogram
Compress
Uncompress
Traceconsumer
Modify to generate trace
Copyright 2004 David J. Lilja 64
Online Trace Consumption
Use trace data as it is generated
Never stored on disk Multitasking may lead to
non-deterministic behavior Repeatability issue
Before-and-after comparison tests Difference due to change in
system or change in trace? Becomes statistical
comparison with n runs
Applicationprogram
Traceconsumer
Online traceconsumption
Modify to generate trace
Copyright 2004 David J. Lilja 65
Abstract Execution
Use higher-level information to intelligently compress trace info
Two-step process Compiler-style analysis to find critical subset of
trace Store only control flow information sufficient to
reconstruct trace later Produce trace-regeneration code for subsequent
use of trace
Copyright 2004 David J. Lilja 66
Abstract Execution
Trace will be either 1-2-4 1-3-4
Store only 2 or 3 Combine with compiler-
generated control flow graph to regenerate trace
Slowdown = 2-10x Compress = 10-100x
1. if (i > 5)2. then a = a + i;3. else b = b + i;4. i = i + 1;
1. if (i>5)
2. a=a+i 3. b=b+i
4. i=i+1
Copyright 2004 David J. Lilja 67
Trace Sampling Save only subsequences of overall trace Drive simulator with samples Results should be statistically similar to
driving with complete trace One sample = k consecutive events Sampling interval = P (period)
k k
P
Copyright 2004 David J. Lilja 68
SimPoint
Find “representative” program samples Match basic block execution frequencies Clustering tool to automate process
Perform detailed timing simulation on only these samples
Fast-forward (functional simulation) between samples
[Sherwood et al, ASPLOS, 2002]
Copyright 2004 David J. Lilja 69
SimPoint Weight each sample’s result by execution
frequency to produced overall result Relatively small number (10s) of SimPoints
produced ≈3% error in IPC on SPEC
Copyright 2004 David J. Lilja 70
SMARTS Uses systematic sampling
Fixed sample interval Apply statistical sampling techniques to
determine j, k, P
k
P
k
Functional simulation
Detailed simulation
jj
Copyright 2004 David J. Lilja 71
Indirect Ad Hoc Techniques
Sometimes the desired metric cannot be measured directly
Use your creativity to measure one thing and then derive/infer the desired value
Copyright 2004 David J. Lilja 72
Example – System Load
What is system load? Number of jobs in run queue? Number of jobs actively time-sharing? Fraction of time processor is not in idle loop? Others?
How to measure it? Modify OS PC sampling Indirect?
Copyright 2004 David J. Lilja 73
Example
Let system run for fixed time T Note value of counter
Monitor
Count
n
T
Copyright 2004 David J. Lilja 74
Example
Let system run for fixed time T Compare value of loaded system monitor
counter to unloaded system count value
Monitor
Monitor
App 1
Count
n
n/2
T
Copyright 2004 David J. Lilja 75
Example
Let system run for fixed time T Compare value of loaded system monitor
counter to unloaded system count value
Monitor
Monitor
App 1
App 1
App 2
Monitor
Count
n
n/2
n/3
T
Copyright 2004 David J. Lilja 76
Perturbation
To obtain more information (higher resolution) → Use more instrumentation points
More instrumentation points → Greater perturbation
Copyright 2004 David J. Lilja 77
Perturbation
Computer performance measurement uncertainty principle Accuracy is inversely proportional to
resolution.
Resolution
Acc
ura
cy
Low
High
High
Copyright 2004 David J. Lilja 78
Perturbation
Superposition does not work here Non-linear Non-additive
Double instrumentation ≠ double impact on performance Some instrumentation cancels out Some multiplies impact
No way to predict!
Copyright 2004 David J. Lilja 79
Instrumentation Code
Changes memory access patterns Affects memory banking optimizations
Generates additional load/store instructions More frequent cache flushes and replacements But may reduce set associativity conflicts
Generates more I/O operations Will increase overall execution time
More time-sharing context switches Alters virtual memory paging behavior
Copyright 2004 David J. Lilja 80
Important Points
Event types Simple counts of primary event Secondary events triggered by some primary
event Overall profiles
Copyright 2004 David J. Lilja 81
Important Points
Measurement strategies Event-driven Tracing Sampling Indirect approaches
Copyright 2004 David J. Lilja 82
Important Points
Interval timers Stopwatch functionality Rollover problem Overhead Quantization errors
Statistical measures of short intervals
Copyright 2004 David J. Lilja 83
Important Points
Profiling PC sampling
Statistical view Basic block counting
Exact behavior High overhead and perturbation
Copyright 2004 David J. Lilja 84
Important Points
Trace generation Source code modification Force exceptions Emulation Microcode modification Compiler modification Object code editor
Online trace consumption Trace sampling
Copyright 2004 David J. Lilja 85
Important Points
Indirect measurements when all else fails System load example
Perturbations Nobody likes them Have to learn to live with them