Program Profiling: Applications, Algorithms and Tools
Thomas Ball, Microsoft Research
May 2001
Overview
Why profile? Applications
What to profile? Algorithms
How to profile? Infrastructure/Tools
Future directions for profiling
Why Profile?
"If a given portion of a program has no observable effects, then you have no way of knowing if it is executing, if it has finished, if it got part way through and then stopped, or if it produced 'the right answer.' Programmers nearly always must rely on highly indirect measures to determine what happens when their programs execute. This is one reason why debugging is so difficult."
[Digital Woes, Lauren Ruth Wiener, 1993, Addison-Wesley]
Example I: Mystery Code
#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l+,/n{n+,/+#n+,/#\;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;# \){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \}'+}##(!!/"):t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1):0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);}
Running the Program
On the first day of Christmas my true love gave to me
a partridge in a pear tree.
On the second day of Christmas my true love gave to me
two turtle doves
and a partridge in a pear tree.
...
On the twelfth day of Christmas my true love gave to me
twelve drummers drumming, eleven pipers piping, ten lords a-leaping,
nine ladies dancing, eight maids a-milking, seven swans a-swimming,
six geese a-laying, five gold rings;
four calling birds, three french hens, two turtle doves
and a partridge in a pear tree.
Pretty Printed Code
#include <stdio.h>
main(t,_,a) char *a;
{
  if ((!0) < t) {
    if (t < 3) main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a));
    if (t < _ ) main(t+1,_,a);
    if (main(-94,-27+t,a)) {
      if (t==2 ) {
        if ( _ < 13 ) { return main(2,_+1,"%s %d %d\n");} else { return 9;}
      } else
        return 16;
    } else return 0;
...
PP Path Profiling Tool
Instruments Sparc/Solaris executables
  records intraprocedural paths
  based on EEL instrumentation technology
Usage:
  pp a.out          // produces a.out.pp
  a.out.pp          // produces a.out.Paths
  pp_stats a.out    // produces path statistics from a.out.Paths file
Feb. 1997 8
How often does a control-flow path execute?
Levels of profiling:
  blocks: statements & lines
  edges: branches & blocks
  paths: sequence of edges & blocks
Example: Path Profiling
[Figure: control-flow graph with blocks A-F, annotated with the counts 400, 343, and 57]
Naive Path Profiling
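[Figure: control-flow graph with blocks A-F, each instrumented to append its name to a buffer: put("A"), put("B"), put("C"), put("D"), put("E"), and put("F"); record_path(); in the exit block. Example buffer contents: A B D F]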
Efficient Path Profiling
[Figure: control-flow graph with blocks A-F, with edges instrumented by r = 0, r = 2, r = 4, and r += 1, and the exit block by count[r]++]
Path      Encoding
ACDF      0
ACDEF     1
ABCDF     2
ABCDEF    3
ABDF      4
ABDEF     5
[Ball/Larus, MICRO 96]
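The encoding table can be reproduced mechanically. The following is a minimal C sketch of an edge-value assignment in the style of [Ball/Larus, MICRO 96]: blocks are visited in reverse topological order, and each outgoing edge receives the number of paths counted so far for its source block. The successor order in `succ` and all identifiers (`assign_edge_values`, `path_sum`, `edge_val`, `num_paths`) are assumptions made for illustration, chosen so that the sums match the table. The `r = 4` on the slide is presumably an optimized form of adding 2 on the edge B -> D when `r` is already known to be 2.

```c
#include <stddef.h>

#define N_BLOCKS 6   /* blocks A..F are indexed 0..5 */
#define MAX_SUCC 2   /* no block in this CFG has more than two successors */

/* Successor lists of the example CFG; -1 marks an unused slot.
   The order inside each list is an assumption: it decides which edge
   gets which value, and was picked to reproduce the encoding table. */
static const int succ[N_BLOCKS][MAX_SUCC] = {
    {2, 1},    /* A -> C, A -> B */
    {2, 3},    /* B -> C, B -> D */
    {3, -1},   /* C -> D */
    {5, 4},    /* D -> F, D -> E */
    {5, -1},   /* E -> F */
    {-1, -1}   /* F: exit block */
};

static int num_paths[N_BLOCKS];           /* number of paths from block to exit */
static int edge_val[N_BLOCKS][MAX_SUCC];  /* increment placed on each edge */

/* Every edge goes from a lower to a higher index, so counting down
   visits the blocks in reverse topological order. */
void assign_edge_values(void)
{
    for (int v = N_BLOCKS - 1; v >= 0; v--) {
        if (succ[v][0] < 0) {           /* exit block: exactly one path */
            num_paths[v] = 1;
            continue;
        }
        num_paths[v] = 0;
        for (int k = 0; k < MAX_SUCC && succ[v][k] >= 0; k++) {
            edge_val[v][k] = num_paths[v];
            num_paths[v] += num_paths[succ[v][k]];
        }
    }
}

/* Sum of edge values along a path such as "ABDF"; returns -1 if the
   string is not a walk in the CFG. Call assign_edge_values() first. */
int path_sum(const char *path)
{
    int r = 0;
    if (path[0] < 'A' || path[0] > 'F')
        return -1;
    for (size_t i = 0; path[i + 1] != '\0'; i++) {
        int v = path[i] - 'A';
        int w = path[i + 1] - 'A';
        int found = 0;
        for (int k = 0; k < MAX_SUCC; k++) {
            if (succ[v][k] >= 0 && succ[v][k] == w) {
                r += edge_val[v][k];
                found = 1;
            }
        }
        if (!found)
            return -1;
    }
    return r;
}
```

After `assign_edge_values()`, `num_paths[0]` holds the number of distinct entry-to-exit paths, which is the size needed for the `count[]` array.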
Path Regeneration
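Given path sum P, which path produced it?
[Figure: a block v with outgoing edges to w1, w2, w3, from which n1, n2, n3 paths lead to Exit; the three edges carry the values 0, n1, and n1+n2]
[Figure: the example control-flow graph with blocks A-F, tracing a path sum from P = 3 through P = 1 down to P = 0]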
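Regeneration can be sketched as a walk from the entry block: at each block, follow the outgoing edge with the largest value that does not exceed the remaining sum P, then subtract that value. The C sketch below hard-codes the edge values implied by the path encoding table of the example (A -> B carries 2, B -> D carries 2, D -> E carries 1, all other edges 0). The name `regenerate_path` and the table layout are illustrative assumptions.

```c
#include <stddef.h>

#define MAX_OUT 2

struct out_edge { int to; int val; };   /* to == -1 marks an unused slot */

/* Outgoing edges per block (A..F = 0..5), sorted by increasing value. */
static const struct out_edge out_edges[6][MAX_OUT] = {
    {{2, 0}, {1, 2}},     /* A -> C (0), A -> B (2) */
    {{2, 0}, {3, 2}},     /* B -> C (0), B -> D (2) */
    {{3, 0}, {-1, 0}},    /* C -> D (0) */
    {{5, 0}, {4, 1}},     /* D -> F (0), D -> E (1) */
    {{5, 0}, {-1, 0}},    /* E -> F (0) */
    {{-1, 0}, {-1, 0}}    /* F: exit block */
};

/* Writes the path with sum p into buf as a string such as "ABDF".
   Returns 0 on success, -1 if p is not a valid path sum or buf is
   too small (the longest path has 6 blocks plus the terminator). */
int regenerate_path(int p, char *buf, size_t buflen)
{
    size_t n = 0;
    int v = 0;                          /* start at entry block A */

    if (p < 0 || buflen < 7)
        return -1;
    buf[n++] = 'A';
    while (out_edges[v][0].to >= 0) {
        int pick = 0;
        /* Edges are sorted by value, so the last one that fits is
           the largest value not exceeding the remaining sum. */
        for (int k = 1; k < MAX_OUT; k++) {
            if (out_edges[v][k].to >= 0 && out_edges[v][k].val <= p)
                pick = k;
        }
        p -= out_edges[v][pick].val;
        v = out_edges[v][pick].to;
        buf[n++] = (char)('A' + v);
    }
    buf[n] = '\0';
    return p == 0 ? 0 : -1;             /* leftover sum: p was out of range */
}
```

For example, a sum of 3 leads through A -> B (remaining 1), B -> C, C -> D, D -> E (remaining 0), and E -> F.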
PP Run-time Overhead
[Chart: normalized execution time (axis from 0 to 2) per benchmark for PP and QPT2, with a "% Hash" label per benchmark]
o 12 days of Christmas
o 66 occurrences of non partridge-in-a-pear-tree gifts
o 114 strings printed
o 2358 characters printed (wc -c)
What can we learn from a profile?
Profile partitions program into low and high frequency clusters of paths
  Number of verses vs. string searching
Profile identifies paths related to frequencies seen in analysis of program output
  Printing a string or character
Inefficiencies pop out
  hidden O(n^2) algorithms
Example II: Profiling Bebop
Bebop performs reachability analysis of boolean programs
Uses symbolic version of [Reps-Horwitz-Sagiv, POPL’95] interprocedural data flow analysis
  Explicit representation of control flow
  Implicit representation of reachable states via BDDs
Complexity of algorithm is O(E * 2^n)
  E = size of interprocedural control flow graph
  n = max. number of variables in the scope of any label
Bebop
Exploits procedural abstraction
  number of globals bounded by g
  number of locals bounded by h
  O(E * 2^(g+h)) = O(E) when g and h are bounded by constants
Expect space usage and time for model checking to be linear in size of program
decl g;
void main()
begin
  level1();
  level1();
  if(!g) then
    reach: skip;
  else
    skip;
  fi
end

void level<i>()
begin
  decl a,b,c;
  if (g) then
    while(!a|!b|!c) do
      if (!a) then a := 1;
      elsif (!b) then a,b := 0,1;
      elsif (!c) then a,b,c := 0,0,1;
      else skip;
      fi
    od
  else
    <stmt>; <stmt>;
  fi
  g := !g;
end
Simple Profiling
Memory usage
  BDD libraries report on peak space usage and many other useful statistics
Wall time
  cygwin “time bebop …”
  Visual Studio profiler
  IceCap: internal Microsoft profiling
Peak Live BDD Nodes for T(N)
[Chart: peak space for T(N), in live BDD nodes (axis from 0 to 250000), plotted against N (0 to 1000)]
Running time for T(N)
[Chart: running time for T(N) in seconds (axis from 0 to 300), plotted against N (0 to 1000), with one curve each for the CU and CMU BDD libraries]
A Lesson: Profiling is critical when reusing code
Eliminated various stupidities in our code
  still had O(n^2) time behavior!
Profiling narrowed cause down to BDD libraries
  assume small number of “managed” BDD variables
  BDD operations generally are O(size of BDD)
  But, in both CMU and CU some BDD operations loop through all managed variables, whether or not they appear in a BDD!
Overview
Why profile? Applications
What to profile?
How to profile? Infrastructure/tools
Future directions for profiling
Applications -> What to profile?
Control-flow profiles
  Trace scheduling [Ellis]
  Code positioning [Pettis/Hansen]
  Improving dataflow analysis [Ammons/Larus]
Value profiles
  superscalar architecture [Sodani/Sohi]
  method specialization [Dean, et al.]
Address profiles
  Improving D-cache performance [Calder, et al.]
Communication profiles
  Component placement [Hunt/Scott]
Control-flow Profiles and Optimization
Two main ideas:
Optimize the hot paths (procedures, etc.)
  superblocks
  VLIW
  classic compiler optimizations
  dataflow analysis
Separate the hot from the cold
  affinity graph
  temporal relationship graph
Trace Scheduling for VLIW
VLIW: Very Long Instruction Word
  execute multiple instructions in single step
  need large basic blocks to fully utilize provided “width”
Problem: conditional branches
Solution:
  use edge/branch profiles to form “superblocks”
  schedule instructions within superblock
  generate compensation code to fix up state when prediction is wrong
early form of “superscalar+speculation”
complexity pushed to compiler (IA-64)
Trace Scheduling: Code Motion Across Basic Blocks
[Figure: before scheduling, the trace is a = b + c; if x > 10; d = a - 3; f = a * 3; g = d + 2. After scheduling, a = b + c appears below the conditional jump, with a second copy of a = b + c off the trace]
Trace Scheduling Rules
If a trace op. moves below a conditional jump then place copy on off-trace edge of jump
A trace op. that writes to x can’t move above a conditional jump if x is live on off-trace edge of jump
etc.
Chang, Mahlke, Hwu: IMPACT compiler
Superblock formation
  based on edge profiles and greedy algorithm
Tail duplication
  eliminate control-flow merges into middle of superblock
Classic compiler optimizations
  constant prop., CSE, loop induction vars.
Superblock formation
[Figure: control-flow graph with blocks A-F, annotated with edge frequencies (100, 90, 10, 1, 0)]
Tail Duplication
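[Figure: the same control-flow graph after tail duplication; block F is copied as F’, and the affected edge frequencies are shown as ranges (89-90, 9-10, 0-1)]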
Optimizations
x=c
y=x
z=x/a
Local opts.
  const/copy propagation
  CSE
  redundant load/store removal
Dead code removal
Loop opts.
  hoist loop-invariant code
    don’t hoist past conditional if exception possible
  induction var. elimination
Pettis/Hansen: Profile Guided Code Positioning
Goal
  reduce working set size, TLB misses and I-cache misses
Three techniques
  procedure positioning
  basic block positioning
  procedure splitting
Procedure Positioning
Profile calls between procedures via link-time insertion of monitoring code
Construct undirected, weighted call graph
Greedy algorithm to merge nodes
  “closest is best” strategy
  if P calls Q frequently, want P and Q to reside close to one another in executable
Merge until one node left
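Example
[Figure: undirected, weighted call graph over procedures A-H]
Example
[Figure: the call graph after merging A and D into a single node “A,D”]
Example
[Figure: the call graph after also merging C and F into a single node “C,F”]
Candidate orderings when merging “A,D” with “C,F”: A-D-C-F, A-D-F-C, D-A-C-F, D-A-F-C
[Figure: the original call graph over procedures A-H, shown next to the candidate orderings]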
Basic Block Positioning
Separate hot blocks from cold blocks
  edge/branch profiles
  reorganize blocks so that “normal” control-flow is straight-line code
Create chains (superblocks)
  consider edges in descending order of frequency, merging chains when possible
  start with procedure entry, create chains by considering outgoing edge with highest frequency
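The first chain-building strategy (edges in descending order of frequency) can be sketched in C as follows. This is a simplified illustration, not the Pettis/Hansen implementation: an edge joins two chains only when its source is currently the last block of one chain and its destination is the first block of a different chain. The `struct cfg_edge` layout, `form_chains`, and the `next`/`prev` arrays are assumed names.

```c
#include <stdlib.h>

struct cfg_edge { int src; int dst; long freq; };

/* qsort comparator: higher frequency first. */
static int by_freq_desc(const void *a, const void *b)
{
    long fa = ((const struct cfg_edge *)a)->freq;
    long fb = ((const struct cfg_edge *)b)->freq;
    return (fa < fb) - (fa > fb);
}

/* Follow prev[] links back to the first block of the chain. */
static int chain_head(const int *prev, int block)
{
    while (prev[block] >= 0)
        block = prev[block];
    return block;
}

/* On return, next[b] is the block laid out directly after b in its
   chain (or -1), and prev[b] the block directly before it (or -1).
   The edges array is sorted in place. */
void form_chains(struct cfg_edge *edges, size_t n_edges, int n_blocks,
                 int *next, int *prev)
{
    for (int b = 0; b < n_blocks; b++)
        next[b] = prev[b] = -1;

    qsort(edges, n_edges, sizeof edges[0], by_freq_desc);

    for (size_t i = 0; i < n_edges; i++) {
        int s = edges[i].src;
        int d = edges[i].dst;
        if (next[s] >= 0 || prev[d] >= 0)
            continue;                   /* s is not a tail or d is not a head */
        if (chain_head(prev, s) == d)
            continue;                   /* linking would close a cycle */
        next[s] = d;
        prev[d] = s;
    }
}
```

With a hypothetical profile in which A -> B, B -> D, and D -> F are the hot edges, the result is one hot chain A, B, D, F, while the cold blocks C and E remain single-block chains that can be placed elsewhere.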
Dataflow Analysis
Data flow functions
Composition of functions along a path
Merge operators to combine results from multiple paths
[Figure: a path with data flow functions f, g, h applied in sequence to the initial value 0, giving (h o g o f)(0)]
Restructuring for Path-sensitive Data Flow
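[Figure: control-flow graph with blocks A-F before restructuring, and the restructured graph in which blocks C and E are duplicated]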
[Ammons, Larus]
Ammons/Larus Algorithm
Duplicate hot paths to eliminate merges into middle of hot paths
  as with superblock optimizations
Perform traditional DFA on new CFG
Compact the CFG
  merge duplicated CFG nodes that have equivalent dataflow results
Applications -> What to profile?
Control-flow profiles
Value profiles
  superscalar architecture [Sodani/Sohi]
  method specialization [Chambers, et al.]
Address profiles
Communication profiles
Value Profiles
Observation
  many (static) instructions compute the same value over and over again!
Why?
  regularity in input data
  repeated traversal of structures
Applications
  architecture: cache results in buffer, to eliminate redundant work
  compilers: partially evaluate code with respect to commonly occurring constants
Example
s4 = search(l,4);
s6 = search(l,6);

bool search(list* l, int i) {
  while(l != NULL) {
    if (l->val == i) return true;
    l = l->next;
  }
  return false;
}
Some Numbers on Redundancy
[Sodani-Sohi, profile produced with SimpleScalar simulator]
Selective Specialization for OO Languages [Dean/Chambers/Grove]
Naïve customization
  given method m accepting argument of class A, superclass of B, superclass of C
  generate m w.r.t. A, B and C
  code explosion
Specialization
  compile a method multiple times, based on value/dynamic type of commonly passed parameters
  use profiles to address cost/benefit
Technique
Collects gprof-style information
  weighted call graph
  node = message send
  edge = actual method receiver
Algorithm focuses on high-weight sends
  dynamic dispatch
  actual “pass-through” to formal
Applications -> What to profile?
Control-flow profiles
Value profiles
Address profiles
Communication profiles
Address Profiles
Want to change location of objects to minimize cache misses [Calder et al.]
Addresses change with inputs
How to name objects?
  global variables
  stack variables
  heap objects
    address of malloc site XOR’d with a few return addresses on stack [Barrett/Zorn, Lebeck/Wood]
Temporal Relationship Graph [Gloy, et al.]
Undirected, weighted graph
  nodes = objects
  (v,n,w) = n cache misses if v and w were mapped to same cache location
[Figure: the trace o1 o2 o3 o2 o3 o1 o2 o3 o1, successive states of a queue of recently referenced objects, and the resulting weighted graph over o1, o2, o3]
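One way to build such a graph from a reference trace is sketched below in C, assuming a queue ordered by recency: when an object is referenced again, every object referenced since its previous reference gets its edge to that object incremented, and the object moves to the front. This is an illustration of the idea rather than the exact algorithm of [Gloy, et al.]; in particular the queue here is unbounded, whereas a practical version would presumably limit it to objects that could still be in the cache. `build_trg`, `weight`, and `MAX_OBJS` are assumed names.

```c
#include <stddef.h>
#include <string.h>

#define MAX_OBJS 8   /* object ids must lie in 0 .. MAX_OBJS-1 */

/* Symmetric edge weights of the temporal relationship graph. */
static unsigned weight[MAX_OBJS][MAX_OBJS];

void build_trg(const int *trace, size_t len)
{
    int queue[MAX_OBJS];   /* queue[0] is the most recently referenced object */
    size_t qlen = 0;

    memset(weight, 0, sizeof weight);
    for (size_t t = 0; t < len; t++) {
        int obj = trace[t];
        size_t pos = 0;

        if (obj < 0 || obj >= MAX_OBJS)
            continue;                    /* ignore ids outside the table */
        while (pos < qlen && queue[pos] != obj)
            pos++;
        if (pos < qlen) {
            /* queue[0..pos-1] were referenced since obj was last used */
            for (size_t j = 0; j < pos; j++) {
                weight[obj][queue[j]]++;
                weight[queue[j]][obj]++;
            }
        } else {
            qlen++;                      /* first reference to obj */
        }
        /* Move obj to the front, shifting the more recent entries back. */
        for (size_t j = pos; j > 0; j--)
            queue[j] = queue[j - 1];
        queue[0] = obj;
    }
}
```

Running this over the nine-reference trace from the slide, with o1, o2, o3 numbered 1, 2, 3, gives weight 4 for the pair (o2, o3) and weight 3 for each of (o1, o2) and (o1, o3).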
Object Placement Algorithm
Similar in spirit to [Pettis/Hansen]
Visit edges in TRG in decreasing frequency
  determine placement of objects
    minimize conflict misses
    based on placement of previously placed objects
    uses original TRG
  coalesce nodes, sum edge weights
Run-time optimization
Customized malloc
  computes “name” of malloc site
    malloc address XOR 4 return addresses
  allocation bins to segregate objects with different names
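The naming step can be sketched in a few lines of C. The sketch assumes the caller has already collected the return addresses; how they are obtained (walking the stack) is platform-specific and left out. `site_name`, `bin_for_name`, and the bin count `N_BINS` are hypothetical names and values, not part of the system described on the slide.

```c
#include <stddef.h>
#include <stdint.h>

#define N_BINS 16u   /* hypothetical number of allocation bins */

/* Name of an allocation site: the address of the malloc call site
   XOR'd with n return addresses taken from the stack. */
uintptr_t site_name(uintptr_t malloc_site,
                    const uintptr_t *return_addrs, size_t n)
{
    uintptr_t name = malloc_site;
    for (size_t i = 0; i < n; i++)
        name ^= return_addrs[i];
    return name;
}

/* Map a name to one of the bins so that objects with different
   names tend to be kept apart. The modulo mapping is an assumption. */
unsigned bin_for_name(uintptr_t name)
{
    return (unsigned)(name % N_BINS);
}
```

Because XOR is commutative, the name does not depend on the order in which the return addresses are combined; two calls reaching the same malloc site through different call chains generally get different names.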
Applications -> What to profile?
Control-flow profiles
Value profiles
Address profiles
Communication profiles
Component Placement [Hunt/Scott]
COM
  binary standard for interoperation
  forms basis of Microsoft products
DCOM
  transparent, distributed COM components
  based on RPC
  deep-copy semantics
Goal:
  take an application based on COM and split into client/server using DCOM
Coign
Binary rewriting to insert measurement code at COM interfaces
  measures amount of data transmitted between COM components (if they were distributed)
Analysis
  profile data forms communication graph
  component constraints
  min-cut algorithm separates components into client & server
  harder for >2 machines
Overview
Why profile? Applications -> What to profile? Algorithms
How to profile? Infrastructure/tools
Future directions for profiling
How to Profile?
Main issues
  precision vs. cost
  subsequent data analysis (inferences)
  perturbation
Instrumentation
  accurate frequency, moderate overhead
  theory of spanning trees and flow preservation
  low overhead path profiling [Bala/Duesterwald]
Sampling
  very low overhead
  Digital Continuous Profiling Infrastructure
Other techniques
  Run-time patching
  Hardware support
  Simulation
Edge Profiling via Spanning Trees
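[Figure: control-flow graph with blocks A-F and a spanning tree of its edges; only the edges outside the tree are instrumented, with the counters ac++, bc++, bd++, and de++]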
Recovering Missing Counts via Kirchhoff’s Law
Known counts
  F(AC) = ac
  F(BC) = bc
  F(BD) = bd
  F(DE) = de
Inferred counts
  F(CD) = ac+bc
  F(AB) = bc+bd
  F(FA) = F(AB)+ac
  F(DF) = F(CD)+bd-de
  F(EF) = de
Bottom-up pass over spanning tree
[Figure: the control-flow graph with blocks A-F, with the counters ac++, bc++, bd++, and de++ on the instrumented edges]
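The inferred counts follow from flow conservation: at every block, the counts on incoming edges sum to the counts on outgoing edges. A minimal C sketch of the recovery is given below. Instead of an explicit bottom-up tree traversal it repeatedly looks for a block with exactly one unknown incident edge and solves for it, which works when the unknown edges form a spanning tree, because a tree always has a leaf. The `struct flow_edge` layout and `recover_counts` are assumed names, and the edge F -> A stands for the return from exit to entry, as in F(FA) on the slide.

```c
#include <stddef.h>

struct flow_edge {
    int  from, to;   /* block indices */
    long count;      /* valid only if known != 0 */
    int  known;      /* 1 for instrumented edges, 0 for edges to infer */
};

/* Fills in the unknown counts in place. Returns the number of edges
   that could not be determined (0 on full success). */
int recover_counts(struct flow_edge *e, size_t n_edges, int n_blocks)
{
    int progress = 1;
    int remaining = 0;

    while (progress) {
        progress = 0;
        for (int v = 0; v < n_blocks; v++) {
            long in = 0, out = 0;
            int unknown = 0;
            size_t which = 0;

            for (size_t i = 0; i < n_edges; i++) {
                if (e[i].from != v && e[i].to != v)
                    continue;
                if (!e[i].known) {
                    unknown++;
                    which = i;
                    continue;
                }
                if (e[i].to == v)   in  += e[i].count;
                if (e[i].from == v) out += e[i].count;
            }
            /* Solvable only with a single unknown that is not a self-loop. */
            if (unknown != 1 || e[which].from == e[which].to)
                continue;
            e[which].count = (e[which].to == v) ? out - in : in - out;
            e[which].known = 1;
            progress = 1;
        }
    }
    for (size_t i = 0; i < n_edges; i++)
        if (!e[i].known)
            remaining++;
    return remaining;
}
```

With hypothetical counter values ac = 3, bc = 2, bd = 5, de = 4, the sketch yields F(AB) = 7, F(CD) = 5, F(DF) = 6, F(EF) = 4, and F(FA) = 10, in agreement with the formulas listed under "Inferred counts".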
Other Applications of Spanning Tree
Basic block profiling with block counters, edge counters, …
Counting events
Path profiling
Static estimation of frequencies
Minimizing trace overhead
Path Profiling in Dynamo: Less = More
Run-time optimization
  On-line vs. off-line profiling
Goal
  predict what will be a hot path with little overhead
Technique
  profile block at head of path
  when head becomes hot, use incremental instrumentation to find hot path
Works very well
Profiling by Sampling: DCPI
Modified (Alpha) kernel
  device driver handles 1500 interrupts/sec.
low overhead sampling
  periodically samples PC
  1-3% of CPU
  “moderate” amount of disk storage
records performance counters
  I-cache, D-cache, cycles, branch mispredictions, …
Lots of careful engineering to reduce time spent (space used) per sample
Data Analysis
What does large sample count for a PC mean?
  high frequency
  lots of stalls
    branch mispredict
    I-cache, D-cache miss
  static vs. dynamic?
DCPI generates for each instruction
  frequency
  CPI
  set of likely “culprits”
Estimating Frequency and CPI
F_i: frequency of instruction i
C_i: average number of cycles instruction i spent at head of instruction queue
S_i: sample count for instruction i
Approach
  S_i ≈ F_i * C_i
  find F_i, then obtain C_i
Estimating F_i
Identify basic blocks that are guaranteed to execute same number of times
  cycle-equivalence algorithm
Estimate frequency for blocks in same class
Propagate frequencies using spanning tree
Estimate M_i
Estimate M_i, the min. # of cycles instruction i spends at head of issue queue
  based on scheduling basic blocks according to processor model
  assumes some instruction in class(i) does not incur stall
F_i ≈ S_i / M_i
Problems
  no/few samples in class(i)
  multiple control predecessors
  dynamic stalls
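The estimate can be illustrated with a small C sketch. It relies on the stated assumption that at least one instruction in the equivalence class never stalls: for that instruction the ratio of sample count to minimum cycles is the frequency, while stalled instructions have inflated ratios, so the smallest ratio is taken. Taking a plain minimum and skipping instructions with no samples are simplifications made here; the heuristics in DCPI itself are likely more elaborate. `estimate_frequency` and `estimate_cpi` are assumed names.

```c
#include <stddef.h>

/* Frequency estimate for one equivalence class: the smallest
   samples[i] / min_cycles[i] over instructions that have samples.
   Returns -1.0 if no instruction in the class is usable. */
double estimate_frequency(const long *samples, const long *min_cycles,
                          size_t n)
{
    double best = -1.0;
    for (size_t i = 0; i < n; i++) {
        double ratio;
        if (samples[i] <= 0 || min_cycles[i] <= 0)
            continue;                   /* no samples, or no cycle estimate */
        ratio = (double)samples[i] / (double)min_cycles[i];
        if (best < 0.0 || ratio < best)
            best = ratio;
    }
    return best;
}

/* Average cycles at the head of the queue for one instruction,
   from its sample count and the class frequency. */
double estimate_cpi(long samples_i, double frequency)
{
    return frequency > 0.0 ? (double)samples_i / frequency : -1.0;
}
```

For three instructions with sample counts 100, 205, and 900 and minimum cycles 1, 2, and 1, the frequency estimate is 100 (in sampling units), and the third instruction comes out at 9 cycles per execution, which marks it as the one that stalls.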
Overview
Why profile? Applications -> What to profile? Algorithms
How to profile? Infrastructure/tools
Future directions for profiling
Binary Modification
EEL [Larus/Schnarr]
  Sparc/Solaris
  Formed basis for
    QPT: edge profiling and tracing
    PP: path profiling
    Wisconsin Wind Tunnel Tools
ATOM [Srivastava et al.]
  Alpha
  Simpler instrumentation interface than EEL
  Many tools
Vulcan [Srivastava et al.]
  x86/IA-64/MSIL
  Microsoft performance, coverage tools, code reorganization
DynInst [Hollingsworth]
  multiple platform, dynamic instrumentation
Performance Workbenches
SGI
JInsight [IBM research]
  sophisticated profiling for Java
  nice user interface
Paradyn [Miller, et al.]
  parallel performance measurement
Simulators
Shade [Cmelik/Keppel]
SimOS [Rosenblum, et al.]
SimpleScalar [Austin]
On-line cache simulation [Lebeck]
Much other work
Profiling of concurrent/distributed programs
Profiling of functional languages
Static estimation of profiles (static branch prediction)
Formalization of profiling
Profiling/garbage collection
Future Work
Profiling distributed computation
Worldwide profiling
Debugging with profiles