Program Profiling: Applications, Algorithms and Tools

Thomas Ball, Microsoft Research

May 2001

Overview

Why profile? Applications

What to profile? Algorithms

How to profile? Infrastructure/Tools

Future directions for profiling

Why Profile?

"If a given portion of a program has no observable effects, then you have no way of knowing if it is executing, if it has finished, if it got part way through and then stopped, or if it produced 'the right answer.' Programmers nearly always must rely on highly indirect measures to determine what happens when their programs execute. This is one reason why debugging is so difficult."

[Digital Woes, Lauren Ruth Wiener, 1993, Addison-Wesley]

Example I: Mystery Code

#include <stdio.h>main(t,_,a)char *a;{return!0<t?t<3?main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l+,/n{n+,/+#n+,/#\;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;# \){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \}'+}##(!!/"):t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1):0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);}

Running the Program

On the first day of Christmas my true love gave to me
a partridge in a pear tree.

On the second day of Christmas my true love gave to me
two turtle doves
and a partridge in a pear tree.

...

On the twelfth day of Christmas my true love gave to me
twelve drummers drumming, eleven pipers piping, ten lords a-leaping,
nine ladies dancing, eight maids a-milking, seven swans a-swimming,
six geese a-laying, five gold rings;
four calling birds, three french hens, two turtle doves
and a partridge in a pear tree.

Pretty Printed Code

#include <stdio.h>
main(t,_,a) char *a;
{
  if ((!0) < t) {
    if (t < 3) main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a));
    if (t < _ ) main(t+1,_,a);
    if (main(-94,-27+t,a)) {
      if (t==2 ) {
        if ( _ < 13 ) { return main(2,_+1,"%s %d %d\n");}
        else { return 9;}
      } else
        return 16;
    } else return 0;

...

PP Path Profiling Tool

Instruments Sparc/Solaris executables

Records intraprocedural paths, based on EEL instrumentation technology

Usage:
  pp a.out          // produces a.out.pp
  a.out.pp          // produces a.out.Paths
  pp_stats a.out    // produces path statistics from a.out.Paths file

Feb. 1997 8

How often does a control-flow path execute?

Levels of profiling:
  blocks: statements & lines
  edges: branches & blocks
  paths: sequence of edges & blocks

Example: Path Profiling

[Figure: control-flow graph with blocks A, B, C, D, E, F, annotated with execution frequencies 343, 400 and 57]

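Naive Path Profiling

Every block appends its own name to a buffer: put(“A”), put(“B”), put(“C”), put(“D”), put(“E”)

The last block records the finished path: put(“F”); record_path();

[Figure: control-flow graph with blocks A, B, C, D, E, F; buffer contents after one execution: A B D F]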

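Efficient Path Profiling

[Figure: control-flow graph with blocks A, B, C, D, E, F; selected edges carry the instrumentation r = 0, r = 2, r = 4 and r += 1, and the end of the path executes count[r]++]

Path      Encoding
ACDF      0
ACDEF     1
ABCDF     2
ABCDEF    3
ABDF      4
ABDEF     5

[Ball/Larus, MICRO 96]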
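The encoding in the table can be reproduced with a short sketch of the basic path-numbering step: visit the acyclic control-flow graph from the exit upwards, and give each outgoing edge of a block the number of paths covered by the successors listed before it. The successor order in `SUCC` is an assumption chosen to match the table, the function names are illustrative, and the sketch assigns one value per edge rather than the optimized placement (r = 0, r = 2, r = 4, r += 1) shown on the slide.

```python
def number_paths(succ, exit_node):
    """Assign a value to every edge so that each entry-to-exit path has a unique sum."""
    num_paths = {}  # block -> number of paths from that block to the exit
    edge_val = {}   # (src, dst) -> increment applied when the edge is taken

    def visit(v):
        if v in num_paths:
            return num_paths[v]
        if v == exit_node:
            num_paths[v] = 1
            return 1
        total = 0
        for w in succ[v]:
            edge_val[(v, w)] = total  # paths through earlier successors come first
            total += visit(w)
        num_paths[v] = total
        return total

    for v in succ:
        visit(v)
    return edge_val, num_paths


def path_sum(path, edge_val):
    """Sum the edge values along a path written as a string of block names."""
    return sum(edge_val[(a, b)] for a, b in zip(path, path[1:]))


# Assumed successor order; it reproduces the encoding table of the slide.
SUCC = {"A": ["C", "B"], "B": ["C", "D"], "C": ["D"],
        "D": ["F", "E"], "E": ["F"], "F": []}

EDGE_VAL, NUM_PATHS = number_paths(SUCC, "F")
ENCODING = {p: path_sum(p, EDGE_VAL)
            for p in ["ACDF", "ACDEF", "ABCDF", "ABCDEF", "ABDF", "ABDEF"]}
```

At run time the profiler only adds these edge values into a register r and increments count[r] at the end of the path, which is what makes the scheme cheaper than buffering block names.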

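Path Regeneration

Given path sum P, which path produced it?

[Figure: node v with successors w1, w2, w3, which have n1, n2, n3 paths to Exit; the edges from v to w1, w2, w3 carry the values 0, n1, n1+n2]

[Figure: control-flow graph with blocks A, B, C, D, E, F and edge values 1, 4, 2; the remaining sum along the regenerated path goes P = 3, P = 3, P = 1, P = 1, P = 0, P = 0]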
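Regeneration can be sketched as a walk from the entry block: at each block take the outgoing edge with the largest value that does not exceed the remaining sum, then subtract that value. The edge values in `EDGE_VAL` are an assumption (one value per edge, consistent with the path encoding table for the A..F example, not the edge values drawn in the figure), and `regenerate_path` is an illustrative name.

```python
# Assumed per-edge values for the A..F example graph.
EDGE_VAL = {
    ("A", "C"): 0, ("A", "B"): 2,
    ("B", "C"): 0, ("B", "D"): 2,
    ("C", "D"): 0,
    ("D", "F"): 0, ("D", "E"): 1,
    ("E", "F"): 0,
}


def regenerate_path(p, entry, exit_node, edge_val):
    """Return the path (as a string of block names) whose edge values sum to p."""
    path = [entry]
    v = entry
    while v != exit_node:
        # Outgoing edges of v that still fit into the remaining sum.
        fitting = [(val, dst) for (src, dst), val in edge_val.items()
                   if src == v and val <= p]
        val, v = max(fitting)  # edge values out of one block are distinct
        p -= val
        path.append(v)
    return "".join(path)


ALL_PATHS = [regenerate_path(p, "A", "F", EDGE_VAL) for p in range(6)]
```

Because the values leaving a block are 0, n1, n1+n2, ..., the largest value not exceeding P identifies the successor whose range of path sums contains P.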

PP Run-time Overhead

[Figure: bar chart of normalized execution time (axis 0 to 2) per benchmark, comparing PP and QPT2; bars are annotated with overhead percentages and % Hash]

o 12 days of Christmas
o 66 occurrences of non partridge-in-a-pear-tree gifts
o 114 strings printed
o 2358 characters printed (wc -c)

What can we learn from a profile?

Profile partitions program into low and high frequency clusters of paths
  Number of verses vs. string searching

Profile identifies paths related to frequencies seen in analysis of program output
  Printing a string or character

Inefficiencies pop out
  hidden $O(n^2)$ algorithms

Example II: Profiling Bebop

Bebop performs reachability analysis of boolean programs

Uses symbolic version of [Reps-Horwitz-Sagiv, POPL’95] interprocedural data flow analysis
  Explicit representation of control flow
  Implicit representation of reachable states via BDDs

Complexity of algorithm is $O(E \cdot 2^{n})$
  E = size of interprocedural control flow graph
  n = max. number of variables in the scope of any label

Bebop

Exploits procedural abstraction
  number of globals bounded by g
  number of locals bounded by h
  $O(E \cdot 2^{g+h}) = O(E)$

Expect space usage and time for model checking to be linear in size of program

decl g;

void main()
begin
  level1();
  level1();
  if(!g) then
    reach: skip;
  else
    skip;
  fi
end

void level<i>()
begin
  decl a,b,c;
  if (g) then
    while(!a|!b|!c) do
      if (!a) then
        a := 1;
      elsif (!b) then
        a,b := 0,1;
      elsif (!c) then
        a,b,c := 0,0,1;
      else
        skip;
      fi
    od
  else
    <stmt>;
    <stmt>;
  fi
  g := !g;
end

Simple Profiling

Memory usage
  BDD libraries report on peak space usage and many other useful statistics

Wall time
  cygwin “time bebop …”

Visual Studio profiler

IceCap
  internal Microsoft profiling

Peak Live BDD Nodes for T(N)

[Figure: peak space for T(N), in live BDD nodes (axis 0 to 250000), plotted against N (axis 0 to 1000)]

Running time for T(N)

[Figure: running time for T(N) in seconds (axis 0 to 300), plotted against N (axis 0 to 1000); one curve each for CU and CMU]

A Lesson: Profiling is critical when reusing code

Eliminated various stupidities in our code
  still had $O(n^2)$ time behavior!

Profiling narrowed cause down to BDD libraries
  assume small number of “managed” BDD variables
  BDD operations generally are O(size of BDD)
  But, in both CMU and CU, some BDD operations loop through all managed variables, whether or not they appear in a BDD!

Overview

Why profile? Applications

What to profile?

How to profile? Infrastructure/tools

Future directions for profiling

Applications -> What to profile?

Control-flow profiles
  Trace scheduling [Ellis]
  Code positioning [Pettis/Hansen]
  Improving dataflow analysis [Ammons/Larus]

Value profiles
  superscalar architecture [Sodani/Sohi]
  method specialization [Dean, et al.]

Address profiles
  Improving D-cache performance [Calder, et al.]

Communication profiles
  Component placement [Hunt/Scott]

Control-flow Profiles and Optimization

Two main ideas:

Optimize the hot paths (procedures, etc.)
  superblocks
    VLIW
    classic compiler optimizations
    dataflow analysis

Separate the hot from the cold
  affinity graph
  temporal relationship graph

Trace Scheduling for VLIW

VLIW: Very Long Instruction Word
  execute multiple instructions in single step
  need large basic blocks to fully utilize provided “width”

Problem: conditional branches

Solution:
  use edge/branch profiles to form “superblocks”
  schedule instructions within superblock
  generate compensation code to fix up state when prediction is wrong

early form of “superscalar+speculation”

complexity pushed to compiler (IA-64)

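Trace Scheduling: Code Motion Across Basic Blocks

[Figure: two versions of a control-flow graph. Before: a = b + c precedes the branch if x > 10, whose successors contain d = a - 3, f = a * 3 and g = d + 2. After: the branch if x > 10 comes first and a = b + c appears twice, once on each side of the branch]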

Trace Scheduling Rules

If a trace op. moves below a conditional jump then place copy on off-trace edge of jump

A trace op. that writes to x can’t move above a conditional jump if x is live on off-trace edge of jump

etc.

Chang, Mahlke, Hwu

IMPACT compiler

Superblock formation
  based on edge profiles and greedy algorithm

Tail duplication
  eliminate control-flow merges into middle of superblock

Classic compiler optimizations
  constant prop., CSE, loop induction vars., etc.

Superblock formation

[Figure: control-flow graph with blocks A, B, C, D, E, F; edges are annotated with profile counts 100, 90, 10, 1 and 0]

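Tail Duplication

[Figure: the same control-flow graph after tail duplication, with a copy F’ of block F; the counts on the edges affected by the duplication become ranges: 0-1, 9-10, 89-90]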

Optimizations

x=c

y=x

z=x/a

Local opts.
  const/copy propagation
  CSE
  redundant load/store removal

Dead code removal

Loop opts.
  hoist loop-invariant code
    don’t hoist past conditional if exception possible
  induction var. elimination

Pettis/Hansen: Profile Guided Code Positioning

Goal
  reduce working set size, TLB misses and I-cache misses

Three techniques
  procedure positioning
  basic block positioning
  procedure splitting

Procedure Positioning

Profile calls between procedures via link-time insertion of monitoring code

Construct undirected, weighted call graph

Greedy algorithm to merge nodes
  “closest is best” strategy
  if P calls Q frequently, want P and Q to reside close to one another in executable

Merge until one node left

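Example

[Figure: undirected, weighted call graph over the procedures A, B, C, D, E, F, G, H, shown after successive greedy merges. First A and D are merged into the node A,D; then C and F are merged into the node C,F. When the two merged nodes are joined, the candidate layouts are A-D-C-F, A-D-F-C, D-A-C-F and D-A-F-C; the original graph is shown again next to these candidates.]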
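The greedy merge can be sketched as follows. Each procedure starts as its own chain; the two chains connected by the heaviest total weight are merged, and among the four ways of joining them the sketch keeps the one whose touching procedures have the largest weight in the original graph. The call counts in `CALLS` are hypothetical (the weights of the slide's figure are not recoverable here), and `position_procedures` is an illustrative name, not an interface from the original paper.

```python
from itertools import combinations


def position_procedures(call_counts):
    """Greedy 'closest is best' layout from undirected call counts."""
    weight = {}
    for (p, q), n in call_counts.items():
        if p != q:  # ignore recursive calls
            key = frozenset((p, q))
            weight[key] = weight.get(key, 0) + n

    def orig(a, b):
        """Weight between two procedures in the original graph."""
        return weight.get(frozenset((a, b)), 0)

    def between(x, y):
        """Weight between two chains: sum over all pairs (coalesced edge)."""
        return sum(orig(a, b) for a in x for b in y)

    chains = [[p] for p in sorted({p for key in weight for p in key})]
    while len(chains) > 1:
        x, y = max(combinations(chains, 2), key=lambda pair: between(*pair))
        if between(x, y) == 0:
            break  # remaining chains never call each other
        k = len(x)
        candidates = [x + y, x + y[::-1], x[::-1] + y, x[::-1] + y[::-1]]
        # Prefer the join whose two touching procedures call each other most.
        merged = max(candidates, key=lambda c: orig(c[k - 1], c[k]))
        chains.remove(x)
        chains.remove(y)
        chains.append(merged)
    return [p for chain in chains for p in chain]


# Hypothetical call counts, for illustration only.
CALLS = {("A", "D"): 10, ("C", "F"): 8, ("D", "F"): 4, ("D", "C"): 2, ("A", "C"): 1}
LAYOUT = position_procedures(CALLS)
```

With these counts A,D and C,F are merged first, and the final join picks A-D-F-C because D and F are the most strongly connected pair of endpoints.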

Basic Block Positioning

Separate hot blocks from cold blocks
  edge/branch profiles
  reorganize blocks so that “normal” control-flow is straight-line code

Create chains (superblocks)
  consider edges in descending order of frequency, merging chains when possible
  start with procedure entry, create chains by considering outgoing edge with highest frequency
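A minimal sketch of the first chain-building strategy (edges in descending order of frequency): an edge joins two chains only when its source is the last block of one chain and its target is the first block of another, so the hot edge becomes a fall-through. The edge counts in `EDGE_COUNTS` are hypothetical and `build_chains` is an illustrative name.

```python
def build_chains(edge_counts):
    """Merge blocks into chains by visiting edges from hottest to coldest."""
    chain_of = {}  # block -> the chain (list of blocks) that contains it
    for src, dst in edge_counts:
        for block in (src, dst):
            chain_of.setdefault(block, [block])

    hottest_first = sorted(edge_counts.items(), key=lambda kv: -kv[1])
    for (src, dst), _count in hottest_first:
        head_chain, tail_chain = chain_of[src], chain_of[dst]
        # Merge only if src ends its chain and dst starts a different chain.
        if (head_chain is not tail_chain
                and head_chain[-1] == src and tail_chain[0] == dst):
            head_chain.extend(tail_chain)
            for block in tail_chain:
                chain_of[block] = head_chain

    chains, seen = [], set()
    for chain in chain_of.values():
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)
    return chains


# Hypothetical edge profile of a diamond-shaped procedure.
EDGE_COUNTS = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
CHAINS = build_chains(EDGE_COUNTS)
```

The hot path A, B, D ends up as one straight-line chain and the cold block C is left in a chain of its own, which can then be placed away from the hot code.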

Dataflow Analysis

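Data flow functions

Composition of functions along a path

Merge operators to combine results from multiple paths

[Figure: a path whose edges carry the functions f, g and h, starting from the initial value 0; the result along the path is $h \circ g \circ f \circ 0$]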

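Restructuring for Path-sensitive Data Flow

[Figure: control-flow graphs over the blocks A, B, C, D, E, F, showing the original graph, a path through C and E, and the restructured graph]

[Ammons, Larus]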

Ammons/Larus Algorithm

Duplicate hot paths to eliminate merges into middle of hot paths
  as with superblock optimizations

Perform traditional DFA on new CFG

Compact the CFG
  merge duplicated CFG nodes that have equivalent dataflow results


Control-flow profiles

Value profiles
  superscalar architecture [Sodani/Sohi]
  method specialization [Dean, et al.]

Address profiles

Communication profiles

Value Profiles

Observation
  many (static) instructions compute the same value over and over again!

Why?
  regularity in input data
  repeated traversal of structures

Applications
  architecture: cache results in buffer, to eliminate redundant work
  compilers: partially evaluate code with respect to commonly occurring constants

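Example

#include <stdbool.h>
#include <stddef.h>

/* List node with the two fields the search below relies on. */
typedef struct list { int val; struct list* next; } list;

/* Linear search: every call walks the same nodes from the head. */
bool search(list* l, int i) {
  while (l != NULL) {
    if (l->val == i) return true;
    l = l->next;
  }
  return false;
}

/* Two searches over the same list repeat most of the loads. */
s4 = search(l,4);
s6 = search(l,6);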
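To make the redundancy in this example measurable, the following sketch records the value produced at one instrumented site and reports how many dynamic instances produced a value already seen at that site. `ValueProfiler`, the site label and the list contents are assumptions for illustration; a real value profiler instruments machine instructions rather than source-level loops.

```python
from collections import Counter, defaultdict


class ValueProfiler:
    """Keeps a histogram of the values observed at each instrumented site."""

    def __init__(self):
        self.histograms = defaultdict(Counter)

    def record(self, site, value):
        self.histograms[site][value] += 1
        return value

    def repeat_fraction(self, site):
        """Fraction of dynamic instances whose value was seen before at this site."""
        hist = self.histograms[site]
        total = sum(hist.values())
        return (total - len(hist)) / total


def search(values, key, profiler):
    """Linear search that reports every loaded element to the profiler."""
    for v in values:
        if profiler.record("load l->val", v) == key:
            return True
    return False


profiler = ValueProfiler()
data = [1, 2, 3, 4, 5, 6]  # hypothetical list contents
s4 = search(data, 4, profiler)
s6 = search(data, 6, profiler)
REPEATS = profiler.repeat_fraction("load l->val")
```

The first search loads four elements and the second loads six, so ten loads produce six distinct values: four of the ten loads repeat a value the same instruction had already produced.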

Some Numbers on Redundancy

[Sodani-Sohi, profile produced with SimpleScalar simulator]

Selective Specialization for OO Languages [Dean/Chambers/Grove]

Naïve customization
  given method m accepting argument of class A, superclass of B, superclass of C
  generate m w.r.t. A, B and C
  code explosion

Specialization
  compile a method multiple times, based on value/dynamic type of commonly passed parameters
  use profiles to address cost/benefit

Technique

Collects gprof-style information
  weighted call graph
  node = message send
  edge = actual method receiver

Algorithm focuses on high-weight sends
  dynamic dispatch
  actual “pass-through” to formal


Control-flow profiles

Value profiles

Address profiles

Communication profiles

Address Profiles

Want to change location of objects to minimize cache misses [Calder et al.]

Addresses change with inputs

How to name objects?
  global variables
  stack variables
  heap objects
    address of malloc site XOR’d with a few return addresses on stack [Barrett/Zorn, Lebeck/Wood]

Temporal Relationship Graph [Gloy, et al.]

Undirected, weighted graph
  nodes = objects
  edge (v,n,w) = n cache misses if v and w were mapped to same cache location

[Figure: a reference trace over the objects o1, o2, o3 is processed through a queue of recently referenced objects; the result is a TRG over o1, o2, o3 with edge weights 1 and 2]
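One way to build such a graph from a trace is sketched below, as a simplification of the queue-based idea: when an object is referenced again, every object referenced since its previous reference could have evicted it, so the corresponding edges are incremented. The trace, the queue bound `max_queue` and the function name are assumptions for illustration and are not taken from the figure.

```python
from collections import defaultdict


def build_trg(trace, max_queue=8):
    """Build temporal relationship graph weights from a reference trace."""
    weights = defaultdict(int)  # frozenset({a, b}) -> potential conflict count
    queue = []                  # most recently referenced object first
    for obj in trace:
        if obj in queue:
            pos = queue.index(obj)
            # Objects touched since the last reference to obj may conflict with it.
            for other in queue[:pos]:
                weights[frozenset((obj, other))] += 1
            queue.pop(pos)
        queue.insert(0, obj)
        del queue[max_queue:]   # forget objects that were not referenced recently
    return dict(weights)


# Hypothetical trace, for illustration only.
TRG = build_trg(["o1", "o2", "o3", "o2", "o3", "o1"])
```

For this trace the edge between o2 and o3 gets weight 2 and the two edges involving o1 get weight 1 each.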

Object Placement Algorithm

Similar in spirit to [Pettis/Hansen]

Visit edges in TRG in decreasing frequency
  determine placement of objects
    minimize conflict misses based on placement of previously placed objects
    uses original TRG
  coalesce nodes, sum edge weights

Run-time optimization

Customized malloc
  computes “name” of malloc site
    malloc address XOR 4 return addresses
  allocation bins to segregate objects with different names
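A minimal sketch of the naming scheme, assuming the addresses are available as plain integers: the name is the address of the malloc call site XOR'd with the first four return addresses on the stack, and objects are grouped into bins by that name. The addresses below are made up, and `allocation_site_name` and `BINS` are illustrative names.

```python
from collections import defaultdict
from functools import reduce
from operator import xor


def allocation_site_name(malloc_site, return_addresses, depth=4):
    """XOR the malloc call site with the top `depth` return addresses."""
    return reduce(xor, return_addresses[:depth], malloc_site)


# Hypothetical addresses: one malloc site reached through one call stack.
STACK = [0x0100, 0x0020, 0x0003, 0x4000, 0x8000]
NAME = allocation_site_name(0x1000, STACK)

# Objects with the same name share a bin; other names get separate bins.
BINS = defaultdict(list)
BINS[NAME].append("object-1")
BINS[NAME].append("object-2")
BINS[allocation_site_name(0x1000, [0x0200])].append("object-3")
```

Only the first four return addresses contribute, so the fifth entry of `STACK` does not change the name.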


Control-flow profiles

Value profiles

Address profiles

Communication profiles

Component Placement [Hunt/Scott]

COM
  binary standard for interoperation
  forms basis of Microsoft products

DCOM
  transparent, distributed COM components
  based on RPC
  deep-copy semantics

Goal:
  take an application based on COM and split into client/server using DCOM

Coign

Binary rewriting to insert measurement code at COM interfaces
  measures amount of data transmitted between COM components (if they were distributed)

Analysis
  profile data forms communication graph
  component constraints
  min-cut algorithm separates components into client & server
  harder for >2 machines
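The two-machine split can be illustrated on a tiny communication graph. The sketch below does not implement a min-cut algorithm; it enumerates every placement of the unconstrained components and keeps the one with the least traffic crossing the client/server boundary, which gives the same answer on small inputs but takes exponential time. Component names, traffic volumes and the function name are hypothetical.

```python
from itertools import product


def place_components(traffic, pinned):
    """Exhaustively find the client/server split with the least cross traffic.

    traffic: {(a, b): bytes exchanged between components a and b}
    pinned:  {component: "client" or "server"} placement constraints
    """
    components = sorted({c for edge in traffic for c in edge} | set(pinned))
    free = [c for c in components if c not in pinned]
    best_cost, best_placement = None, None
    for sides in product(("client", "server"), repeat=len(free)):
        placement = {**pinned, **dict(zip(free, sides))}
        cost = sum(n for (a, b), n in traffic.items()
                   if placement[a] != placement[b])
        if best_cost is None or cost < best_cost:
            best_cost, best_placement = cost, placement
    return best_cost, best_placement


# Hypothetical profile: bytes exchanged between four components.
TRAFFIC = {("gui", "doc"): 50, ("doc", "store"): 5,
           ("store", "db"): 80, ("gui", "store"): 2}
# Constraints: the GUI must stay on the client, the database on the server.
COST, PLACEMENT = place_components(TRAFFIC, {"gui": "client", "db": "server"})
```

Here the cheapest cut separates doc from store: only the 5 + 2 units of traffic between the two sides cross the network.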

Overview

Why profile? Applications -> What to profile? Algorithms

How to profile? Infrastructure/tools

Future directions for profiling

How to Profile?

Main issues
  precision vs. cost
  subsequent data analysis (inferences)
  perturbation

Instrumentation
  accurate frequency, moderate overhead
  theory of spanning trees and flow preservation
  low overhead path profiling [Bala/Duesterwald]

Sampling
  very low overhead
  Digital Continuous Profiling Infrastructure

Other techniques
  Run-time patching
  Hardware support
  Simulation

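Edge Profiling via Spanning Trees

[Figure: control-flow graph with blocks A, B, C, D, E, F, shown without instrumentation and with the counters ac++, bc++, bd++ and de++ placed on edges that are not part of the spanning tree]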

Recovering Missing Counts via Kirchhoff’s Law

Known counts
  F(AC) = ac
  F(BC) = bc
  F(BD) = bd
  F(DE) = de

Inferred counts
  F(CD) = ac+bc
  F(AB) = bc+bd
  F(FA) = F(AB)+ac
  F(DF) = F(CD)+bd-de
  F(EF) = de

Bottom-up pass over spanning tree
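The inference can be sketched with flow conservation alone: whenever a block has exactly one incident edge with an unknown count, that count is the difference between the known inflow and the known outflow. The edge list matches the A..F example, including the edge from F back to A; the measured values are hypothetical, and `recover_counts` is an illustrative name.

```python
def recover_counts(edges, measured):
    """Infer counts of uninstrumented edges from flow conservation at each block."""
    counts = dict(measured)
    unknown = [e for e in edges if e not in counts]
    nodes = {n for e in edges for n in e}
    while unknown:
        progress = False
        for n in nodes:
            missing = [e for e in unknown if n in e]
            if len(missing) != 1:
                continue  # zero unknowns, or too many to solve at this block yet
            edge = missing[0]
            inflow = sum(counts[e] for e in edges if e[1] == n and e in counts)
            outflow = sum(counts[e] for e in edges if e[0] == n and e in counts)
            # Flow in equals flow out, so the single unknown is the difference.
            counts[edge] = outflow - inflow if edge[1] == n else inflow - outflow
            unknown.remove(edge)
            progress = True
        if not progress:
            raise ValueError("measured edges do not determine the remaining counts")
    return counts


EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F"), ("F", "A")]
# Hypothetical counter values for ac, bc, bd and de.
MEASURED = {("A", "C"): 3, ("B", "C"): 2, ("B", "D"): 4, ("D", "E"): 5}
COUNTS = recover_counts(EDGES, MEASURED)
```

With ac = 3, bc = 2, bd = 4 and de = 5 the inferred values agree with the formulas above, for example F(DF) = F(CD) + bd - de = 5 + 4 - 5 = 4.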

Other Applications of Spanning Tree

Basic block profiling
  with block counters, edge counters, …

Counting events

Path profiling

Static estimation of frequencies

Minimizing trace overhead

Path Profiling in Dynamo: Less = More

Run-time optimization
  On-line vs. off-line profiling

Goal
  predict what will be a hot path with little overhead

Technique
  profile block at head of path
  when head becomes hot, use incremental instrumentation to find hot path

Works very well
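The technique can be sketched as a counter per path head: nothing but the head is counted until it crosses a threshold, and the path that executes at that moment is recorded as the predicted hot path. This is a simplified model over already-collected block sequences; the threshold value, the class name and the example paths are assumptions, and the real system instruments running code incrementally.

```python
from collections import Counter


class HotPathSelector:
    """Counts executions of path heads and records a path once its head is hot."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical hotness threshold
        self.head_counts = Counter()
        self.hot_paths = {}         # head block -> predicted hot path

    def observe(self, path):
        """path: blocks executed from a path head to the end of that path."""
        head = path[0]
        if head in self.hot_paths:
            return  # already predicted; no further profiling for this head
        self.head_counts[head] += 1
        if self.head_counts[head] >= self.threshold:
            # The head just became hot: keep the path that is executing now.
            self.hot_paths[head] = tuple(path)


selector = HotPathSelector(threshold=3)
for executed in (["A", "B", "D"], ["A", "B", "D"], ["A", "B", "D"], ["A", "C", "D"]):
    selector.observe(executed)
```

Only one counter per head is maintained, which is the sense in which profiling less information costs less overhead than counting every path.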

Profiling by Sampling: DCPI

Modified (Alpha) kernel
  device driver handles 1500 interrupts/sec.

low overhead sampling
  periodically samples PC
  1-3% of CPU
  “moderate” amount of disk storage

records performance counters
  I-cache, D-cache, cycles, branch mispredictions, …

Lots of careful engineering to reduce time spent (space used) per sample

Data Analysis

What does large sample count for a PC mean?
  high frequency
  lots of stalls
    branch mispredict
    I-cache, D-cache miss
  static vs. dynamic?

DCPI generates for each instruction
  frequency
  CPI
  set of likely “culprits”

Estimating Frequency and CPI

$F_i$: frequency of instruction i

$C_i$: average number of cycles instruction i spent at head of instruction queue

$S_i$: sample count for instruction i

Approach
  $S_i \approx F_i \cdot C_i$
  find $F_i$, then obtain $C_i$

Estimating $F_i$

Identify basic blocks that are guaranteed to execute same number of times
  cycle-equivalence algorithm

Estimate frequency for blocks in same class

Propagate frequencies using spanning tree

Estimate $M_i$

Estimate $M_i$, the min. # of cycles instruction i spends at head of issue queue
  based on scheduling basic blocks according to processor model
  assumes some instruction in class(i) does not incur stall
  $F_i \approx S_i / M_i$

Problems
  no/few samples in class(i)
  multiple control predecessors
  dynamic stalls
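The estimate can be illustrated with a simplified sketch for one equivalence class: the sample-to-minimum-cycles ratio is computed for every instruction, the smallest ratio is taken as the class frequency (on the assumption that this instruction never stalled), and the cycles per instruction follow from dividing each sample count by that frequency. Taking the minimum is a simplification chosen here, the numbers are hypothetical, and the frequency is expressed in units of samples.

```python
def estimate_frequency_and_cpi(samples, min_cycles):
    """Estimate the class frequency F and the per-instruction cycle counts C.

    samples[i]    = S_i, sample count of instruction i
    min_cycles[i] = M_i, minimum cycles at the head of the issue queue
    """
    ratios = [s / m for s, m in zip(samples, min_cycles) if s > 0 and m > 0]
    if not ratios:
        raise ValueError("no usable samples in this equivalence class")
    # Assumes the instruction with the smallest ratio incurred no stall.
    freq = min(ratios)
    cpi = [s / freq for s in samples]
    return freq, cpi


# Hypothetical class of three instructions executed equally often.
FREQ, CPI = estimate_frequency_and_cpi(samples=[100, 200, 450],
                                       min_cycles=[1, 2, 1])
```

The third instruction has a minimum of one cycle but an estimated 4.5 cycles, so about 3.5 cycles per execution are attributed to stalls.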

Overview

Why profile? Applications -> What to profile? Algorithms

How to profile? Infrastructure/tools

Future directions for profiling

Binary Modification

EEL [Larus/Schnarr]
  Sparc/Solaris
  Formed basis for
    QPT: edge profiling and tracing
    PP: path profiling
    Wisconsin Wind Tunnel Tools

ATOM [Srivastava et al.]
  Alpha
  Simpler instrumentation interface than EEL
  Many tools

Vulcan [Srivastava et al.]
  x86/IA-64/MSIL
  Microsoft performance, coverage tools, code reorganization

DynInst [Hollingsworth]
  multiple platform, dynamic instrumentation

Performance Workbenches

SGI

JInsight [IBM research]
  sophisticated profiling for Java
  nice user interface

Paradyn [Miller, et al.]
  parallel performance measurement

Simulators

Shade [Cmelik/Keppel]

SimOS [Rosenblum, et al.]

SimpleScalar [Austin]

On-line cache simulation [Lebeck]

Much other work

Profiling of concurrent/distributed programs

Profiling of functional languages

Static estimation of profiles (static branch prediction)

Formalization of profiling

Profiling/garbage collection

Future Work

Profiling distributed computation

Worldwide profiling

Debugging with profiles
