Program Profiling: Applications, Algorithms and Tools Thomas Ball Microsoft Research May 2001


Page 1: Program Profiling: Applications, Algorithms and Tools

Program Profiling: Applications, Algorithms and Tools

Thomas Ball, Microsoft Research

May 2001

Page 2: Program Profiling: Applications, Algorithms and Tools

Overview

Why profile?
Applications
What to profile?
Algorithms
How to profile?
Infrastructure/Tools
Future directions for profiling

Page 3: Program Profiling: Applications, Algorithms and Tools

Why Profile?

"If a given portion of a program has no observable effects, then you have no way of knowing if it is executing, if it has finished, if it got part way through and then stopped, or if it produced 'the right answer.' Programmers nearly always must rely on highly indirect measures to determine what happens when their programs execute. This is one reason why debugging is so difficult."

[Digital Woes, Lauren Ruth Weiner, 1993, Addison-Wesley]

Page 4: Program Profiling: Applications, Algorithms and Tools

Example I: Mystery Code

#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l+,/n{n+,/+#n+,/#\;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;# \){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \}'+}##(!!/"):t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1):0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);}

Page 5: Program Profiling: Applications, Algorithms and Tools

Running the Program

On the first day of Christmas my true love gave to me
a partridge in a pear tree.

On the second day of Christmas my true love gave to me
two turtle doves
and a partridge in a pear tree.

...

On the twelfth day of Christmas my true love gave to me
twelve drummers drumming, eleven pipers piping, ten lords a-leaping,
nine ladies dancing, eight maids a-milking, seven swans a-swimming,
six geese a-laying, five gold rings;
four calling birds, three french hens, two turtle doves
and a partridge in a pear tree.

Page 6: Program Profiling: Applications, Algorithms and Tools

Pretty Printed Code

#include <stdio.h>
main(t,_,a) char *a;
{
  if ((!0) < t) {
    if (t < 3) main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a));
    if (t < _ ) main(t+1,_,a);
    if (main(-94,-27+t,a)) {
      if (t==2 ) {
        if ( _ < 13 ) { return main(2,_+1,"%s %d %d\n"); }
        else { return 9; }
      } else
        return 16;
    } else return 0;

...

Page 7: Program Profiling: Applications, Algorithms and Tools

PP Path Profiling Tool

Instruments Sparc/Solaris executables
  records intraprocedural paths
  based on EEL instrumentation technology

Usage:
  pp a.out          // produces a.out.pp
  a.out.pp          // produces a.out.Paths
  pp_stats a.out    // produces path statistics
                    // from a.out.Paths file

Page 8: Program Profiling: Applications, Algorithms and Tools

Feb. 1997 8

Example: Path Profiling

How often does a control-flow path execute?

Levels of profiling:
  blocks
    statements & lines
  edges
    branches & blocks
  paths
    sequence of edges & blocks

[Figure: control-flow graph with nodes A-F, annotated with the counts 400, 343 and 57]

Page 9: Program Profiling: Applications, Algorithms and Tools


Naive Path Profiling

[Figure: CFG with nodes A-F in which each block is instrumented with a put() call (put("A") ... put("F")) and block F also calls record_path(); the buffer holds the recorded path A B D F]

Page 10: Program Profiling: Applications, Algorithms and Tools


Efficient Path Profiling

[Figure: CFG with nodes A-F, instrumented with r = 0, r = 2, r = 4, r += 1 and count[r]++]

Path     Encoding
ACDF     0
ACDEF    1
ABCDF    2
ABCDEF   3
ABDF     4
ABDEF    5

[Ball/Larus, MICRO 96]
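The encoding table can be reproduced by giving every edge a value so that the values along each entry-to-exit path sum to a distinct number. The sketch below is one way to compute such values for the six-node example; the array layout, the successor order and the names `assign_edge_values` and `path_sum` are assumptions made for illustration, not taken from the PP tool.

```c
/* Hypothetical encoding of the six-node example CFG (A..F) as a DAG.
 * The successor order is an assumption chosen so that the resulting
 * sums match the slide's table; -1 marks "no successor". */
enum { A, B, C, D, E, F, NUM_NODES };
#define MAX_SUCC 2

static const int succ[NUM_NODES][MAX_SUCC] = {
    [A] = { C, B },
    [B] = { C, D },
    [C] = { D, -1 },
    [D] = { F, E },
    [E] = { F, -1 },
    [F] = { -1, -1 },
};

/* Nodes in reverse topological order (successors before predecessors). */
static const int rev_topo[NUM_NODES] = { F, E, D, C, B, A };

static int num_paths[NUM_NODES];          /* paths from node to exit */
static int edge_val[NUM_NODES][MAX_SUCC]; /* value assigned to each edge */

/* Assign edge values so that the sum along any entry-to-exit path is a
 * unique number in [0, num_paths[entry]). */
void assign_edge_values(void)
{
    for (int k = 0; k < NUM_NODES; k++) {
        int v = rev_topo[k];
        if (succ[v][0] < 0) {       /* exit node: one (empty) path */
            num_paths[v] = 1;
            continue;
        }
        num_paths[v] = 0;
        for (int i = 0; i < MAX_SUCC && succ[v][i] >= 0; i++) {
            /* Paths through earlier successors take the lower numbers. */
            edge_val[v][i] = num_paths[v];
            num_paths[v] += num_paths[succ[v][i]];
        }
    }
}

/* Sum the edge values along a path given as a sequence of nodes. */
int path_sum(const int *path, int len)
{
    int r = 0;
    for (int k = 0; k + 1 < len; k++) {
        int v = path[k];
        for (int i = 0; i < MAX_SUCC; i++)
            if (succ[v][i] == path[k + 1])
                r += edge_val[v][i];
    }
    return r;
}
```

After `assign_edge_values()`, the path A, B, D, F sums to 4 and A, C, D, E, F sums to 1, as in the table. An instrumented program would keep the running sum in a register r and execute count[r]++ once per path.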

Page 11: Program Profiling: Applications, Algorithms and Tools


Path Regeneration

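Given path sum P, which path produced it?

[Figure: a node v with outgoing edges to w1, w2 and w3 carrying the values 0, n1 and n1+n2, where n1, n2 and n3 label w1, w2 and w3 on the way to Exit; next to it, the CFG with nodes A-F with edge labels 1, 4 and 2, annotated with the remaining path sum at each step, from P = 3 down to P = 0]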
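Regeneration can be sketched as a walk from the entry node that, at each node, takes the outgoing edge with the largest value not exceeding the remaining sum. The edge values below are assumptions: they are one assignment that reproduces the encoding table (ACDF = 0 through ABDEF = 5), and `regenerate_path` is a hypothetical name.

```c
enum { A, B, C, D, E, F, NUM_NODES };
#define MAX_SUCC 2

/* Successors and edge values for the example CFG; -1 marks "no successor".
 * The values are assumed, chosen to match the encoding table. */
static const int succ[NUM_NODES][MAX_SUCC] = {
    [A] = { C, B }, [B] = { C, D }, [C] = { D, -1 },
    [D] = { F, E }, [E] = { F, -1 }, [F] = { -1, -1 },
};
static const int val[NUM_NODES][MAX_SUCC] = {
    [A] = { 0, 2 }, [B] = { 0, 2 }, [C] = { 0, 0 },
    [D] = { 0, 1 }, [E] = { 0, 0 }, [F] = { 0, 0 },
};

/* Rebuild the path whose edge values sum to p (0 <= p < 6), writing the
 * node letters into out, which must hold at least NUM_NODES + 1 chars. */
void regenerate_path(int p, char *out)
{
    int v = A, n = 0;
    out[n++] = (char)('A' + v);
    while (succ[v][0] >= 0) {
        /* Take the outgoing edge with the largest value not exceeding p. */
        int best = 0;
        for (int i = 1; i < MAX_SUCC && succ[v][i] >= 0; i++)
            if (val[v][i] <= p && val[v][i] >= val[v][best])
                best = i;
        p -= val[v][best];
        v = succ[v][best];
        out[n++] = (char)('A' + v);
    }
    out[n] = '\0';
}
```

With these values, `regenerate_path(3, buf)` leaves "ABCDEF" in buf, and `regenerate_path(4, buf)` leaves "ABDF".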

Page 12: Program Profiling: Applications, Algorithms and Tools


PP Run-time Overhead

[Figure: bar chart of normalized execution time (axis from 0 to 2) per benchmark, with series PP, QPT2 and % Hash]

Page 13: Program Profiling: Applications, Algorithms and Tools

o 12 days of Christmas
o 66 occurrences of non partridge-in-a-pear-tree gifts
o 114 strings printed
o 2358 characters printed (wc -c)

Page 14: Program Profiling: Applications, Algorithms and Tools

What can we learn from a profile?

Profile partitions program into low and high frequency clusters of paths
  Number of verses vs. string searching

Profile identifies paths related to frequencies seen in analysis of program output
  Printing a string or character

Inefficiencies pop out
  hidden O(n^2) algorithms

Page 15: Program Profiling: Applications, Algorithms and Tools

Example II: Profiling Bebop

Bebop performs reachability analysis of boolean programs

Uses symbolic version of [Reps-Horwitz-Sagiv, POPL’95] interprocedural data flow analysis
  Explicit representation of control flow
  Implicit representation of reachable states via BDDs

Complexity of algorithm is O(E * 2^n)
  E = size of interprocedural control flow graph
  n = max. number of variables in the scope of any label

Page 16: Program Profiling: Applications, Algorithms and Tools

Bebop

Exploits procedural abstraction
  number of globals bounded by g
  number of locals bounded by h
  O(E * 2^(g+h)) = O(E)

Expect space usage and time for model checking to be linear in size of program

Page 17: Program Profiling: Applications, Algorithms and Tools

decl g;

void main()
begin
  level1();
  level1();
  if(!g) then
    reach: skip;
  else
    skip;
  fi
end

void level<i>()
begin
  decl a,b,c;
  if (g) then
    while(!a|!b|!c) do
      if (!a) then
        a := 1;
      elsif (!b) then
        a,b := 0,1;
      elsif (!c) then
        a,b,c := 0,0,1;
      else
        skip;
      fi
    od
  else
    <stmt>; <stmt>;
  fi
  g := !g;
end

Page 18: Program Profiling: Applications, Algorithms and Tools

Simple Profiling

Memory usage
  BDD libraries report on peak space usage and many other useful statistics

Wall time
  cygwin “time bebop …”
  Visual Studio profiler
  IceCap
    internal Microsoft profiling

Page 19: Program Profiling: Applications, Algorithms and Tools

Peak Live BDD Nodes for T(N)

[Figure: plot of peak space for T(N), from 0 to 250000 live BDD nodes, against N from 0 to 1000]

Page 20: Program Profiling: Applications, Algorithms and Tools

Running time for T(N)

[Figure: plot of running time for T(N) in seconds, from 0 to 300, against N from 0 to 1000, with series CU and CMU]

Page 21: Program Profiling: Applications, Algorithms and Tools

A Lesson: Profiling is critical when reusing code

Eliminated various stupidities in our code
  still had O(n^2) time behavior!

Profiling narrowed cause down to BDD libraries
  assume small number of “managed” BDD variables

BDD operations generally are O(size of BDD)

But, in both CMU and CU some BDD operations loop through all managed variables, whether or not they appear in a BDD!

Page 22: Program Profiling: Applications, Algorithms and Tools

Overview

Why profile?
Applications
What to profile?
How to profile?
Infrastructure/tools
Future directions for profiling

Page 23: Program Profiling: Applications, Algorithms and Tools

Applications -> What to profile?

Control-flow profiles
  Trace scheduling [Ellis]
  Code positioning [Pettis/Hansen]
  Improving dataflow analysis [Ammons/Larus]

Value profiles
  superscalar architecture [Sodani/Sohi]
  method specialization [Dean, et al.]

Address profiles
  Improving D-cache performance [Calder, et al.]

Communication profiles
  Component placement [Hunt/Scott]

Page 24: Program Profiling: Applications, Algorithms and Tools

Control-flow Profiles and Optimization

Two main ideas:

Optimize the hot paths (procedures, etc.)
  superblocks
    VLIW
    classic compiler optimizations
    dataflow analysis

Separate the hot from the cold
  affinity graph
  temporal relationship graph

Page 25: Program Profiling: Applications, Algorithms and Tools

Trace Scheduling for VLIW

VLIW: Very Long Instruction Word
  execute multiple instructions in single step
  need large basic blocks to fully utilize provided “width”

Problem: conditional branches

Solution:
  use edge/branch profiles to form “superblocks”
  schedule instructions within superblock
  generate compensation code to fix up state when prediction is wrong

early form of “superscalar+speculation”
  complexity pushed to compiler (IA-64)

Page 26: Program Profiling: Applications, Algorithms and Tools

Trace Scheduling: Code Motion Across Basic Blocks

[Figure: code motion across basic blocks. Before: a = b + c; if x > 10; d = a - 3; f = a * 3; g = d + 2. After: if x > 10; d = a - 3; f = a * 3; g = d + 2, with a = b + c appearing twice.]

Page 27: Program Profiling: Applications, Algorithms and Tools

Trace Scheduling Rules

If a trace op. moves below a conditional jump then place copy on off-trace edge of jump

A trace op. that writes to x can’t move above a conditional jump if x is live on off-trace edge of jump

etc.

Page 28: Program Profiling: Applications, Algorithms and Tools

Chang, Mahlke, Hwu
IMPACT compiler

Superblock formation
  based on edge profiles and greedy algorithm

Tail duplication
  eliminate control-flow merges into middle of superblock

Classic compiler optimizations
  constant prop., CSE, loop induction vars.

Page 29: Program Profiling: Applications, Algorithms and Tools

Superblock formation

[Figure: CFG with nodes A-F annotated with execution counts (100, 90, 10, 1, 0)]

Page 30: Program Profiling: Applications, Algorithms and Tools

Tail Duplication

[Figure: the same CFG after tail duplication, with a copy F’ of F and counts given as ranges (89-90, 9-10, 0-1)]

Page 31: Program Profiling: Applications, Algorithms and Tools

Optimizations

x=c
y=x
z=x/a

Local opts.
  const/copy propagation
  CSE
  redundant load/store removal

Dead code removal

Loop opts.
  hoist loop. inv. code
    don’t hoist past conditional if exception possible
  induction var. elimination

Page 32: Program Profiling: Applications, Algorithms and Tools

Pettis/Hansen: Profile Guided Code Positioning

Goal
  reduce working set size, TLB misses and I-cache misses

Three techniques
  procedure positioning
  basic block positioning
  procedure splitting

Page 33: Program Profiling: Applications, Algorithms and Tools

Procedure Positioning

Profile calls between procedures via link-time insertion of monitoring code

Construct undirected, weighted call graph

Greedy algorithm to merge nodes
  “closest is best” strategy
  if P calls Q frequently, want P and Q to reside close to one another in executable

Merge until one node left
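The greedy merge can be sketched as follows. This is a simplified illustration under stated assumptions, not the Pettis/Hansen implementation: the graph is a small symmetric weight matrix, `NPROC` and `greedy_merge` are hypothetical names, and the sketch only records which groups are merged in which order. It does not choose how the procedures inside the two merged groups are laid out relative to each other.

```c
#define NPROC 4   /* hypothetical number of procedures, numbered 0..3 */

/* Greedily merge call-graph nodes, heaviest edge first.
 * w is a symmetric weight matrix and is modified in place.
 * order[] receives the (a, b) pairs in merge order, where group b is
 * merged into group a. Returns the number of merges performed. */
int greedy_merge(int w[NPROC][NPROC], int order[][2])
{
    int alive[NPROC];
    int merges = 0;
    for (int i = 0; i < NPROC; i++)
        alive[i] = 1;

    for (;;) {
        /* Find the heaviest edge between two groups that still exist. */
        int best = 0, a = -1, b = -1;
        for (int i = 0; i < NPROC; i++)
            for (int j = i + 1; j < NPROC; j++)
                if (alive[i] && alive[j] && w[i][j] > best) {
                    best = w[i][j];
                    a = i;
                    b = j;
                }
        if (a < 0)
            break;                 /* no edges left between groups */

        /* Merge b into a: edges to a common neighbour add up. */
        for (int k = 0; k < NPROC; k++) {
            if (k == a || k == b)
                continue;
            w[a][k] += w[b][k];
            w[k][a] = w[a][k];
        }
        alive[b] = 0;
        order[merges][0] = a;
        order[merges][1] = b;
        merges++;
    }
    return merges;
}
```

On a connected graph the loop stops after NPROC - 1 merges, which corresponds to "merge until one node left".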

Page 34: Program Profiling: Applications, Algorithms and Tools

Example

[Figure: undirected, weighted call graph over procedures A-H]

Page 35: Program Profiling: Applications, Algorithms and Tools

Example

[Figure: the call graph after merging A and D into a single node A,D]

Page 36: Program Profiling: Applications, Algorithms and Tools

Example

[Figure: the call graph after also merging C and F into a single node C,F, shown next to the original call graph]

Candidate orderings: A-D-C-F, A-D-F-C, D-A-C-F, D-A-F-C

Page 37: Program Profiling: Applications, Algorithms and Tools

Basic Block Positioning

Separate hot blocks from cold blocks
  edge/branch profiles
  reorganize blocks so that “normal” control-flow is straight-line code

Create chains (superblocks)
  consider edges in descending order of frequency, merging chains when possible
  start with procedure entry, create chains by considering outgoing edge with highest frequency
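The first chaining strategy (edges in descending order of frequency) can be sketched like this. The block numbering, the `struct edge` layout and the name `build_chains` are assumptions for illustration; the sketch joins two chains only when the edge runs from the tail of one to the head of another.

```c
#include <stdlib.h>

#define NBLOCKS 6  /* hypothetical basic blocks, numbered 0..5 */

struct edge { int src, dst, count; };

/* qsort comparator: highest count first. */
static int by_count_desc(const void *x, const void *y)
{
    const struct edge *a = x, *b = y;
    return (b->count > a->count) - (b->count < a->count);
}

/* Build chains of blocks. next[b] is the block laid out right after b
 * (or -1), prev[b] the block right before it (or -1). The edge array
 * is sorted in place. */
void build_chains(struct edge *edges, int nedges,
                  int next[NBLOCKS], int prev[NBLOCKS])
{
    for (int b = 0; b < NBLOCKS; b++)
        next[b] = prev[b] = -1;
    qsort(edges, (size_t)nedges, sizeof edges[0], by_count_desc);

    for (int i = 0; i < nedges; i++) {
        int s = edges[i].src, t = edges[i].dst;
        /* s must end a chain and t must start one. */
        if (next[s] != -1 || prev[t] != -1)
            continue;
        /* Do not close a cycle: if t's chain ends in s, skip the edge. */
        int tail = t;
        while (next[tail] != -1)
            tail = next[tail];
        if (tail == s)
            continue;
        next[s] = t;
        prev[t] = s;
    }
}
```

Blocks that stay unlinked form chains of length one; a later pass would decide how to order the chains themselves.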

Page 38: Program Profiling: Applications, Algorithms and Tools

Dataflow Analysis

Data flow functions

Composition of functions along a path

Merge operators to combine results from multiple paths

[Figure: a path whose edges carry the data flow functions f, g and h, labelled "h o g o f o 0"]

Page 39: Program Profiling: Applications, Algorithms and Tools

Restructuring for Path-sensitive Data Flow

[Figure: a CFG with nodes A-F before and after restructuring for path-sensitive data flow]

[Ammons, Larus]

Page 40: Program Profiling: Applications, Algorithms and Tools

Ammons/Larus Algorithm

Duplicate hot paths to eliminate merges into middle of hot paths
  as with superblock optimizations

Perform traditional DFA on new CFG

Compact the CFG
  merge duplicated CFG nodes that have equivalent dataflow results

Page 41: Program Profiling: Applications, Algorithms and Tools

Applications -> What to profile?

Control-flow profiles
Value profiles
  superscalar architecture [Sodani/Sohi]
  method specialization [Chambers, et al.]
Address profiles
Communication profiles

Page 42: Program Profiling: Applications, Algorithms and Tools

Value Profiles

Observation
  many (static) instructions compute the same value over and over again!

Why?
  regularity in input data
  repeated traversal of structures

Applications
  architecture: cache results in buffer, to eliminate redundant work
  compilers: partially evaluate code with respect to commonly occurring constants

Page 43: Program Profiling: Applications, Algorithms and Tools

Example

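s4 = search(l,4);
s6 = search(l,6);

#include <stdbool.h>
#include <stddef.h>

typedef struct list { int val; struct list *next; } list;

/* Returns true if some node of list l holds the value i. */
bool search(list* l, int i) {
  while (l != NULL) {
    if (l->val == i) return true;
    l = l->next;
  }
  return false;
}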

Page 44: Program Profiling: Applications, Algorithms and Tools

Some Numbers on Redundancy

[Sodani-Sohi, profile produced with SimpleScalar simulator]

Page 45: Program Profiling: Applications, Algorithms and Tools

Selective Specialization for OO Languages [Dean/Chambers/Grove]

Naïve customization
  given method m accepting argument of class A, superclass of B, superclass of C
  generate m w.r.t A, B and C
  code explosion

Specialization
  compile a method multiple times, based on value/dynamic type of commonly passed parameters
  use profiles to address cost/benefit

Page 46: Program Profiling: Applications, Algorithms and Tools

Technique

Collects gprof-style information
  weighted call graph
  node = message send
  edge = actual method receiver

Algorithm focuses on high-weight sends
  dynamic dispatch
  actual “pass-through” to formal

Page 47: Program Profiling: Applications, Algorithms and Tools

Applications -> What to profile?

Control-flow profiles Value profiles Address profiles Communication profiles

Page 48: Program Profiling: Applications, Algorithms and Tools

Address Profiles

Want to change location of objects to minimize cache misses [Calder et al.]

Addresses change with inputs

How to name objects?
  global variables
  stack variables
  heap objects
    address of malloc site XOR’d with a few return addresses on stack [Barrett/Zorn, Lebeck/Wood]
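The heap-object naming rule can be sketched as a small function. How the malloc site and the return addresses are obtained is platform specific and is not shown; `heap_object_name` and its parameters are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Name a heap object by XOR-ing the address of its malloc site with the
 * n innermost return addresses on the call stack. Objects allocated from
 * the same site through the same call chain get the same name on every
 * run, even though their heap addresses differ. */
uintptr_t heap_object_name(uintptr_t malloc_site,
                           const uintptr_t *return_addrs, size_t n)
{
    uintptr_t name = malloc_site;
    for (size_t i = 0; i < n; i++)
        name ^= return_addrs[i];
    return name;
}
```

A customized allocator could call this with a handful of return addresses and use the result to look up placement decisions made from an earlier profile.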

Page 49: Program Profiling: Applications, Algorithms and Tools

Temporal Relationship Graph [Gloy, et al.]

Undirected, weighted graph
  nodes = objects
  (v,n,w) = n cache misses if v and w were mapped to same cache location

[Figure: a trace of references to objects o1, o2 and o3 is processed with a queue of recently referenced objects, producing a weighted graph over o1, o2 and o3]
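One way to read the queue-based construction: keep the objects ordered by most recent reference, and when an object o is referenced again, every object that was referenced since o's previous reference gets its edge to o incremented, because it would have evicted o had both shared a cache location. The sketch below follows that reading; the queue bound, the object numbering and the name `trg_reference` are assumptions, not details from the slide.

```c
#define NOBJ 3   /* hypothetical objects o1..o3, numbered 0..2 */
#define QCAP 8   /* assumed bound on the queue length */

static int trg[NOBJ][NOBJ];  /* symmetric edge weights */
static int queue[QCAP];      /* queue[0] is the most recent object */
static int qlen;

/* Process one reference to object o from the trace. */
void trg_reference(int o)
{
    int pos = -1;
    for (int i = 0; i < qlen; i++)
        if (queue[i] == o) {
            pos = i;
            break;
        }

    if (pos >= 0) {
        /* Everything ahead of o was referenced since o's last use. */
        for (int i = 0; i < pos; i++) {
            trg[o][queue[i]]++;
            trg[queue[i]][o]++;
        }
    } else {
        if (qlen < QCAP)
            qlen++;
        pos = qlen - 1;   /* new slot, or overwrite the oldest entry */
    }

    /* Move o to the front of the queue. */
    for (int i = pos; i > 0; i--)
        queue[i] = queue[i - 1];
    queue[0] = o;
}
```

Feeding the hypothetical trace o1 o2 o3 o2 o3 o1 through `trg_reference` gives weight 2 on the edge between o2 and o3 and weight 1 on each edge touching o1.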

Page 50: Program Profiling: Applications, Algorithms and Tools

Object Placement Algorithm

Similar in spirit to [Pettis/Hansen]

Visit edges in TRG in decreasing frequency
  determine placement of objects
    minimize conflict misses based on placement of previously placed objects
    uses original TRG
  coalesce nodes, sum edge weights

Page 51: Program Profiling: Applications, Algorithms and Tools

Run-time optimization

Customized malloc
  computes “name” of malloc site
    malloc address
    4 return addresses
  allocation bins to segregate objects with different names

Page 52: Program Profiling: Applications, Algorithms and Tools

Applications -> What to profile?

Control-flow profiles Value profiles Address profiles Communication profiles

Page 53: Program Profiling: Applications, Algorithms and Tools

Component Placement [Hunt/Scott]

COM
  binary standard for interoperation
  forms basis of Microsoft products

DCOM
  transparent, distributed COM components
  based on RPC
  deep-copy semantics

Goal:
  take an application based on COM and split into client/server using DCOM

Page 54: Program Profiling: Applications, Algorithms and Tools

Coign

Binary rewriting to insert measurement code at COM interfaces
  measures amount of data transmitted between COM components (if they were distributed)

Analysis
  profile data forms communication graph
  component constraints
  min-cut algorithm separates components into client & server
  harder for >2 machines

Page 55: Program Profiling: Applications, Algorithms and Tools

Overview

Why profile?
Applications -> What to profile?
Algorithms
How to profile?
Infrastructure/tools
Future directions for profiling

Page 56: Program Profiling: Applications, Algorithms and Tools

How to Profile?

Main issues
  precision vs. cost
  subsequent data analysis (inferences)
  perturbation

Instrumentation
  accurate frequency, moderate overhead
  theory of spanning trees and flow preservation
  low overhead path profiling [Bala/Duesterwald]

Sampling
  very low overhead
  Digital Continuous Profiling Infrastructure

Other techniques
  Run-time patching
  Hardware support
  Simulation

Page 57: Program Profiling: Applications, Algorithms and Tools

Edge Profiling via Spanning Trees

[Figure: the CFG with nodes A-F shown twice; in the second copy the counters ac++, bc++, bd++ and de++ are placed on edges]

Page 58: Program Profiling: Applications, Algorithms and Tools

Recovering Missing Counts via Kirchhoff’s Law

Known counts
  F(AC) = ac
  F(BC) = bc
  F(BD) = bd
  F(DE) = de

Inferred counts
  F(CD) = ac+bc
  F(AB) = bc+bd
  F(FA) = F(AB)+ac
  F(DF) = F(CD)+bd-de
  F(EF) = de

Bottom-up pass over spanning tree

[Figure: the CFG with nodes A-F, with the counters ac++, bc++, bd++ and de++ on edges AC, BC, BD and DE]
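The inferred counts follow from flow conservation at each node (flow in equals flow out). A minimal sketch that applies the five equations in dependency order; `struct edge_counts` and `infer_counts` are hypothetical names introduced for illustration.

```c
struct edge_counts {
    long ac, bc, bd, de;       /* measured by the four counters */
    long cd, ab, fa, df, ef;   /* inferred by flow conservation */
};

/* Fill in the inferred counts from the measured ones. */
void infer_counts(struct edge_counts *f)
{
    f->cd = f->ac + f->bc;          /* flow into C equals flow out of C */
    f->ab = f->bc + f->bd;          /* flow into B equals flow out of B */
    f->fa = f->ab + f->ac;          /* flow into A equals flow out of A */
    f->df = f->cd + f->bd - f->de;  /* flow into D, minus the part to E */
    f->ef = f->de;                  /* flow into E equals flow out of E */
}
```

For example, with ac = 30, bc = 20, bd = 50 and de = 40 the inferred counts are cd = 50, ab = 70, fa = 100, df = 60 and ef = 40; the flow into F (df + ef) then equals fa, as conservation requires.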

Page 59: Program Profiling: Applications, Algorithms and Tools

Other Applications of Spanning Tree

Basic block profiling
  with block counters, edge counters, …
Counting events
Path profiling
Static estimation of frequencies
Minimizing trace overhead

Page 60: Program Profiling: Applications, Algorithms and Tools

Path Profiling in Dynamo: Less = More

Run-time optimization
  On-line vs. off-line profiling

Goal
  predict what will be a hot path with little overhead

Technique
  profile block at head of path
  when head becomes hot, use incremental instrumentation to find hot path

Works very well
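The head-counting step can be sketched as one counter per candidate path head. The threshold value, the table size and the name `head_executed` are assumptions; the slide gives no numbers, and the recording of the path itself is left out.

```c
#define NHEADS 16          /* hypothetical number of candidate path heads */
#define HOT_THRESHOLD 50   /* assumed threshold, not from the slide */

static unsigned head_count[NHEADS];

/* Call each time execution reaches a path head.
 * Returns 1 exactly once per head, when its counter reaches the
 * threshold; the caller would then start recording the path that
 * executes next. Returns 0 otherwise. */
int head_executed(int head)
{
    return ++head_count[head] == HOT_THRESHOLD;
}
```

The point of the design is that only heads pay a counting cost; the blocks inside a path are instrumented only after their head is already known to be hot.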

Page 61: Program Profiling: Applications, Algorithms and Tools

Profiling by Sampling: DCPI

Modified (Alpha) kernel
  device driver handles 1500 interrupts/sec.

low overhead sampling
  periodically samples PC
  1-3% of CPU
  “moderate” amount of disk storage

records performance counters
  I-cache, D-cache, cycles, branch mispredictions, …

Lots of careful engineering to reduce time spent (space used) per sample

Page 62: Program Profiling: Applications, Algorithms and Tools

Data Analysis

What does large sample count for a PC mean?
  high frequency
  lots of stalls
    branch mispredict
    I-cache, D-cache miss
  static vs. dynamic?

DCPI generates for each instruction
  frequency
  CPI
  set of likely “culprits”

Page 63: Program Profiling: Applications, Algorithms and Tools

Estimating Frequency and CPI

F_i
  frequency of instruction i
C_i
  average number of cycles instruction i spent at head of instruction queue
S_i
  sample count for instruction i

Approach
  S_i ≈ F_i * C_i
  find F_i, then obtain C_i

Page 64: Program Profiling: Applications, Algorithms and Tools

Estimating F_i

Identify basic blocks that are guaranteed to execute same number of times
  cycle-equivalence algorithm

Estimate frequency for blocks in same class

Propagate frequencies using spanning tree

Page 65: Program Profiling: Applications, Algorithms and Tools

Estimate M_i

Estimate M_i, the min. # of cycles i spends at head of issue queue
  based on scheduling basic blocks according to processor model
  assumes some instruction in class(i) does not incur stall
  F_i ≈ S_i / M_i

Problems
  no/few samples in class(i)
  multiple control predecessors
  dynamic stalls
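The estimate F_i ≈ S_i / M_i can be sketched for one equivalence class as follows. Taking the smallest ratio in the class is a simplification chosen for this sketch, justified only by the stated assumption that some instruction in the class does not stall (stalls can only inflate S_i). It is not claimed to be DCPI's actual heuristic, and both function names are hypothetical.

```c
#include <stddef.h>

/* Estimate the execution frequency of a class of instructions that all
 * execute the same number of times. s[i] is the sample count and m[i]
 * the minimum head-of-queue cycles (m[i] > 0) of instruction i.
 * Returns the smallest ratio s[i] / m[i], or -1.0 for an empty class. */
double estimate_frequency(const double *s, const double *m, size_t n)
{
    double best = -1.0;
    for (size_t i = 0; i < n; i++) {
        double ratio = s[i] / m[i];
        if (best < 0.0 || ratio < best)
            best = ratio;
    }
    return best;
}

/* Given a frequency estimate f_i, CPI follows from S_i ~= F_i * C_i. */
double estimate_cpi(double s_i, double f_i)
{
    return f_i > 0.0 ? s_i / f_i : 0.0;
}
```

For a class with sample counts 100, 250, 400 and minimum cycles 1, 2, 1, the ratios are 100, 125 and 400, so the frequency estimate is 100 and the third instruction's estimated CPI is 4. The "Problems" listed on the slide are exactly the cases where such a ratio is unreliable.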

Page 66: Program Profiling: Applications, Algorithms and Tools

Overview

Why profile?
Applications -> What to profile?
Algorithms
How to profile?
Infrastructure/tools
Future directions for profiling

Page 67: Program Profiling: Applications, Algorithms and Tools

Binary Modification

EEL [Larus/Schnarr]
  Sparc/Solaris
  Formed basis for
    QPT: edge profiling and tracing
    PP: path profiling
    Wisconsin Wind Tunnel Tools

ATOM [Srivastava et al.]
  Alpha
  Simpler instrumentation interface than EEL
  Many tools

Vulcan [Srivastava et al.]
  x86/IA-64/MSIL
  Microsoft performance, coverage tools, code reorganization

DynInst [Hollingsworth]
  multiple platform, dynamic instrumentation

Page 68: Program Profiling: Applications, Algorithms and Tools

Performance Workbenches

SGI

JInsight [IBM research]
  sophisticated profiling for Java
  nice user interface

Paradyn [Miller, et al.]
  parallel performance measurement

Page 69: Program Profiling: Applications, Algorithms and Tools

Simulators

Shade [Cmelik/Keppel]
SimOS [Rosenblum, et al.]
SimpleScalar [Austin]
On-line cache simulation [Lebeck]

Page 70: Program Profiling: Applications, Algorithms and Tools

Much other work

Profiling of concurrent/distributed programs

Profiling of functional languages

Static estimation of profiles (static branch prediction)

Formalization of profiling

Profiling/garbage collection

Page 71: Program Profiling: Applications, Algorithms and Tools

Future Work

Profiling distributed computation
Worldwide profiling
Debugging with profiles