analyses and optimizations for multithreaded programs

Analyses and Optimizations for Multithreaded Programs

Martin Rinard, Alex Salcianu, Brian Demsky

MIT Laboratory for Computer Science

John Whaley IBM Tokyo Research Laboratory

Motivation
• Threads are Ubiquitous
  • Parallel Programming for Performance
  • Manage Multiple Connections
  • System Structuring Mechanism
• Overhead
  • Thread Management
  • Synchronization
• Opportunities
  • Improved Memory Management

What This Talk is About
• New Abstraction: Parallel Interaction Graph
  • Points-To Information
  • Reachability and Escape Information
  • Interaction Information
    • Caller-Callee Interactions
    • Starter-Startee Interactions
  • Action Ordering Information
• Analysis Algorithm
• Analysis Uses (synchronization elimination, stack allocation, per-thread heap allocation)

Outline
• Example
• Analysis Representation and Algorithm
• Lightweight Threads
• Results
• Conclusion

Sum Sequence of Numbers
9 8 1 5 3 7 2 6

Group in Subsequences
(9 8) (1 5) (3 7) (2 6)

Sum Subsequences (in Parallel)
9 + 8 = 17, 1 + 5 = 6, 3 + 7 = 10, 2 + 6 = 8

Add Sums Into Accumulator
[Animation: the accumulator starts at 0 and the subsequence sums are added one at a time, so it takes the values 0, 17, 23, 33, 41.]

Common Schema
• Set of tasks
• Chunk tasks to increase granularity
• Tasks have both
  • Independent computation
  • Updates to shared data

Realization in Java

class Accumulator {
  int value = 0;
  synchronized void add(int v) {
    value += v;
  }
}

Realization in Java

class Task extends Thread {
  Vector work;
  Accumulator dest;
  Task(Vector w, Accumulator d) {
    work = w;
    dest = d;
  }

  public void run() {
    int sum = 0;
    Enumeration e = work.elements();
    while (e.hasMoreElements())
      sum += ((Integer) e.nextElement()).intValue();
    dest.add(sum);
  }
}

[Figure: a Task object whose work field references a Vector and whose dest field references the Accumulator; executing run() also creates an Enumeration over the Vector.]

Realization in Java

void generateTask(int l, int u, Accumulator a) {
  Vector v = new Vector();
  for (int j = l; j < u; j++)
    v.addElement(new Integer(j));
  Task t = new Task(v, a);
  t.start();
}

void generate(int n, int m, Accumulator a) {
  // Task i covers the half-open range [i*m, (i+1)*m): n tasks of m numbers each.
  for (int i = 0; i < n; i++)
    generateTask(i * m, (i + 1) * m, a);
}
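The schema above can be tried end to end with a small driver. The following is a minimal sketch, not the authors' code: the names SumAccumulator, SumTask, SumDemo, and parallelSum are hypothetical, it uses List<Integer> instead of Vector for brevity, and it adds a join step (not shown on the slides) so the final sum can be read.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; a sketch of the task/accumulator schema.
class SumAccumulator {
    private int value = 0;
    synchronized void add(int v) { value += v; }
    synchronized int get() { return value; }
}

class SumTask extends Thread {
    private final List<Integer> work;
    private final SumAccumulator dest;

    SumTask(List<Integer> work, SumAccumulator dest) {
        this.work = work;
        this.dest = dest;
    }

    @Override
    public void run() {
        int sum = 0;                  // independent computation
        for (int x : work) sum += x;
        dest.add(sum);                // single update to shared data
    }
}

class SumDemo {
    // Sums data in chunks of the given size, one thread per chunk.
    static int parallelSum(int[] data, int chunk) {
        SumAccumulator acc = new SumAccumulator();
        List<SumTask> tasks = new ArrayList<>();
        for (int lo = 0; lo < data.length; lo += chunk) {
            List<Integer> work = new ArrayList<>();
            for (int j = lo; j < Math.min(lo + chunk, data.length); j++) work.add(data[j]);
            SumTask t = new SumTask(work, acc);
            tasks.add(t);
            t.start();
        }
        try {
            for (SumTask t : tasks) t.join();   // assumption: wait for every task
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return acc.get();
    }

    public static void main(String[] args) {
        // The sequence from the example slides, grouped in pairs.
        System.out.println(parallelSum(new int[] {9, 8, 1, 5, 3, 7, 2, 6}, 2)); // prints 41
    }
}
```

Each task touches the shared accumulator exactly once, which is why chunking raises granularity without changing the result.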

Task Generation
[Animation: each call to generateTask creates a Vector, fills it with Integer objects, then creates and starts a Task whose work field references that Vector and whose dest field references the shared Accumulator. After three calls, three Tasks with separate Vectors share one Accumulator.]

Analysis

Analysis Overview
• Interprocedural
• Interthread
• Flow-sensitive
  • Statement ordering within thread
  • Action ordering between threads
• Compositional, Bottom Up
• Explicitly Represent Potential Interactions Between Analyzed and Unanalyzed Parts
• Partial Program Analysis

Analysis Result for run Method

public void run() {
  int sum = 0;
  Enumeration e = work.elements();
  while (e.hasMoreElements())
    sum += ((Integer) e.nextElement()).intValue();
  dest.add(sum);
}

[Figure: points-to graph for run. The Task node (this) has a work edge to the Vector node and a dest edge to the Accumulator node; the graph also contains an Enumeration node.]

• Abstraction: Points-to Graph
  • Nodes Represent Objects
  • Edges Represent References
• Inside Nodes
  • Objects Created Within Current Analysis Scope
  • One Inside Node Per Allocation Site
  • Represents All Objects Created At That Site
• Outside Nodes
  • Objects Created Outside Current Analysis Scope
  • Objects Accessed Via References Created Outside Current Analysis Scope
  • One per Static Class Field
  • One per Parameter
  • One per Load Statement
    • Represents Objects Loaded at That Statement
• Inside Edges
  • References Created Inside Current Analysis Scope
• Outside Edges
  • References Created Outside Current Analysis Scope
  • Potential Interactions in Which Analyzed Part Reads Reference Created in Unanalyzed Part
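One way to picture the node and edge kinds is as a small data structure. This is a sketch under assumptions, not the representation used by the analysis: the types NodeKind, GraphNode, GraphEdge, and PointsToGraphSketch are hypothetical, and the field name "source" for the Enumeration's reference to its Vector is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical representation; every name here is illustrative.
enum NodeKind { INSIDE, PARAMETER, LOAD, STATIC_FIELD }

final class GraphNode {
    final String label;
    final NodeKind kind;

    GraphNode(String label, NodeKind kind) {
        this.label = label;
        this.kind = kind;
    }

    // Every kind other than INSIDE is an outside node.
    boolean isOutside() { return kind != NodeKind.INSIDE; }
}

final class GraphEdge {
    final GraphNode from;
    final String field;
    final GraphNode to;
    final boolean inside;   // true: reference created inside the analysis scope

    GraphEdge(GraphNode from, String field, GraphNode to, boolean inside) {
        this.from = from;
        this.field = field;
        this.to = to;
        this.inside = inside;
    }
}

class PointsToGraphSketch {
    final List<GraphEdge> edges = new ArrayList<>();

    void addInsideEdge(GraphNode from, String field, GraphNode to) {
        edges.add(new GraphEdge(from, field, to, true));
    }

    void addOutsideEdge(GraphNode from, String field, GraphNode to) {
        edges.add(new GraphEdge(from, field, to, false));
    }

    long count(boolean inside) {
        return edges.stream().filter(e -> e.inside == inside).count();
    }

    // Graph for run(): this is a parameter node; work and dest are read by loads.
    static PointsToGraphSketch forRun() {
        GraphNode thisNode = new GraphNode("this", NodeKind.PARAMETER);
        GraphNode vector = new GraphNode("load this.work", NodeKind.LOAD);
        GraphNode accumulator = new GraphNode("load this.dest", NodeKind.LOAD);
        GraphNode enumeration = new GraphNode("new Enumeration", NodeKind.INSIDE);
        PointsToGraphSketch g = new PointsToGraphSketch();
        g.addOutsideEdge(thisNode, "work", vector);
        g.addOutsideEdge(thisNode, "dest", accumulator);
        // Assumption: the Enumeration refers to its Vector through a field called "source".
        g.addInsideEdge(enumeration, "source", vector);
        return g;
    }

    public static void main(String[] args) {
        PointsToGraphSketch g = forRun();
        System.out.println(g.count(true) + " inside, " + g.count(false) + " outside");
    }
}
```

The two outside edges record that run reads references it did not create, which is exactly the information later matched against the caller or starter.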

Concept of Escaped Node
• Escaped Nodes Represent Objects Accessible Outside Current Analysis Scope
  • parameter nodes, load nodes
  • static class field nodes
  • nodes passed to unanalyzed methods
  • nodes reachable from unanalyzed but started threads
  • nodes reachable from escaped nodes
• Node is Captured if it is Not Escaped
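The last escape rule (reachability from escaped nodes) makes the escaped set a closure. A minimal sketch, assuming nodes are plain string labels and that the directly escaping nodes are supplied as seeds; EscapeSketch and its methods are hypothetical names, and the node "cell" is invented to show an inside node that escapes by being stored into an escaped object.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: escaped = seeds plus everything reachable from them.
class EscapeSketch {
    // node -> nodes it references (inside and outside edges alike)
    final Map<String, Set<String>> succ = new HashMap<>();

    void addEdge(String from, String to) {
        succ.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // seeds: nodes that escape directly (parameter, load, static field nodes, ...)
    Set<String> escaped(Set<String> seeds) {
        Set<String> result = new HashSet<>(seeds);
        Deque<String> work = new ArrayDeque<>(seeds);
        while (!work.isEmpty()) {
            String n = work.removeFirst();
            for (String m : succ.getOrDefault(n, Collections.<String>emptySet())) {
                if (result.add(m)) work.addLast(m);   // newly escaped: explore it too
            }
        }
        return result;
    }

    // A node is captured if it is not escaped.
    boolean captured(String node, Set<String> seeds) {
        return !escaped(seeds).contains(node);
    }

    static EscapeSketch runGraph() {
        EscapeSketch g = new EscapeSketch();
        g.addEdge("this", "vector");
        g.addEdge("this", "accumulator");
        g.addEdge("enumeration", "vector");
        g.addEdge("accumulator", "cell");   // hypothetical inside node stored into an escaped object
        return g;
    }

    public static void main(String[] args) {
        Set<String> seeds = new HashSet<>(Arrays.asList("this", "vector", "accumulator"));
        EscapeSketch g = runGraph();
        System.out.println("enumeration captured: " + g.captured("enumeration", seeds));
        System.out.println("cell captured: " + g.captured("cell", seeds));
    }
}
```

Note the direction of reachability: the Enumeration points to an escaped node but nothing escaped points to it, so it stays captured.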

Why Escaped Concept is Important
• Completeness of Analysis Information
  • Complete information for captured nodes
  • Potentially incomplete for escaped nodes
• Lifetime Implications
  • Captured nodes are inaccessible when analyzed part of the program terminates
• Memory Management Optimizations
  • Stack allocation
  • Per-Thread Heap Allocation

Intrathread Dataflow Analysis
• Computes a points-to escape graph for each program point
• Points-to escape graph is a triple <I, O, e>
  • I - set of inside edges
  • O - set of outside edges
  • e - escape information for each node

Dataflow Analysis
• Initial state:
  I: formals point to parameter nodes, classes point to class nodes
  O: ∅
• Transfer functions:
  I' = (I - Kill_I) ∪ Gen_I
  O' = O ∪ Gen_O
• Confluence operator is ∪

Intraprocedural Analysis
• Must define transfer functions for:
  • copy statement l = v
  • load statement l1 = l2.f
  • store statement l1.f = l2
  • return statement return l
  • object creation site l = new cl
  • method invocation l = l0.op(l1…lk)

copy statement l = v
Kill_I = edges(I, l)
Gen_I = {l} × succ(I, v)
I' = (I - Kill_I) ∪ Gen_I
[Figure: existing edges from l and v before the statement; generated edges from l to every node that v points to.]
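The copy rule and the union confluence operator can be sketched over a simplified state. The sketch assumes I is restricted to edges from local variables and is stored as a map from variable name to a set of node labels; CopyTransferSketch and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: I restricted to variable edges, variable -> set of node labels.
class CopyTransferSketch {
    static Map<String, Set<String>> deepCopy(Map<String, Set<String>> in) {
        Map<String, Set<String>> out = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : in.entrySet()) {
            out.put(e.getKey(), new HashSet<>(e.getValue()));
        }
        return out;
    }

    // l = v :  I' = (I - Kill_I) U Gen_I
    static Map<String, Set<String>> copy(Map<String, Set<String>> in, String l, String v) {
        Map<String, Set<String>> out = deepCopy(in);
        // Gen_I = {l} x succ(I, v)
        Set<String> gen = new HashSet<>(in.getOrDefault(v, Collections.<String>emptySet()));
        // Replacing the entry for l drops edges(I, l) (the kill) and installs Gen_I.
        out.put(l, gen);
        return out;
    }

    // Confluence operator: edge-wise union of the two incoming states.
    static Map<String, Set<String>> join(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> out = deepCopy(a);
        for (Map.Entry<String, Set<String>> e : b.entrySet()) {
            out.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
        }
        return out;
    }

    // Models "if (c) l = v;" followed by the merge of both branches.
    static Set<String> demo() {
        Map<String, Set<String>> in = new HashMap<>();
        in.put("l", new HashSet<>(Arrays.asList("n1")));
        in.put("v", new HashSet<>(Arrays.asList("n2", "n3")));
        Map<String, Set<String>> taken = copy(in, "l", "v");
        return join(taken, in).get("l");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

On the branch that executes the copy, l loses n1 (a strong update); after the merge l may point to n1, n2, or n3.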

load statement l1 = l2.f
S_E = {n2 in succ(I, l2) . escaped(n2)}
S_I = ∪{succ(I, n2, f) . n2 in succ(I, l2)}
case 1: l2 does not point to an escaped node (S_E = ∅)
Kill_I = edges(I, l1)
Gen_I = {l1} × S_I
[Figure: existing edges from l2 and from its targets through field f; generated edges from l1 to the nodes reached through f.]

load statement l1 = l2.f
case 2: l2 does point to an escaped node (S_E ≠ ∅)
Kill_I = edges(I, l1)
Gen_I = {l1} × (S_I ∪ {n})
Gen_O = (S_E × {f}) × {n}
where n is the load node for l1 = l2.f
[Figure: existing edges from l1 and l2; generated edge from l1 to the load node n, and an outside edge labeled f from the escaped node to n.]
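Both cases of the load rule fit in one function. The following is a sketch under assumptions: nodes are string labels, heap edges are keyed by "node.field", and the class LoadTransferSketch with its fields vars, insideHeap, outsideHeap, and escaped is hypothetical. It marks the load node as escaped, following the earlier slide that lists load nodes among the escaped nodes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the load transfer function l1 = l2.f.
class LoadTransferSketch {
    final Map<String, Set<String>> vars = new HashMap<>();         // variable -> nodes
    final Map<String, Set<String>> insideHeap = new HashMap<>();   // "node.f" -> nodes (inside edges)
    final Map<String, Set<String>> outsideHeap = new HashMap<>();  // "node.f" -> nodes (outside edges)
    final Set<String> escaped = new HashSet<>();

    static String key(String node, String f) { return node + "." + f; }

    void load(String l1, String l2, String f, String loadNode) {
        Set<String> se = new HashSet<>();   // S_E: escaped nodes that l2 points to
        Set<String> si = new HashSet<>();   // S_I: nodes reached through inside f edges
        for (String n2 : vars.getOrDefault(l2, Collections.<String>emptySet())) {
            if (escaped.contains(n2)) se.add(n2);
            si.addAll(insideHeap.getOrDefault(key(n2, f), Collections.<String>emptySet()));
        }
        Set<String> gen = new HashSet<>(si);
        if (!se.isEmpty()) {
            // Case 2: the reference may have been written outside the analysis scope.
            gen.add(loadNode);
            for (String n2 : se) {
                outsideHeap.computeIfAbsent(key(n2, f), k -> new HashSet<>()).add(loadNode);
            }
            escaped.add(loadNode);          // load nodes count as escaped
        }
        vars.put(l1, gen);                  // kills edges(I, l1), installs Gen_I
    }

    static LoadTransferSketch demo() {
        LoadTransferSketch s = new LoadTransferSketch();
        s.vars.put("this", new HashSet<>(Arrays.asList("pThis")));
        s.escaped.add("pThis");             // parameter node
        s.vars.put("x", new HashSet<>(Arrays.asList("n1")));
        s.insideHeap.put(key("n1", "f"), new HashSet<>(Arrays.asList("n2")));
        s.load("y", "x", "f", "loadY");         // case 1: n1 is captured
        s.load("w", "this", "work", "loadW");   // case 2: pThis is escaped
        return s;
    }

    public static void main(String[] args) {
        LoadTransferSketch s = demo();
        System.out.println("y -> " + s.vars.get("y"));
        System.out.println("w -> " + s.vars.get("w"));
        System.out.println("outside edges: " + s.outsideHeap);
    }
}
```

In case 1 no load node is needed because every reference stored in a captured object is already visible as an inside edge.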

store statement l1.f = l2
Gen_I = (succ(I, l1) × {f}) × succ(I, l2)
I' = I ∪ Gen_I
[Figure: existing edges from l1 and l2; generated edges labeled f from every node l1 points to, to every node l2 points to.]

object creation site l = new cl
Kill_I = edges(I, l)
Gen_I = {<l, n>}
where n is inside node for l = new cl
[Figure: existing edges from l are removed; generated edge from l to n.]
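The store and object creation rules differ in whether they kill: creation replaces the edges from l, while a store only adds edges. A minimal sketch with the same simplified state as before (string node labels, heap edges keyed by "node.field"); StoreNewSketch and the node names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the transfer functions for "l = new cl" and "l1.f = l2".
class StoreNewSketch {
    final Map<String, Set<String>> vars = new HashMap<>();        // variable -> nodes
    final Map<String, Set<String>> insideHeap = new HashMap<>();  // "node.f" -> nodes

    // l = new cl : Kill_I = edges(I, l), Gen_I = {<l, n>} with n the site's inside node.
    void allocate(String l, String insideNode) {
        vars.put(l, new HashSet<>(Arrays.asList(insideNode)));
    }

    // l1.f = l2 : Gen_I = (succ(I, l1) x {f}) x succ(I, l2); nothing is killed.
    void store(String l1, String f, String l2) {
        Set<String> targets = vars.getOrDefault(l2, Collections.<String>emptySet());
        for (String n1 : vars.getOrDefault(l1, Collections.<String>emptySet())) {
            insideHeap.computeIfAbsent(n1 + "." + f, k -> new HashSet<>()).addAll(targets);
        }
    }

    static StoreNewSketch demo() {
        StoreNewSketch s = new StoreNewSketch();
        s.allocate("t", "nTask");
        s.allocate("v", "nVector1");
        s.store("t", "work", "v");      // nTask.work -> nVector1
        s.allocate("v", "nVector2");    // v now points only to nVector2
        s.store("t", "work", "v");      // earlier edge is kept, new edge is added
        return s;
    }

    public static void main(String[] args) {
        StoreNewSketch s = demo();
        System.out.println("v -> " + s.vars.get("v"));
        System.out.println("nTask.work -> " + s.insideHeap.get("nTask.work"));
    }
}
```

Because one node may stand for many objects, the store cannot safely overwrite earlier edges, which is why the rule has no kill set.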

Method Call
• Analysis of a method call:
  • Start with points-to escape graph before the call site
  • Retrieve the points-to escape graph from analysis of callee
  • Map outside nodes of callee graph to nodes of caller graph
  • Combine callee graph into caller graph
• Result is the points-to escape graph after the call site

Example: t = new Task(v,a)
• Start With Graph Before Call
  [Figure: points-to escape graph before call to t = new Task(v,a), with variables v, t, a.]
• Retrieve Graph from Callee
  [Figure: points-to escape graph from analysis of Task(w,d): this has work and dest edges to the nodes for w and d.]
• Map Parameters from Callee to Caller
  [Figure: this, w, d mapped to the nodes that t, v, a point to.]
• Transfer Edges from Callee to Caller
  [Figure: combined graph after call to t = new Task(v,a): work and dest edges added in the caller graph.]
• Discard Parameter Nodes from Callee
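The parameter mapping and edge transfer for this example can be sketched as a projection of callee edges through a node mapping. This is a simplified illustration, not the full algorithm: it covers only the mapping of parameter nodes and the projection of inside edges, it omits the matching of callee outside edges against caller inside edges, and CallMappingSketch, project, and the node labels are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: project callee inside edges into the caller graph.
class CallMappingSketch {
    // An edge is a (from, field, to) triple; Arrays.asList gives value equality.
    static List<String> edge(String from, String field, String to) {
        return Arrays.asList(from, field, to);
    }

    // mu maps each callee node to the caller nodes it may stand for.
    static Set<List<String>> project(Set<List<String>> calleeInside, Map<String, Set<String>> mu) {
        Set<List<String>> out = new HashSet<>();
        for (List<String> e : calleeInside) {
            for (String from : mu.getOrDefault(e.get(0), Collections.<String>emptySet())) {
                for (String to : mu.getOrDefault(e.get(2), Collections.<String>emptySet())) {
                    out.add(edge(from, e.get(1), to));
                }
            }
        }
        return out;
    }

    static Set<List<String>> demo() {
        // Callee graph for Task(w, d): this.work = w; this.dest = d;
        Set<List<String>> callee = new HashSet<>();
        callee.add(edge("pThis", "work", "pW"));
        callee.add(edge("pThis", "dest", "pD"));

        // Map formals to the nodes the actuals point to at t = new Task(v, a).
        Map<String, Set<String>> mu = new HashMap<>();
        mu.put("pThis", new HashSet<>(Arrays.asList("nTask")));
        mu.put("pW", new HashSet<>(Arrays.asList("nVector")));
        mu.put("pD", new HashSet<>(Arrays.asList("nAccumulator")));

        // Parameter nodes appear only as mapping sources, so they are discarded implicitly.
        return project(callee, mu);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The result contains the work and dest edges from the Task node in the caller graph, with no callee parameter nodes left over.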


More General Example: x.foo()
[Figures: caller graph at x.foo() with labels x, y, z; callee graph for foo() with this.]
• Initialize Mapping: Map Formals to Actuals
• Extend Mapping: Match Inside and Outside Edges
  • Mapping is Unidirectional: From Callee to Caller
• Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes
• Combine Mapping: Project Edges from Callee Into Combined Graph
• Discard Callee Graph
• Discard Outside Edges From Captured Nodes

Interthread Analysis
• Augment Analysis Representation
  • Parallel Thread Set
  • Action Set (read, write, sync, create edge)
  • Action Ordering Information (relative to thread start actions)
• Thread Interaction Analysis
  • Combine points-to graphs
  • Induces combination of other information
• Can perform interthread analysis at any point to improve precision of results

Combining Points-to Graphs
[Figures: points-to escape graph of the starter sometime after call to x.start(); graph for the startee's run() with this.]
• Initialize Mapping: Map Startee Thread to Starter Thread
• Mapping is Bidirectional
  • From Startee to Starter
  • From Starter to Startee
• Complete Mapping: Automap Load and Inside Nodes Reachable from Mapped Nodes
• Combine Graphs: Project Edges Through Mappings Into Combined Graph
  [Figure: combined points-to escape graph sometime after call to x.start().]
• Discard Startee Thread Node
• Discard Outside Edges From Captured Nodes

Life is not so Simple
• Dependences between phases
• Mapping best framed as constraint satisfaction problem
• Solved using constraint satisfaction algorithm

Interthread Analysis With Actions and Ordering

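Analysis Result for generateTask
• Points-to Graph: [Figure: variable t points to Task node a, which has work and dest edges; the graph also contains nodes b, c, d, e, including the Vector and the Accumulator.]
• Parallel Threads: a
• Actions: wr a, wr b, wr c, wr d, sync b, rd b
• Action Ordering: “All actions happen before thread a starts executing”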

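Analysis Result for run
• Points-to Graph: [Figure: this points to Task node 1, which has work and dest edges; the graph also contains nodes 2 through 5, including the Vector and the Accumulator, and Enumeration node 6.]
• Parallel Threads: no parallel threads
• Actions: rd 1, rd 2, rd 3, rd 4, rd 5, wr 5, sync 2, rd 6, wr 6, sync 5, edge(1,2), edge(1,5), edge(2,3), edge(3,4)
• Action Ordering: none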

Role of edge(1,2) Actions
• One edge action for each outside edge
• Action order for edge actions improves precision of interthread analysis
• If starter thread reads a reference before startee thread is started
  • Then reference was not created by startee thread
• Outside edge actions record order
• Inside edges from startee matched only against parallel outside edges


Edge Actions in Combining Points-to Graphs
[Figures: starter graph at x.start() with nodes 1, 2, 3; startee graph for run() with this.]
• First case: Action Ordering is edge(1,2) || 1
• Second case: Action Ordering is none (i.e., edge(1,2) created before 1 started)

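Analysis Result After Interaction
• Points-to Graph: [Figure: variable t points to Task node a, which has work and dest edges; the graph also contains nodes b, c, d, e, including the Vector and the Accumulator.]
• Parallel Threads: a
• Actions: wr a, wr b, wr c, wr d, sync b, rd b
• Actions paired with thread a: rd a, a; rd b, a; rd c, a; rd d, a; rd e, a; wr e, a; sync b, a; sync e, a
• Action Ordering: “All actions from current thread happen before thread a starts executing”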

Roles of Intrathread and Interthread Analyses
• Basic Analysis
  • Intrathread analysis delivers parallel interaction graph at each program point
    • records parallel threads
    • does not compute thread interaction
  • Choose program point (end of method)
  • Interthread analysis delivers additional precision at that program point
• Does not exploit ordering information from thread join constructs

Join Ordering

t = new Task();
t.start();
  “computation that runs in parallel with task t”
t.join();
  “computation that runs after task t”

t.run();
  “computation from task t”

Exploiting Join Ordering
• At join point
  • Interthread analysis delivers new (more precise) parallel interaction graph
  • Intrathread analysis uses new graph
• No parallel interactions between
  • Thread
  • Computation after join
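The ordering that join provides is visible in plain Java: everything the joined thread did happens before the code after join. A minimal sketch; the class names Worker and JoinOrderingDemo are hypothetical, and the workload is invented for illustration.

```java
// Hypothetical sketch of the three regions: parallel with t, computation of t, after t.
class Worker extends Thread {
    int result;   // plain field: written only by this thread, read only after join()

    @Override
    public void run() {
        int s = 0;
        for (int i = 1; i <= 100; i++) s += i;   // computation from task t
        result = s;
    }
}

class JoinOrderingDemo {
    static int runOnce() {
        Worker t = new Worker();
        t.start();

        int local = 0;
        for (int i = 1; i <= 10; i++) local += i;   // runs in parallel with task t

        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // Runs after task t: no parallel interaction with t is possible here,
        // so reading t.result needs no synchronization.
        return local + t.result;
    }

    public static void main(String[] args) {
        System.out.println(runOnce()); // prints 5105
    }
}
```

An analysis that exploits join ordering can treat the code after join like sequential code with respect to t.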

Extensions
• Partial program analysis
  • can analyze method independent of callers
  • can analyze method independent of methods it invokes
  • can incrementally analyze callees to improve precision
• Dial down precision to improve efficiency
• Demand-driven formulations

Key Ideas
• Explicitly represent potential interactions between analyzed and unanalyzed parts
  • Inside versus outside nodes and edges
  • Escaped versus captured nodes
  • Precisely bound ignorance
• Exploit ordering information
  • intrathread (flow sensitive)
  • interthread (starts, edge orders, joins)

Analysis Uses

Overheads in Standard Execution and How to Eliminate Them

Intrathread Analysis Result from End of run Method
[Figure: points-to graph for run with nodes 1 through 6, including the Task (this), the Vector, the Accumulator, and Enumeration node 6.]
• Enumeration object is captured
  • Does not escape to caller
  • Does not escape to parallel threads
• Lifetime of Enumeration object is bounded by lifetime of run
• Can allocate Enumeration object on call stack instead of heap

Interthread Analysis Result from End of generateTask Method
[Figure: the analysis result after interaction: points-to graph with nodes a through e, parallel thread a, the action set, and the action ordering “All actions from current thread happen before thread a starts executing”.]
• Vector object is captured
• Multiple threads synchronize on Vector object
• But synchronizations from different threads do not occur concurrently
• Can eliminate synchronization on Vector object

Interthread Analysis Result from End of generateTask Method
[Figure: the analysis result after interaction: points-to graph with nodes a through e, parallel thread a, the action set, and the action ordering “All actions from current thread happen before thread a starts executing”.]
• Vectors, Tasks, Integers captured
• Parent, child access objects
• Parent completes accesses before child starts accesses
• Can allocate objects on child’s per-thread heap

Thread Overhead
• Inefficient Thread Implementations
  • Thread Creation Overhead
  • Thread Management Overhead
  • Stack Overhead
• Use a more efficient thread implementation
  • User-level thread management
  • Per-thread heaps
  • Event-driven form

Standard Thread Implementation
[Figure: a thread stack of call frames, each holding a return address, a frame pointer, and locals (x, y; a, b, c), with a save area added at a context switch.]
• Call frames allocated on stack
• Context Switch
  • Save state on stack
  • Resume another thread
• One stack per thread

Event-Driven Form
[Figure: stack frames (return address, frame pointer, locals x, y; a, b, c) and heap continuations holding the live variables (c, x), each with a resume method.]
• Build continuation on heap
• Copy out live variables
• Return out of computation
• Resume another continuation
• One stack per processor
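The event-driven form can be imitated by hand in Java to show what the compiler transformation produces. This is a sketch under assumptions, not the output of the Flex compiler: Continuation, EventLoop, SumContinuation, and EventDrivenDemo are hypothetical names, and the thread body (a loop that yields after each addition) is invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: thread bodies rewritten as heap-allocated continuations.
interface Continuation {
    void resume(EventLoop loop);
}

// One loop, hence one stack, runs every continuation in turn.
class EventLoop {
    private final Deque<Continuation> ready = new ArrayDeque<>();

    void schedule(Continuation c) { ready.addLast(c); }

    void run() {
        while (!ready.isEmpty()) ready.removeFirst().resume(this);
    }
}

// Event-driven form of the thread body:
//   int sum = 0; for (int i = 0; i < n; i++) { sum += i; yield(); } log(name, sum);
class SumContinuation implements Continuation {
    private final String name;
    private final int i;
    private final int n;
    private final int sum;
    private final StringBuilder log;

    SumContinuation(String name, int i, int n, int sum, StringBuilder log) {
        this.name = name;
        this.i = i;
        this.n = n;
        this.sum = sum;
        this.log = log;
    }

    @Override
    public void resume(EventLoop loop) {
        if (i < n) {
            // Copy the live variables (i, sum) into a new continuation on the heap,
            // then return out of the computation so another continuation can run.
            loop.schedule(new SumContinuation(name, i + 1, n, sum + i, log));
        } else {
            log.append(name).append('=').append(sum).append(';');
        }
    }
}

class EventDrivenDemo {
    static String runTwo() {
        StringBuilder log = new StringBuilder();
        EventLoop loop = new EventLoop();
        loop.schedule(new SumContinuation("A", 0, 3, 0, log));
        loop.schedule(new SumContinuation("B", 0, 2, 0, log));
        loop.run();
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(runTwo()); // prints B=1;A=3;
    }
}
```

The two logical threads interleave at every yield point, yet no per-thread stack exists and the shared log needs no lock because only one continuation runs at a time.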

Complications
• Standard thread models use blocking I/O
  • Automatically convert blocking I/O to asynchronous I/O
  • Scheduler manages interleaving of thread executions
• Stack Allocatable Objects May Be Live Across Blocking Calls
  • Transfer allocation to per-thread heap

Opportunity
• On a uniprocessor, compiler controls placement of context switch points
• If program does not hold lock across blocking call, can eliminate lock

Experimental Results
• MIT Flex Compiler System
  • Static Compiler
  • Native code for StrongARM
• Server Benchmarks
  • http, phone, echo, time
• Scientific Computing Benchmarks
  • water, barnes

Server Benchmark Characteristics

Benchmark  IR Size   Number of  Pre-Analysis  Intrathread   Interthread
           (instrs)  Methods    Time (secs)   Analysis      Analysis
                                              Time (secs)   Time (secs)
echo        4,639    131         28            74            73
time        4,573    136         29            70            74
http       10,643    292        103           199           269
phone       9,547    267         75           191           256

Percentage of Eliminated Synchronization Operations
[Bar chart: percentage (0 to 100) of eliminated synchronization operations for http, phone, time, echo, mtrt; series: Intrathread only, Interthread.]

Compilation Options for Performance Results
• Standard
  • kernel threads, synch included
• Event-Driven
  • event-driven, no synch at all
• +Per-Thread Heap
  • event-driven, no synch at all, per-thread heap allocation

Throughput (Responses per Second)
[Bar chart: throughput (0 to 400 responses per second) for echo, time, http2K, http20K, phone; series: Standard, Event-Driven, +Per-Thread Heap.]

Scientific Benchmark Characteristics

Benchmark  IR Size   Number of  Total Analysis  Pre-Analysis
           (instrs)  Methods    Time (secs)     Time (secs)
water      25,583    335        1156            380
barnes     19,764    364         491            129

Compiler Options
0: Sequential C++
1: Baseline - Kernel Threads
2: Lightweight Threads
3: Lightweight Threads + Stack Allocation
4: Lightweight Threads + Stack Allocation - Synchronization

Execution Times
[Bar chart: proportion of sequential C++ execution time (0 to 1) for water small, water, barnes; series: Baseline, +Light, +Stack, -Synch.]

Related Work
• Pointer Analysis for Sequential Programs
  • Chatterjee, Ryder, Landi (POPL 99)
  • Sathyanathan & Lam (LCPC 96)
  • Steensgaard (POPL 96)
  • Wilson & Lam (PLDI 95)
  • Emami, Ghiya, Hendren (PLDI 94)
  • Choi, Burke, Carini (POPL 93)

Related Work
• Pointer Analysis for Multithreaded Programs
  • Rugina and Rinard (PLDI 99) (fork-join parallelism, not compositional)
  • We have extended our points-to analysis for multithreaded programs (irregular, thread-based concurrency, compositional)
• Escape Analysis
  • Blanchet (POPL 98)
  • Deutsch (POPL 90, POPL 97)
  • Park & Goldberg (PLDI 92)

Related Work
• Synchronization Optimizations
  • Diniz & Rinard (LCPC 96, POPL 97)
  • Plevyak, Zhang, Chien (POPL 95)
  • Aldrich, Chambers, Sirer, Eggers (SAS 99)
  • Blanchet (OOPSLA 99)
  • Bogda, Hoelzle (OOPSLA 99)
  • Choi, Gupta, Serrano, Sreedhar, Midkiff (OOPSLA 99)
  • Ruf (PLDI 00)

Conclusion
• New Analysis Algorithm
  • Flow-sensitive, compositional
  • Multithreaded programs
  • Explicitly represent interactions between analyzed and unanalyzed parts
• Analysis Uses
  • Synchronization elimination
  • Stack allocation
  • Per-thread heap allocation
• Lightweight Threads
