cmpp 2004 - stirling - scotlandgroups.di.unipi.it/~aldinuc/talks/2004_lithium_cmpp_talk.pdf ·...

CMPP 2004 - Stirling - Scotland

Optimization Techniques for Implementing Parallel Skeletons in Grid Environments

Marco Aldinucci, ISTI-CNR, Pisa, Italy

Marco Danelutto, CS dept. Uni. Pisa, Italy

Jan Dünnweber, CS dept. Uni. Münster, Germany

Outline

Lithium

a pure Java skeletal framework exploiting MacroDataFlow

Future-based RMI

Future-based Lithium

Task lookahead, server-to-server lazy binding, load balancing

Definitely-not-Grid, quasi-Grid & maybe-Grid experiments

Measuring performance How-To & discussion

Research roadmap

Lithium - a skeletal Java framework - Uni. Pisa, Italy, 2001 - FGCS 19(5), 2003

Future-based RMI - asynchronous Java-RMI - Uni. Münster (Berlin), Germany - Euro-Par 2003, LNCS 2790

Future-based Lithium - implementation & validation - results analysis - CMPP’04

Lithium

A pure Java framework

providing skeletons as classes

seq, pipe, farm, map, reduce, D&C, while, ...

all can be nested in any order

all accept a stream of input tasks and deliver a stream of output tasks

ad-hoc parallelism via Java primitives (RMI, sockets, threads, ...)

in adherence to Cole’s manifesto

e.g. skeletons are natively stateless; however, the programmer may define a shared state by creating a storage server via RMI

Coding example: a simple pipe

    public PipeTest( ) {
        pipe = new Pipeline( );
        pipe.addStage(new Blur( ));
        pipe.addStage(new Oil( ));
    }

    public static void main(String[ ] args) {
        PipeTest test = new PipeTest( );
        Ske lithium = new Ske( );
        lithium.addHosts(args);          // define running environment
        lithium.skelProgram(test.pipe);
        lithium.setupTaskPool( ...);     // set up input stream
        lithium.parDo( );                // start parallel execution
        while (!lithium.isResEmpty( )) {
            // extract results
        }
    }

[Figure, three animation frames over the same diagram: TaskPool (input) -> blur( ) -> oil( ) -> TaskPool (output)]

1. Set up input stream

2. Apply the first stage: blur( )

3. Apply the second stage: oil( )

From skeletons to MacroDataFlow

    public StageMap(JSkeletons theSke ) {
        map = new Map(theSke);
        map.setParamRule(BLOCK);
        map.setOutRule(BLOCK);
    }

    public PipeMapMap( ) {
        StageMap stage1 = new StageMap(new Mandel());
        StageMap stage2 = new StageMap(new Filter());
        pipe = new Pipeline(stage1,stage2);
    }

[Figure: MacroDataFlow graph of the pipeline; stage1 expands into Mandel instructions, stage2 into Filter instructions]

Lithium JIT

MDF-instructions (MDFi) execution

The client is the user program enriched with framework code (e.g. control threads)

The execution of an MDFi may trigger one or more other MDFi

In practice, the MDF graph is sent to the servers once and is thereafter referred to by index

[Figure: the Client holds a ready queue and a waiting-for-tokens queue; its control threads are each connected to a LithiumServer]
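The ready queue / waiting-for-tokens idea can be illustrated with a tiny single-process interpreter. This is only a sketch under stated assumptions: the `Instr` class, its two-input layout and the `run` loop are hypothetical names invented for illustration, not Lithium's actual MDFi representation, and no remote servers are involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.IntBinaryOperator;

final class MdfDemo {
    /** Hypothetical MDF instruction: it becomes fireable once both input tokens are present. */
    static final class Instr {
        final IntBinaryOperator op;
        final int destInstr;              // index of the consumer instruction, -1 = program output
        final int destSlot;               // input slot of the consumer to fill
        final Integer[] in = new Integer[2];

        Instr(IntBinaryOperator op, int destInstr, int destSlot) {
            this.op = op;
            this.destInstr = destInstr;
            this.destSlot = destSlot;
        }

        boolean ready() { return in[0] != null && in[1] != null; }
    }

    /** Fires ready instructions until none is left; instructions lacking tokens simply wait. */
    static List<Integer> run(Instr[] graph) {
        Deque<Integer> readyQueue = new ArrayDeque<>();
        List<Integer> output = new ArrayList<>();
        for (int i = 0; i < graph.length; i++) {
            if (graph[i].ready()) readyQueue.add(i);
        }
        while (!readyQueue.isEmpty()) {
            Instr cur = graph[readyQueue.poll()];
            int result = cur.op.applyAsInt(cur.in[0], cur.in[1]);
            if (cur.destInstr < 0) {
                output.add(result);
                continue;
            }
            Instr dest = graph[cur.destInstr];
            dest.in[cur.destSlot] = result;                    // deliver the token
            if (dest.ready()) readyQueue.add(cur.destInstr);   // executing one MDFi enabled another
        }
        return output;
    }
}
```

Instructions are addressed by index, mirroring the remark that the graph is shipped once and then referred to by index.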

Future-based RMI

a flavor of asynchronous RMI

a remote method invocation that immediately returns a future, i.e. a reference to the return value

once computed, the real value must be linked to its future by issuing future.setValue( )

the real value may be retrieved by anybody holding the future by issuing future.getValue( )

getValue( ) forces the issuing thread to wait for a setValue( ) issued by the partner
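The setValue( ) / getValue( ) contract can be sketched locally with plain Java monitors. This is a minimal illustration, not the Future-based RMI implementation: `SimpleFuture` and `FutureDemo.asyncSquare` are hypothetical names, and a background thread stands in for the remote partner.

```java
/** Hypothetical single-assignment future, assumed for illustration only. */
final class SimpleFuture<T> {
    private T value;
    private boolean set = false;

    /** Links the real value to this future and wakes up every waiting thread. */
    public synchronized void setValue(T v) {
        if (set) throw new IllegalStateException("value already set");
        value = v;
        set = true;
        notifyAll();
    }

    /** Blocks the calling thread until the partner has issued setValue(). */
    public synchronized T getValue() {
        boolean interrupted = false;
        while (!set) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true;   // keep waiting, restore the flag afterwards
            }
        }
        if (interrupted) Thread.currentThread().interrupt();
        return value;
    }
}

final class FutureDemo {
    /** Simulates an asynchronous invocation: returns at once, computes in the background. */
    static SimpleFuture<Integer> asyncSquare(int x) {
        SimpleFuture<Integer> f = new SimpleFuture<>();
        new Thread(() -> f.setValue(x * x)).start();
        return f;
    }
}
```

A caller may hold the future for as long as it likes and pays the waiting cost only when it finally calls getValue( ).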

Standard RMI vs Future RMI

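[Figure, standard RMI; participants: Client, LithiumServer1, LithiumServer2]

Client -> LithiumServer1: A = MDFi_1( a ); LithiumServer1 -> Client: A

Client -> LithiumServer2: B = MDFi_2( A ); LithiumServer2 -> Client: B

[Figure, future RMI; participants: Client, LithiumServer1, LithiumServer2]

Client -> LithiumServer1: A = MDFi_1( a ); LithiumServer1 -> Client: ref( A )

Client -> LithiumServer2: B = MDFi_2( ref( A ) )

LithiumServer2 -> LithiumServer1: A = get( ref( in ) ); LithiumServer1 -> LithiumServer2: A

LithiumServer2 -> Client: B

    A = LithiumServer1.evalMDF1(a);
    B = LithiumServer2.evalMDF2(A);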

Future-based RMI: details

get/setValue() rely on a (non-RMI) proxy class that allows:

invocation as a local method (the partner is unknown)

local caching of values

the future reference implementation includes both the object’s hash value and its IP address

Future Based Lithium

Lithium + Future-based RMI

improve Lithium in three ways:

task lookahead

server-to-server lazy binding

load balancing

Task lookahead

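[Figure, standard RMI timeline; lanes: Lithium Application (Control Thread), Lithium Server. The control thread gets a task from TP, the server executes it, the control thread adds the result task to TP and gets the next one. Each step is separated by RMI comm, and the server shows idle time (overhead) between executions.]

[Figure, future-based timeline; same lanes. The control thread gets tasks from TP ahead of time and receives futures; the server executes tasks back to back while the control thread adds result tasks to TP.]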
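Task lookahead can be sketched as a control loop that keeps a bounded number of requests in flight instead of waiting for each result before sending the next task. The sketch below is an assumption-laden stand-in: a single-thread executor plays the role of one Lithium server, `CompletableFuture` replaces Future-based RMI, and `LookaheadDemo.run` with its `lookahead` parameter is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class LookaheadDemo {
    /**
     * Sends every task to a simulated server, keeping at most `lookahead`
     * requests outstanding; results are collected in task order.
     */
    static List<Integer> run(List<Integer> tasks, int lookahead, Function<Integer, Integer> work) {
        if (lookahead < 1) throw new IllegalArgumentException("lookahead must be >= 1");
        ExecutorService server = Executors.newSingleThreadExecutor(); // stands in for one server
        Deque<CompletableFuture<Integer>> inFlight = new ArrayDeque<>();
        List<Integer> results = new ArrayList<>();
        try {
            for (Integer task : tasks) {
                if (inFlight.size() == lookahead) {
                    // Window full: wait for the oldest result before dispatching more.
                    results.add(inFlight.removeFirst().join());
                }
                inFlight.addLast(CompletableFuture.supplyAsync(() -> work.apply(task), server));
            }
            while (!inFlight.isEmpty()) {
                results.add(inFlight.removeFirst().join());
            }
        } finally {
            server.shutdown();
        }
        return results;
    }
}
```

With `lookahead = 1` the loop degenerates to the synchronous pattern, where the server idles during every round trip; larger values let the next task be queued at the server while the current one is still executing.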

Server-to-server lazy binding

[Figure: Client control threads Thr1, Thr2, Thr3 issue calls to LithiumServer1 and LithiumServer2; the two servers exchange data directly]

Servers communicate directly with each other, so the number of messages carrying large data is halved

frequent pattern in Lithium (occurs for each skeleton but the farm)

data may be locally cached (no communications)
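The effect of lazy binding can be illustrated in a single process: the first server keeps its result and hands out only a reference, and the second server resolves that reference at the owner, so the large value never passes through the client. Everything here is a hypothetical stand-in: `Ref` holds an object pointer and a key rather than a hash value and IP address, and plain method calls replace remote invocations.

```java
import java.util.HashMap;
import java.util.Map;

final class LazyBindingDemo {
    /** Hypothetical future reference: names the owning server and a local key. */
    static final class Ref {
        final Server owner;
        final int key;
        Ref(Server owner, int key) { this.owner = owner; this.key = key; }
    }

    static final class Server {
        private final Map<Integer, int[]> store = new HashMap<>();
        private int nextKey = 0;
        int fetchesServed = 0;   // how many times another party pulled data from this server

        /** First stage: computes a large result, keeps it locally, returns only a reference. */
        Ref evalStage1(int[] input) {
            int[] a = new int[input.length];
            for (int i = 0; i < input.length; i++) a[i] = input[i] * 2;
            store.put(nextKey, a);
            return new Ref(this, nextKey++);
        }

        /** Second stage: binds the reference lazily by asking the owning server directly. */
        long evalStage2(Ref ref) {
            int[] a = ref.owner.get(ref.key);
            long sum = 0;
            for (int v : a) sum += v;
            return sum;
        }

        int[] get(int key) {
            fetchesServed++;
            return store.get(key);
        }
    }
}
```

In the standard scheme the large value would travel server-to-client and then client-to-server; here it makes one server-to-server hop, and the client only forwards the small reference.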

Load balancing

[Figure: the Lithium Task Pool Scheduler feeds Control Threads 1..n, each paired with Lithium Server 1..n; labels: "dispatch new task", "wait ...", "6 threads active", "4 threads active", "5 threads active"; legend: idle thread, active thread]

The control threads count and limit the number of tasks assigned to each server. Moreover, the client uses statistics to estimate server power and load
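One way to combine a per-server task limit with service-time statistics is sketched below. The policy is an assumption made for illustration (exponential moving average of service time, pick the non-saturated server with the lowest estimated cost); the slides do not specify Lithium's actual heuristic, and `ServerStats` and `Balancer` are hypothetical names.

```java
import java.util.List;

final class ServerStats {
    final String name;
    final int limit;                       // maximum number of tasks outstanding on this server
    private int active = 0;
    private int completed = 0;
    private double avgServiceTime = 0.0;   // seconds per task, exponential moving average

    ServerStats(String name, int limit) {
        this.name = name;
        this.limit = limit;
    }

    boolean canAccept() { return active < limit; }

    void onDispatch() { active++; }

    /** Records a completion and folds its service time into the running average. */
    void onCompletion(double seconds) {
        active--;
        avgServiceTime = (completed == 0) ? seconds : 0.8 * avgServiceTime + 0.2 * seconds;
        completed++;
    }

    /** Rough estimate of when a newly dispatched task would finish on this server. */
    double estimatedCost() { return (active + 1) * avgServiceTime; }
}

final class Balancer {
    /** Returns the cheapest server that is below its limit, or null if all must wait. */
    static ServerStats pick(List<ServerStats> servers) {
        ServerStats best = null;
        for (ServerStats s : servers) {
            if (!s.canAccept()) continue;
            if (best == null || s.estimatedCost() < best.estimatedCost()) best = s;
        }
        return best;
    }
}
```

A server with no completed tasks has an estimated cost of zero, so it is probed first. A server that becomes overloaded reports longer service times and is chosen less often.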

Experiments

Three different scenarios:

dedicated cluster

client + SMP server

a Grid-like environment

Interesting results

Grid: useful metrics: problems & proposal

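[Figure, the three testbeds: (1) RLX blade - 24 P4@800MHz; (2) Client (Muenster) connected through the internet to Server SMP (Berlin); (3) Grid-like environment: di.unipi.it (Pisa) and isti.cnr.it (Ghezzano) connected by the Italian backbone (ATM), with Eth100, 802.11b and 802.11g local links]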

1st scenario

RLX blade, 24 PEs

processors have uniform power (task scheduling)

low latency - high bandwidth net

Optimizations significantly improve speedup

[Figure: Speedup (0 to 18) vs Processing Elements (1 to 16); series: Standard Lithium, Optimized Lithium, Ideal (zero communication costs)]

2nd scenario

Client - Cluster of servers

High latency - low bandwidth between client and server set

Low latency - high bandwidth among servers

Very good results; indeed, the optimizations mainly:

increase the client-server message rate, while reducing message size

introduce server-server messages

[Figure: Client connected to Servers through the internet]

[Figure: Time (Secs) (0 to 300) vs Number of Servers (1 to 4); series: Standard Lithium, Optimized Lithium]

3rd scenario

[Figure: di.unipi.it (Pisa) and isti.cnr.it (Ghezzano) connected by the Italian backbone (ATM); local links: Eth100, 802.11b, 802.11g]

Boxes have different powers (46:1 max ratio)

Net performance: two firewalls; ATM, Eth100, WiFi 11/54

Operating systems: Linux, MacOSX, Windows

HW architecture: single CPU and SMP; P2, P3, P4, HTP4, G4, G5

If it isn’t a Grid, it looks very much like one

BogoPower

Models machine power as throughput (tasks/sec) on a single PE

neglects network performance

what does scalability mean in such a scenario?

another metric is needed
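Since BogoPower is a throughput in tasks/sec, one plausible reading of an "ideal (zero communication costs)" baseline is the task count divided by the sum of the machines' BogoPower. That formula is an assumption made here for illustration, not something the slides state, and the numbers in the usage note are invented.

```java
final class BogoPower {
    /** Aggregate BogoPower: the sum of per-machine throughputs, in tasks/sec. */
    static double aggregate(double[] tasksPerSec) {
        double sum = 0.0;
        for (double p : tasksPerSec) sum += p;
        return sum;
    }

    /**
     * Assumed ideal completion time in seconds: every machine runs at its
     * measured throughput and communication costs nothing.
     */
    static double idealCompletionTime(int nTasks, double[] tasksPerSec) {
        double total = aggregate(tasksPerSec);
        if (total <= 0.0) throw new IllegalArgumentException("aggregate power must be positive");
        return nTasks / total;
    }
}
```

For example, three hypothetical machines rated 0.5, 0.25 and 0.25 tasks/sec have an aggregate BogoPower of 1.0, so 100 tasks would ideally take 100 seconds. Because the metric neglects the network, measured times can only be compared against this bound, not predicted by it.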

[Figure: bar chart of BogoPower (0 to 0.700) per machine: P2@233MHz, P3@1.1GHz, G4@800MHz, G4@867MHz, P4@1.7GHz, P4@2.8GHz, 2xP3@550MHz, 2xP4@800MHz, 2xG5@2GHz, 4xP4@2.8GHz; labels: WiFi “b”, WiFi “g”]

Standard vs Future Lithium

[Figure: Completion time (Secs) (0 to 1200) vs Aggregate BogoPower (0 to 3.5); series: Standard Lithium, Optimized Lithium, Ideal (zero communication costs); annotations: "WiFi-b turned in", "G4@867MHz (Air)"]

In depth: Load balancing

Effect of the smart load balancing

The most powerful server was externally overloaded from time 10 to 70

[Figure: Service time/task (0 to 12) and N. active threads (0 to 14) vs Wall Clock Time (Secs) (0 to 200); annotation: Total number of tasks assigned]

Conclusions

Skeletons pay back: the code is very compact, and the programmer is not required to handle possible differences between Future-based RMI and standard RMI semantics

Optimizations are very effective in the first two scenarios and interesting in the third: changing the order in which power is added to the “Grid” changes performance, scheduling is much more difficult, and more investigation is needed. Good metrics are needed.

in the next evolution (already working), resources may be added dynamically (JXTA-based)

How we coped with firewall issues

Potsdamer Platz, 1987

We didn’t ... Thank you ... Questions?

Potsdamer Platz, 1990

http://java.sun.com/j2se/1.4.2/docs/guide/rmi/spec/rmi-arch6.html

3.5 RMI Through Firewalls ...

Calls transmitted via HTTP requests are at least an order of magnitude slower than those sent through direct sockets, without taking proxy forwarding delays into consideration.
