-
CMPP 2004 - Stirling - Scotland
Optimization Techniques for Implementing Parallel Skeletons in Grid Environments
Marco Aldinucci, ISTI-CNR, Pisa, Italy
Marco Danelutto, CS dept. Uni. Pisa, Italy
Jan Dünnweber, CS dept. Uni. Münster, Germany
-
Outline
Lithium
a pure Java skeletal framework exploiting MacroDataFlow
Future-based RMI
Future-based Lithium
Task lookahead, server-to-server lazy binding, load balancing
Definitely-not-Grid, quasi-Grid & maybe-Grid experiments
Measuring performance How-To & discussion
-
Research roadmap
Lithium
    a skeletal Java framework
    Uni. Pisa, Italy, 2001
    FGCS 19(5), 2003
Future-based RMI
    asynchronous Java-RMI
    Uni. Münster (Berlin), Germany
    Euro-Par 2003, LNCS 2790
Future-based Lithium
    implementation & validation
    results analysis
    CMPP'04
-
Lithium
A pure Java framework
providing skeletons as classes
seq, pipe, farm, map, reduce, D&C, while, ..
all can be nested in any order
all accept a stream of input tasks and deliver a stream of output tasks
ad-hoc parallelism via Java primitives (RMI, sockets, threads, ...)
in adherence to Cole’s manifesto
e.g. skeletons are natively stateless; however, the programmer may define a shared state by creating a storage server via RMI
-
Coding example: a simple pipe
public PipeTest( ) {
    pipe = new Pipeline( );
    pipe.addStage(new Blur( ));
    pipe.addStage(new Oil( ));
}

public static void main(String[ ] args) {
    PipeTest test = new PipeTest( );
    Ske lithium = new Ske( );
    lithium.addHosts(args);            // define running environment
    lithium.skelProgram(test.pipe);
    lithium.setupTaskPool( ...);       // set up input stream
    lithium.parDo( );                  // start parallel execution
    while (!lithium.isResEmpty( )) {
        // extract results
    }
}
-
Set up input stream
[Diagram: TaskPool (input) -> blur( ) -> oil( ) -> TaskPool (output)]
-
apply the first stage: blur( )
[Diagram: TaskPool (input) -> blur( ) -> oil( ) -> TaskPool (output)]
-
apply the second stage: oil( )
[Diagram: TaskPool (input) -> blur( ) -> oil( ) -> TaskPool (output)]
-
From skeletons to MacroDataFlow
public StageMap(JSkeletons theSke ) {
    map = new Map(theSke);
    map.setParamRule(BLOCK);
    map.setOutRule(BLOCK);
}
public PipeMapMap( ) {
    StageMap stage1 = new StageMap(new Mandel());
    StageMap stage2 = new StageMap(new Filter());
    pipe = new Pipeline(stage1,stage2);
}
[Diagram: stage1 (Mandel) and stage2 (Filter), translated by the Lithium JIT]
-
MDF-instructions (MDFi) execution
The client is the user program enriched with framework code (e.g. control threads)
The execution of an MDFi may trigger one or more other MDFi
In practice, the MDF is sent once to the servers, then referred to by index
[Diagram: Client (ready queue, waiting-for-tokens queue), three control threads, three LithiumServers]
-
Future-based RMI
a flavor of asynchronous RMI
a remote method invocation that immediately returns a future, i.e. a reference to the return value
once computed, the real value must be linked to its future by issuing future.setValue( )
the real value may be retrieved by any party holding the future by issuing future.getValue( )
getValue( ) blocks the issuing thread until the partner has issued setValue( )
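The setValue( )/getValue( ) protocol above can be sketched locally with a Java monitor. This is only an illustration under assumptions: the class name SimpleFuture and its single-assignment check are hypothetical, and the remote (RMI) side of the real future-based RMI is left out entirely.

```java
// Hypothetical local sketch of the setValue()/getValue() protocol;
// not the actual future-based RMI classes.
final class SimpleFuture<T> {
    private T value;
    private boolean ready = false;

    /** Links the real value to this future and wakes up every waiting thread. */
    public synchronized void setValue(T v) {
        if (ready) {
            throw new IllegalStateException("value already set"); // assumption: set once
        }
        value = v;
        ready = true;
        notifyAll();
    }

    /** Blocks the calling thread until setValue() has been issued. */
    public synchronized T getValue() {
        boolean interrupted = false;
        while (!ready) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true; // remember the interrupt and keep waiting
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
        }
        return value;
    }
}

final class SimpleFutureDemo {
    public static void main(String[] args) {
        SimpleFuture<Integer> ref = new SimpleFuture<>();
        // The "partner" computes the value in another thread.
        new Thread(() -> ref.setValue(6 * 7)).start();
        System.out.println(ref.getValue()); // prints 42
    }
}
```

Any number of threads may call getValue( ) on the same future; all of them are released by the single setValue( ).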
-
Standard RMI vs Future RMI
[Diagram, standard RMI: Client, LithiumServer1, LithiumServer2; messages: A = MDFi_1( a ), A, B = MDFi_2( A ), B]
[Diagram, future RMI: Client, LithiumServer1, LithiumServer2; messages: A = MDFi_1 ( a ), ref ( A ), B = MDFi_2 ( ref ( A ) ), A = get ( ref ( in ) ), A, B]
A = LithiumServer1.evalMDF1(a);
B = LithiumServer2.evalMDF2(A);
-
Future-based RMI: details
getValue( ) and setValue( ) rely on a (non-RMI) proxy class that allows:
invocation as a local method (the partner is unknown)
local caching of values
the future reference implementation includes both the object's hash value and IP address
-
Future-based Lithium
Lithium + Future-based RMI
improves Lithium in three ways:
task lookahead
server-to-server lazy binding
load balancing
-
Task lookahead
[Diagram, without lookahead: timelines of the Lithium Application (Control Thread) and the Lithium Server. The control thread repeats: get task from TP, RMI comm, execute (on the server), RMI comm, add task to TP. The server shows idle time (overhead) between executions.]
[Diagram, with lookahead: same two timelines. The control thread gets tasks from the TP and receives futures; the server runs execute steps back to back; results are added to the TP afterwards.]
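A minimal sketch of the lookahead idea, assuming a single server simulated by a local thread. It uses the standard java.util.concurrent.Future rather than the future-based RMI of the talk, and the class name, the lookahead parameter and the use of plain lists as task pools are all illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical control thread that keeps up to `lookahead` tasks in flight,
// so the server always has a next task queued when it finishes one.
final class LookaheadControlThread {
    static <I, O> List<O> run(List<I> inputPool, Function<I, O> work, int lookahead) {
        if (lookahead < 1) {
            throw new IllegalArgumentException("lookahead must be >= 1");
        }
        // A single worker thread stands in for one remote server.
        ExecutorService server = Executors.newSingleThreadExecutor();
        Deque<Future<O>> inFlight = new ArrayDeque<>();
        List<O> outputPool = new ArrayList<>();
        try {
            for (I task : inputPool) {
                if (inFlight.size() == lookahead) {
                    // Window is full: collect the oldest result first.
                    outputPool.add(await(inFlight.removeFirst()));
                }
                Callable<O> job = () -> work.apply(task);
                inFlight.addLast(server.submit(job)); // returns a future immediately
            }
            while (!inFlight.isEmpty()) {
                outputPool.add(await(inFlight.removeFirst()));
            }
        } finally {
            server.shutdown();
        }
        return outputPool;
    }

    private static <O> O await(Future<O> f) {
        try {
            return f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

With lookahead set to 1 the loop degenerates to the strictly alternating behaviour of the first timeline; larger values let submission overlap with execution.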
-
Server-to-server lazy binding
[Diagram: Client, LithiumServer1 and LithiumServer2, with threads Thr1, Thr2, Thr3]
Servers communicate directly with each other; the number of messages carrying large data is halved
a frequent pattern in Lithium (it occurs for every skeleton except the farm)
data may be locally cached (no communications)
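The pattern can be sketched with two local worker threads standing in for the two servers. This is an assumption-laden illustration: it uses the standard CompletableFuture instead of future-based RMI, and mdf1 and mdf2 are placeholder computations. The point is that the client only hands a reference to the second server and never touches the intermediate value itself.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: the intermediate value A flows from server1 to server2
// through a future, bypassing the client.
final class LazyBindingSketch {
    static long mdf1(long a) { return a + 1; } // placeholder first MDFi
    static long mdf2(long x) { return x * 2; } // placeholder second MDFi

    static long run(long a) {
        ExecutorService server1 = Executors.newSingleThreadExecutor();
        ExecutorService server2 = Executors.newSingleThreadExecutor();
        try {
            // Client -> server1: the call returns at once with a reference to A.
            CompletableFuture<Long> refA =
                    CompletableFuture.supplyAsync(() -> mdf1(a), server1);
            // Client -> server2: only the reference is passed on;
            // server2 binds A lazily, when it actually needs the value.
            CompletableFuture<Long> refB =
                    CompletableFuture.supplyAsync(() -> mdf2(refA.join()), server2);
            // The client waits for the final result B only.
            return refB.join();
        } finally {
            server1.shutdown();
            server2.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(20)); // prints 42
    }
}
```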
-
Load balancing
[Diagram: the Lithium Task Pool Scheduler feeds Control Thread 1 ... Control Thread n, attached to Lithium Server 1 ... Lithium Server n. Labels: dispatch new task, wait ..., 4 threads active, 5 threads active, 6 threads active. Legend: idle thread, active thread.]
The control threads count and limit the number of tasks assigned to each server. Moreover, the clients use statistics to estimate server power and load
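A minimal sketch of the per-server bookkeeping this implies, assuming a fixed cap on outstanding tasks and the mean service time as the only statistic. The class name, the cap and the choice of statistic are hypothetical, not taken from Lithium.

```java
// Hypothetical per-server bookkeeping kept on the client side.
final class ServerLoad {
    private final int maxActive;           // assumed cap on tasks outstanding at this server
    private int active = 0;                // tasks dispatched but not yet completed
    private long completed = 0;
    private double totalServiceSecs = 0.0;

    ServerLoad(int maxActive) {
        if (maxActive < 1) {
            throw new IllegalArgumentException("maxActive must be >= 1");
        }
        this.maxActive = maxActive;
    }

    /** Reserves a slot for a new task; returns false if the server is at its cap. */
    synchronized boolean tryDispatch() {
        if (active >= maxActive) {
            return false; // caller should wait or try another server
        }
        active++;
        return true;
    }

    /** Records a completed task and its observed service time. */
    synchronized void taskCompleted(double serviceSecs) {
        if (active == 0) {
            throw new IllegalStateException("no task outstanding");
        }
        active--;
        completed++;
        totalServiceSecs += serviceSecs;
    }

    /** Mean service time per task so far; 0.0 before the first completion. */
    synchronized double meanServiceSecs() {
        return completed == 0 ? 0.0 : totalServiceSecs / completed;
    }

    synchronized int activeTasks() {
        return active;
    }
}
```

A scheduler could compare meanServiceSecs( ) across servers to decide where to send the next task; that policy is deliberately left out here.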
-
Experiments
Three different scenarios:
dedicated cluster
client + SMP server
a Grid-like environment
Interesting results
Grid: useful metrics: problems & proposal
[Diagram of the three testbeds: RLX blade, 24 P4@800MHz; Client (Muenster) and Server SMP (Berlin) connected over the internet; Grid-like environment with di.unipi.it (Pisa) and isti.cnr.it (Ghezzano), linked via Eth100, 802.11b, 802.11g and the Italian backbone (ATM)]
-
1st scenario
RLX blade, 24 PEs
processors have uniform power (task scheduling)
low latency - high bandwidth net
Optimizations significantly improve speedup
[Plot: Speedup vs. Processing Elements; series: Standard Lithium, Optimized Lithium, Ideal (zero communication costs)]
-
2nd scenario
Client - Cluster of servers
High latency - low bandwidth between client and server set
Low latency - high bandwidth among servers
Very good results; indeed, the optimizations mainly:
Increase the client-server message rate, while reducing message size
Introduce server-server messages
[Diagram: Client connected to Servers over the internet]
[Plot: Time (Secs) vs. Number of Servers (1 to 4); series: Standard Lithium, Optimized Lithium]
-
3rd scenario
[Diagram: di.unipi.it (Pisa) and isti.cnr.it (Ghezzano), linked via Eth100, 802.11b, 802.11g and the Italian backbone (ATM)]
Boxes have different powers (46:1 max ratio)
Net performance: two firewalls; ATM, Eth100, WiFi 11/54
Operating systems: Linux, MacOSX, Windows
HW architecture: single CPU and SMP; P2, P3, P4, HTP4, G4, G5
If it isn't a Grid, it looks very much like one
-
BogoPower
Models machine power (tasks/sec) on a single PE
neglects net performance
what does scalability mean in such a scenario?
another metric is needed
[Bar chart: BogoPower (scale 0 to 0.700) of P2@233MHz, P3@1.1GHz, G4@800MHz, G4@867MHz, P4@1.7GHz, P4@2.8GHz, 2xP3@550MHz, 2xP4@800MHz, 2xG5@2GHz, 4xP4@2.8GHz; WiFi "b" and WiFi "g" marked]
-
Standard vs Future Lithium
[Plot: Completion time (Secs) vs. Aggregate BogoPower; series: Standard Lithium, Optimized Lithium, Ideal (zero communication costs); annotation: WiFi-b turned in]
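One plausible reading of the two quantities on this plot, stated as assumptions rather than as the authors' definitions: the aggregate BogoPower is the sum of the per-machine values (tasks/sec), and the ideal completion time with zero communication costs is the number of tasks divided by that aggregate. The class and method names are hypothetical.

```java
// Hypothetical model of aggregate BogoPower and the ideal completion time.
final class BogoPowerModel {
    /** Assumption: aggregate power is the plain sum of per-machine tasks/sec. */
    static double aggregate(double[] tasksPerSec) {
        double sum = 0.0;
        for (double p : tasksPerSec) {
            if (p < 0.0) {
                throw new IllegalArgumentException("negative power");
            }
            sum += p;
        }
        return sum;
    }

    /** Assumption: with zero communication costs, time = tasks / aggregate power. */
    static double idealCompletionSecs(int tasks, double aggregateBogoPower) {
        if (aggregateBogoPower <= 0.0) {
            throw new IllegalArgumentException("aggregate power must be positive");
        }
        return tasks / aggregateBogoPower;
    }
}
```

Under these assumptions, a pool of 600 tasks on machines totalling 1.5 tasks/sec would ideally complete in 400 seconds; the gap between that figure and a measured time is what the communication overhead costs.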
-
G4@867MHz(Air)[email protected]
[email protected] servers
Effect of the smart load balancing
Total number of tasks assigned
The most powerful server was externally overloaded from time 10 to 70
-
In depth: Load balancing
[Plot: Service time/task (left axis, 0 to 12) and N. active threads (right axis, 0 to 14) vs. Wall Clock Time (Secs), 0 to 200]
-
Conclusions
Skeletons pay back: very compact code; the programmer is not required to handle possible differences between Fut-RMI and RMI semantics
Optimizations are very effective in the first two scenarios, interesting in the third
changing the order in which power is added to the "Grid" changes performance; scheduling is much more difficult; more investigation is needed. Good metrics are needed.
in the next evolution (already working) resources may be dynamically added (JXTA-based)
-
How we coped with firewall issues
Potsdamer Platz, 1987
-
We didn’t ... Thank you ... Questions?
Potsdamer Platz, 1990
http://java.sun.com/j2se/1.4.2/docs/guide/rmi/spec/rmi-arch6.html
3.5 RMI Through Firewalls ...
Calls transmitted via HTTP requests are at least an order of magnitude slower that those sent through direct sockets, without taking proxy forwarding delays into consideration.