CMPP 2004 - Stirling - Scotland Optimization Techniques for Implementing Parallel Skeletons in Grid Environments Marco Aldinucci, ISTI-CNR, Pisa, Italy Marco Danelutto, CS dept. Uni. Pisa, Italy Jan Dünnweber, CS dept. Uni. Münster, Germany

  • CMPP 2004 - Stirling - Scotland

    Optimization Techniques for Implementing Parallel Skeletons in Grid Environments

    Marco Aldinucci, ISTI-CNR, Pisa, Italy

    Marco Danelutto, CS dept. Uni. Pisa, Italy

    Jan Dünnweber, CS dept. Uni. Münster, Germany

  • Outline

    Lithium

    a pure Java skeletal framework exploiting MacroDataFlow

    Future-based RMI

    Future-based Lithium

    Task lookahead, server-to-server lazy binding, load balancing

    Definitely-not-Grid, quasi-Grid & maybe-Grid experiments

    Measuring performance How-To & discussion

  • Research roadmap

    Lithium: a skeletal Java framework, Uni. Pisa, Italy, 2001, FGCS 19(5), 2003

    Future-based RMI: asynchronous Java-RMI, Uni. Münster (Berlin), Germany, Euro-Par 2003, LNCS 2790

    Future-based Lithium: implementation & validation, results analysis, CMPP’04

  • Lithium

    A pure Java framework

    providing skeletons as classes

    seq, pipe, farm, map, reduce, D&C, while, ...

    all can be nested in any order

    all accept a stream of input tasks and deliver a stream of output tasks

    ad-hoc parallelism via Java primitives (RMI, sockets, threads, ...)

    in adherence to Cole’s manifesto

    e.g. skeletons are natively stateless; however, the programmer may define a shared state by creating a storage server via RMI

  • Coding example: a simple pipe

    public PipeTest() {
        pipe = new Pipeline();
        pipe.addStage(new Blur());
        pipe.addStage(new Oil());
    }

    public static void main(String[] args) {
        PipeTest test = new PipeTest();
        Ske lithium = new Ske();
        lithium.addHosts(args);           // define running environment
        lithium.skelProgram(test.pipe);
        lithium.setupTaskPool( ...);      // set up input stream
        lithium.parDo();                  // start parallel execution
        while (!lithium.isResEmpty()) {
            // extract results
        }
    }

  • Set up input stream

    [Figure: TaskPool (input) → blur( ) → oil( ) → TaskPool (output)]

  • Apply the first stage: blur( )

    [Figure: TaskPool (input) → blur( ) → oil( ) → TaskPool (output)]

  • Apply the second stage: oil( )

    [Figure: TaskPool (input) → blur( ) → oil( ) → TaskPool (output)]
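
    The three steps above (fill the input pool, apply blur, apply oil) can be mimicked sequentially in plain Java. This is only a minimal sketch and does not use the Lithium API: `PipeSketch`, `BLUR`, `OIL` and `runPipe` are hypothetical names, and the two stages merely tag the task name instead of filtering an image.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class PipeSketch {
    // Hypothetical stand-ins for the Blur and Oil stages; they only tag the task name.
    static final Function<String, String> BLUR = img -> "blur(" + img + ")";
    static final Function<String, String> OIL = img -> "oil(" + img + ")";

    /** Pushes every task of the input pool through blur, then oil, preserving order. */
    static List<String> runPipe(List<String> inputPool) {
        Function<String, String> pipe = BLUR.andThen(OIL);
        return inputPool.stream().map(pipe).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [oil(blur(img1)), oil(blur(img2))]
        System.out.println(runPipe(List.of("img1", "img2")));
    }
}
```

    The sketch runs in one thread; in Lithium the stages would be evaluated in parallel on the remote servers.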

  • From skeletons to MacroDataFlow

    public StageMap(JSkeletons theSke) {
        map = new Map(theSke);
        map.setParamRule(BLOCK);
        map.setOutRule(BLOCK);
    }

    public PipeMapMap() {
        StageMap stage1 = new StageMap(new Mandel());
        StageMap stage2 = new StageMap(new Filter());
        pipe = new Pipeline(stage1, stage2);
    }

    [Figure: stage1 (Mandel) and stage2 (Filter) translated by the Lithium JIT into a MacroDataFlow graph]

  • MDF-instructions (MDFi) execution

    The client is the user program enriched with framework code (e.g. control threads)

    The execution of an MDFi may trigger one or more other MDFi

    The MDF graph is actually sent to the servers only once, and is then referred to by index

    [Figure: the Client with a ready queue, a waiting-for-tokens queue and several control threads, each connected to a LithiumServer]
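
    The firing rule described on this slide can be illustrated with a tiny single-threaded interpreter. It is a sketch under strong assumptions, not Lithium's implementation: `MdfSketch`, `Instr`, `token` and `run` are hypothetical names, every instruction takes exactly two operands, and the waiting-for-tokens queue is implicit (instructions whose `missing` counter is still positive).

```java
import java.util.ArrayDeque;
import java.util.function.IntBinaryOperator;

final class MdfSketch {
    /** One hypothetical two-operand MDF instruction; dest < 0 means "program output". */
    static final class Instr {
        final IntBinaryOperator op;
        final int dest, destSlot;
        final int[] operands = new int[2];
        int missing = 2; // tokens still awaited

        Instr(IntBinaryOperator op, int dest, int destSlot) {
            this.op = op; this.dest = dest; this.destSlot = destSlot;
        }
    }

    private final Instr[] graph;
    private final ArrayDeque<Instr> ready = new ArrayDeque<>();
    private int output;

    MdfSketch(Instr[] graph) { this.graph = graph; }

    /** Delivers a token; an instruction holding all its tokens moves to the ready queue. */
    void token(int instr, int slot, int value) {
        Instr i = graph[instr];
        i.operands[slot] = value;
        if (--i.missing == 0) ready.add(i);
    }

    /** Fires ready instructions until none is left; each firing may enable others. */
    int run() {
        while (!ready.isEmpty()) {
            Instr i = ready.remove();
            int result = i.op.applyAsInt(i.operands[0], i.operands[1]);
            if (i.dest < 0) output = result; else token(i.dest, i.destSlot, result);
        }
        return output;
    }

    /** Evaluates (a + b) * (a - b) as a three-instruction graph. */
    static int demo(int a, int b) {
        MdfSketch mdf = new MdfSketch(new Instr[] {
            new Instr((x, y) -> x + y, 2, 0),
            new Instr((x, y) -> x - y, 2, 1),
            new Instr((x, y) -> x * y, -1, 0)
        });
        mdf.token(0, 0, a); mdf.token(0, 1, b);
        mdf.token(1, 0, a); mdf.token(1, 1, b);
        return mdf.run();
    }

    public static void main(String[] args) {
        System.out.println(demo(5, 3)); // prints 16
    }
}
```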

  • Future-based RMI

    a flavor of asynchronous RMI

    a remote method invocation that immediately returns a future, i.e. a reference to the return value

    once computed, the real value must be linked to its future by issuing future.setValue( )

    the real value may be retrieved by anybody holding the future by issuing future.getValue( )

    getValue( ) forces the issuing thread to wait for a setValue( ) issued by the partner
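
    The setValue( )/getValue( ) contract can be sketched locally as follows. This is an illustration only and not the future-based RMI class: `SimpleFuture` and `demo` are hypothetical names, and everything runs inside one JVM with a plain thread playing the remote partner.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical local future; not the actual future-based RMI implementation. */
final class SimpleFuture<T> {
    private final CountDownLatch ready = new CountDownLatch(1);
    private volatile T value;

    /** Links the computed value to this future and wakes up waiting readers. */
    void setValue(T v) {
        value = v;
        ready.countDown();
    }

    /** Blocks the calling thread until setValue() has been issued. */
    T getValue() {
        boolean interrupted = false;
        while (true) {
            try {
                ready.await();
                break;
            } catch (InterruptedException e) {
                interrupted = true; // keep waiting, restore the flag afterwards
            }
        }
        if (interrupted) Thread.currentThread().interrupt();
        return value;
    }

    /** A "partner" thread computes the value while the caller holds only the future. */
    static int demo() {
        SimpleFuture<Integer> future = new SimpleFuture<>();
        Thread partner = new Thread(() -> future.setValue(6 * 7));
        partner.start();
        return future.getValue(); // waits for the partner's setValue()
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```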

  • Standard RMI vs Future RMI

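    [Figure, standard RMI: the Client sends A = MDFi_1( a ) to LithiumServer1 and receives A, then sends B = MDFi_2( A ) to LithiumServer2 and receives B]

    [Figure, future RMI: the Client sends A = MDFi_1 ( a ) to LithiumServer1 and immediately gets back ref ( A ); it then sends B = MDFi_2 ( ref ( A ) ) to LithiumServer2, which obtains A from LithiumServer1 through A = get ( ref ( in ) ) and returns B to the Client]

    A = LithiumServer1.evalMDF1(a);
    B = LithiumServer2.evalMDF2(A);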

  • Future-based RMI: details

    getValue( ) and setValue( ) rely on a (non-RMI) proxy class that allows:

    invocation as a local method (the partner is unknown)

    local caching of values

    the future reference implementation includes both the object’s hash value and the IP address

  • Future Based Lithium

    Lithium + Future-based RMI

    improves Lithium in three ways:

    task lookahead

    server-to-server lazy binding

    load balancing

  • Task lookahead

    [Figure: two timelines of the Lithium Application (Control Thread) and the Lithium Server. Standard Lithium: get task from TP, RMI comm, execute, RMI comm, add task to TP, and so on; the server shows idle time (overhead) between two executions. Future-based Lithium: the control thread receives a future, so get task from TP and add task to TP overlap with the server's execute phases, which follow one another.]
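
    A control thread with task lookahead can be sketched as below. The sketch assumes a local `ExecutorService` as a stand-in for a remote Lithium server and `java.util.concurrent.Future` in place of the RMI futures; `LookaheadSketch`, `run` and the `lookahead` parameter are hypothetical names, not part of Lithium.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class LookaheadSketch {
    /** Runs every task on one "server", keeping up to `lookahead` requests outstanding. */
    static List<Integer> run(List<Integer> taskPool, Function<Integer, Integer> work, int lookahead) {
        if (lookahead < 1) throw new IllegalArgumentException("lookahead must be >= 1");
        ExecutorService server = Executors.newSingleThreadExecutor(); // stand-in for a remote server
        ArrayDeque<Future<Integer>> inFlight = new ArrayDeque<>();
        List<Integer> results = new ArrayList<>();
        try {
            for (Integer task : taskPool) {
                if (inFlight.size() == lookahead) {
                    // Collect the oldest result; blocks only if it is not ready yet.
                    results.add(inFlight.removeFirst().get());
                }
                // submit() returns a future at once, so the next task is already
                // queued on the server while the previous one is still executing.
                inFlight.addLast(server.submit(() -> work.apply(task)));
            }
            while (!inFlight.isEmpty()) {
                results.add(inFlight.removeFirst().get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            server.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1, 2, 3, 4), x -> x * x, 2)); // prints [1, 4, 9, 16]
    }
}
```

    With `lookahead` set to 1 the loop degenerates to the request/wait cycle of the first timeline.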

  • Server-to-server lazy binding

    [Figure: Client, LithiumServer1 and LithiumServer2, with threads Thr1, Thr2, Thr3]

    Servers communicate directly with one another, so the number of messages carrying large data is halved

    a frequent pattern in Lithium (it occurs for every skeleton except the farm)

    data may be locally cached (no communication)
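
    The idea of forwarding a reference instead of the data can be sketched with `CompletableFuture`. This is an in-process illustration under stated assumptions: two single-thread executors play LithiumServer1 and LithiumServer2, no network is involved, and `LazyBindingSketch`, `run` and the arithmetic in the two stages are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class LazyBindingSketch {
    /** Computes B = stage2(stage1(a)) without the intermediate value A passing through the client. */
    static int run(int a) {
        ExecutorService server1 = Executors.newSingleThreadExecutor();
        ExecutorService server2 = Executors.newSingleThreadExecutor();
        try {
            // The client gets back a reference to A immediately, not A itself.
            CompletableFuture<Integer> refA = CompletableFuture.supplyAsync(() -> a + 1, server1);
            // The client forwards only the reference; "server2" binds it to the
            // real value when it needs it, by asking for it directly.
            CompletableFuture<Integer> refB =
                CompletableFuture.supplyAsync(() -> refA.join() * 2, server2);
            return refB.join();
        } finally {
            server1.shutdown();
            server2.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(20)); // prints 42
    }
}
```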

  • Load balancing

    [Figure: Control Threads 1 to n take tasks from the Lithium Task Pool Scheduler and feed Lithium Servers 1 to n. Depending on the number of active threads on its server (6, 4 and 5 in the example), a control thread either dispatches a new task or waits. Legend: idle thread, active thread.]

    The control threads count and limit the number of tasks assigned to each server. Moreover, the clients use statistics to estimate server power and load
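
    The count-and-limit policy can be sketched as a small bookkeeping class. The selection rule used here (pick the least-loaded server below a fixed cap) is an assumption made for illustration; the statistics on server power and load mentioned above are not modelled, and `LoadBalancerSketch`, `acquire` and `release` are hypothetical names.

```java
final class LoadBalancerSketch {
    private final int[] active; // tasks currently assigned to each server
    private final int limit;    // assumed per-server cap

    LoadBalancerSketch(int servers, int limit) {
        this.active = new int[servers];
        this.limit = limit;
    }

    /** Returns the least-loaded server below the cap, or -1 if the caller must wait. */
    synchronized int acquire() {
        int best = -1;
        for (int s = 0; s < active.length; s++) {
            if (active[s] < limit && (best == -1 || active[s] < active[best])) {
                best = s;
            }
        }
        if (best >= 0) active[best]++;
        return best;
    }

    /** Called when a task assigned to `server` has completed. */
    synchronized void release(int server) {
        active[server]--;
    }

    /** Two servers, cap of one task each: the third request has to wait. */
    static int[] demo() {
        LoadBalancerSketch lb = new LoadBalancerSketch(2, 1);
        int first = lb.acquire();  // server 0
        int second = lb.acquire(); // server 1
        int third = lb.acquire();  // -1: both servers are at the cap
        lb.release(0);
        int fourth = lb.acquire(); // server 0 again
        return new int[] {first, second, third, fourth};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo())); // prints [0, 1, -1, 0]
    }
}
```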

  • Experiments

    Three different scenarios:

    dedicated cluster

    client + SMP server

    a Grid-like environment

    Interesting results

    Grid: useful metrics: problems & proposal

    [Figure: the three testbeds. RLX blade, 24 P4@800MHz. Client (Muenster) and Server SMP (Berlin) connected via internet. Grid-like environment: di.unipi.it (Pisa) and isti.cnr.it (Ghezzano), linked by Eth100, 802.11b, 802.11g and the Italian backbone (ATM).]

  • 1st scenario

    RLX blade, 24 PEs

    processors have uniform power (task scheduling)

    low latency - high bandwidth net

    Optimizations significantly improve speedup

    [Figure: Speedup vs. Processing Elements (1 to 16) for Standard Lithium, Optimized Lithium and Ideal (zero communication costs)]

  • 2nd scenario

    Client - Cluster of servers

    High latency - low bandwidth between client and server set

    Low latency - high bandwidth among servers

    Very good results; indeed, the optimizations mainly:

    Increase the client-server message rate while reducing message size

    Introduce server-server messages

    [Figure: Client connected to Servers via internet; Time (Secs), 0 to 300, vs. Number of Servers (1 to 4) for Standard Lithium and Optimized Lithium]

  • 3rd scenario

    [Figure: Grid-like environment: di.unipi.it (Pisa) and isti.cnr.it (Ghezzano), linked by Eth100, 802.11b, 802.11g and the Italian backbone (ATM)]

    Boxes have different powers (46:1 max ratio)

    Net performance: two firewalls; ATM, Eth100, WiFi 11/54

    Operating systems: Linux, MacOSX, Windows

    HW architecture: single CPU and SMP; P2, P3, P4, HTP4, G4, G5

    If it isn’t a Grid, it looks very much like one

  • BogoPower

    Models machine power as (tasks/sec) on a single PE

    neglects net performance

    what does scalability mean in such a scenario?

    another metric is needed

    [Figure: bar chart of BogoPower (0 to 0.700) for P2@233MHz, P3@1.1GHz, G4@800MHz, G4@867MHz, P4@1.7GHz, P4@2.8GHz, 2xP3@550MHz, 2xP4@800MHz, 2xG5@2GHz and 4xP4@2.8GHz; some machines are marked WiFi “b” or WiFi “g”]
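
    One way to read the metric, sketched below, is to sum the per-PE BogoPower of the machines in use and to take the number of tasks divided by that sum as a zero-communication lower bound on completion time. Both the bound and the sample values are assumptions for illustration, not figures from the talk; `BogoPowerSketch`, `aggregate` and `idealSeconds` are hypothetical names.

```java
final class BogoPowerSketch {
    /** Sum of the BogoPower values (tasks/sec) of the processing elements in use. */
    static double aggregate(double[] tasksPerSec) {
        double sum = 0.0;
        for (double p : tasksPerSec) sum += p;
        return sum;
    }

    /** Assumed ideal completion time: all tasks share the aggregate power, zero communication cost. */
    static double idealSeconds(int tasks, double[] tasksPerSec) {
        return tasks / aggregate(tasksPerSec);
    }

    public static void main(String[] args) {
        double[] grid = {0.5, 0.25, 0.25}; // hypothetical BogoPower values
        System.out.println(aggregate(grid) + " tasks/sec, " + idealSeconds(100, grid) + " s");
    }
}
```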

  • Standard vs Future Lithium

    [Figure: Completion time (Secs), 0 to 1200, vs. Aggregate BogoPower (0 to 3.5) for Standard Lithium, Optimized Lithium and Ideal (zero communication costs); annotation: WiFi-b turned in]

  • Effect of the smart load balancing

    [Figure: Total number of tasks assigned, per server: G4@867MHz (Air), [email protected], [email protected]]

    The most powerful server has been externally overloaded from time 10 to 70

  • In depth: Load balancing

    [Figure: Service time/task (0 to 12) and N. active threads (0 to 14) vs. Wall Clock Time (Secs), 0 to 200]

  • Conclusions

    Skeletons pay back: very compact code; the programmer is not required to handle possible differences between Fut-RMI and RMI semantics

    Optimizations are very effective in the first two scenarios and interesting in the third: changing the order in which power is added to the “Grid” changes performance, scheduling is much more difficult, and more investigation is needed. Good metrics are needed.

    In the next evolution (already working), resources may be dynamically added (JXTA-based)

  • How we coped with firewall issues

    Potsdamer Platz, 1987

  • We didn’t ... Thank you ... Questions?

    Potsdamer Platz, 1990

    http://java.sun.com/j2se/1.4.2/docs/guide/rmi/spec/rmi-arch6.html

    3.5 RMI Through Firewalls ...

    Calls transmitted via HTTP requests are at least an order of magnitude slower than those sent through direct sockets, without taking proxy forwarding delays into consideration.