Contributions to the modelisation and optimisation of large scale distributed computing
Cécile Germain-Renaud, LRI and LAL
http://www.lri.fr/~cecile/RAPH
Habilitation à diriger des recherches


TRANSCRIPT

Page 1: Contributions to the modelisation and optimisation of large scale distributed computing

Contributions to the modelisation and optimisation of large scale distributed computing

Cécile Germain-Renaud, LRI and LAL

http://www.lri.fr/~cecile/RAPH

Habilitation à diriger des recherches

Page 2: Contributions to the modelisation and optimisation of large scale distributed computing

HDR 09/07/2005 2

Summary

• Introduction

• A protocol and a model for Global Computing

• Fault tolerant message-passing

• Grid result-checking

• Grid-enabling medical image analysis

• Perspectives

Page 3: Contributions to the modelisation and optimisation of large scale distributed computing


Old ideas

« Programmers at computer A have a blurred photo which they want to put into focus. Their program transmits the photo to computer B, which specializes in computer graphics (…). If B requires specialized computer assistance, it may call on computer C for help »

Page 4: Contributions to the modelisation and optimisation of large scale distributed computing

High performance computing systems

Massively parallel: Blue Gene/L, Tera computer, clusters
• Homogeneous hardware/software
• Internal network
• Static

Massively distributed: TeraGrid, DEISA, desktop grids (SETI@home), OSG, EGEE
• Heterogeneous hardware/software
• Internet
• Autonomic management

Page 5: Contributions to the modelisation and optimisation of large scale distributed computing

High performance computing systems: new issues

Massively parallel: Blue Gene/L, Tera computer, clusters
• Fault-free
• CPU-centric
• Monolithic applications
• Performance is speedup
• Exclusive
• Application-level scheduling
• « Simple » – N1/2, LogP

Massively distributed: TeraGrid, DEISA, desktop grids (SETI@home), OSG, EGEE
• Faults are normal events
• Data-centric
• EP or moldable applications
• Performance is throughput
• Time-shared
• Middleware scheduling
• Very complex

Page 6: Contributions to the modelisation and optimisation of large scale distributed computing


Summary

• Introduction

• A protocol and a model for volatile computing

• Grid result-checking

• Fault tolerant message-passing

• Grid-enabling medical image analysis

• Perspectives

Page 7: Contributions to the modelisation and optimisation of large scale distributed computing

XtremWeb architecture

• Explores the Global Computing modality of grids
• A large scale distributed system
  – Computing resources (collaborators) are
    • volatile: they come and go unexpectedly
    • « at the edge of the internet »: low-level and anonymous
  – Dedicated to high throughput (like Condor)
  – vs applet systems
• Consequences
  – A RISgC: Reduced Infrastructure Software for Grid Computing
  – Not master-slave but a pull model: the collaborators decide when and what
  – A soft-state dispatcher/collaborator protocol: the collaborator state expires if not refreshed
  – Deployed at Paris-Sud University for Auger

[Diagram: a Dispatcher exchanges Request / Reply / Keepalive messages with the Workers (collaborators)]
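The soft-state idea can be sketched in a few lines: the dispatcher holds no hard state about collaborators, only leases that keepalive messages renew. This is a minimal illustrative sketch (the class name, lease policy and method names are assumptions, not XtremWeb's actual interface):

```python
import time

class SoftStateDispatcher:
    """Soft-state tracking of volatile collaborators (illustrative sketch)."""

    def __init__(self, lease=30.0, clock=time.monotonic):
        self.lease = lease      # seconds a collaborator may stay silent
        self.clock = clock      # injectable clock, handy for testing
        self.last_seen = {}     # collaborator id -> time of last refresh

    def keepalive(self, worker_id):
        # Each keepalive message renews the collaborator's lease.
        self.last_seen[worker_id] = self.clock()

    def expire(self):
        # State that is not refreshed simply expires: a crashed or
        # departed collaborator needs no explicit disconnection.
        now = self.clock()
        dead = [w for w, t in self.last_seen.items() if now - t > self.lease]
        for w in dead:
            del self.last_seen[w]
        return dead
```

The pull model fits the same pattern: a collaborator's Request both asks for work and refreshes its state, so the dispatcher never needs an explicit failure-detection protocol.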

Page 8: Contributions to the modelisation and optimisation of large scale distributed computing

A performance model

• Applications are not so much moldable – the unit duration is a tuning knob
• Soft-state introduces overhead – the keepalive period is another
• The fault process on one site is largely unpredictable [Dinda 99]
• But for each task, the fault process is memoryless if the successive execution sites are uncorrelated [Libel et al 02]: a Poisson process of rate λ, where λ is the fault rate
• Expected completion time of a task of unit duration T: E[W] = (e^{λT} – 1)/λ

The system cannot be tuned even for infinite resources
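A quick way to see why tuning is so hard: under the memoryless model the expected completion time grows exponentially in λT, so shrinking the unit duration fights the fault rate but inflates the keepalive overhead. A sketch checking the closed form by direct simulation (function names are illustrative; `lam` stands for the fault rate λ):

```python
import math
import random

def expected_completion(T, lam):
    # Closed form for the expected wall-clock time of a task of length T
    # when faults arrive as a Poisson process of rate lam and each fault
    # restarts the task from scratch: E[W] = (exp(lam*T) - 1) / lam.
    return (math.exp(lam * T) - 1.0) / lam

def simulate_completion(T, lam, runs=20000, seed=1):
    # Monte-Carlo check of the closed form.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        elapsed = 0.0
        while True:
            next_fault = rng.expovariate(lam)  # memoryless inter-fault time
            if next_fault >= T:                # the task finishes first
                elapsed += T
                break
            elapsed += next_fault              # work lost, start over
        total += elapsed
    return total / runs
```

For λT = 1 the closed form gives e − 1 ≈ 1.72 T, and the penalty explodes for longer units.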


Page 11: Contributions to the modelisation and optimisation of large scale distributed computing

The Auger Observatory

• Detection of ultra-high energy cosmic rays
• UHECRs create particle showers
• Indirect observation
  – Ground detectors
  – Fluorescence telescopes
• In silico experiments
  – Shower simulations – CCIN2P3 – XtremWeb

200 physicists, 55 institutions, 15 countries, 3000 km2, 1600 tanks, 30-year lifetime…

Page 12: Contributions to the modelisation and optimisation of large scale distributed computing

Publications

– F. Cappello, A. Djilali, G. Fedak, C. Germain, O. Lodygensky and V. Neri. Calcul réparti à grande échelle, chapter XtremWeb : une plateforme de recherche sur le calcul global et pair à pair, pages 153-186, Lavoisier, 2002.
– G. Fedak, C. Germain, V. Neri and F. Cappello. XtremWeb: A generic global computing platform. In IEEE/ACM CCGRID'2001, pages 582-587, IEEE Press, 2001.
– C. Germain, G. Fedak, V. Neri and F. Cappello. Global Computing Systems. In 3rd Int. Conf. on Large Scale Scientific Computations, LNCS 2179, pages 218-227, Sozopol, 2001. Springer-Verlag.
– C. Germain, V. Néri, G. Fedak and F. Cappello. XtremWeb: Building an experimental platform for global computing. In Procs. 1st IEEE/ACM Intl. Workshop Grid 2000. Springer, 2000.

Page 13: Contributions to the modelisation and optimisation of large scale distributed computing


Summary

• Introduction

• A protocol and a model for Global Computing

• Grid result-checking

• Fault tolerant message-passing

• Grid-enabling medical imaging

• Perspectives - Towards a grid observatory

Page 14: Contributions to the modelisation and optimisation of large scale distributed computing

Why do we need to check grid computations more carefully?

• Attacks are likely on Global Computing systems
  – Low control over collaborators
  – Real-world issue: some happened in all deployed GC systems, even with a unique binary application
    • SETI – wrong FFT; Decrypthon I had 5% errors
    • All double- or triple-checked their results
• Might happen also in grid systems
  – Submission tools and grid workflow management are at an early stage
  – Code version management

Page 15: Contributions to the modelisation and optimisation of large scale distributed computing

Contexts & Related Work

• Prevention
  – Hardware support: TCPA / Palladium
  – Code encryption
• Detection: check that a property holds for an object
  – Result-checking [Blum]: program output, e.g. a sorted array
  – Property testing [Goldreich]: graph properties on a random graph sample
  – En masse checking [Sarmenta FGCS 2002]

Page 16: Contributions to the modelisation and optimisation of large scale distributed computing

Result checking and Grid Applications

• Typical grid use cases: no independent property to check
• Monte-Carlo simulations
  – Range of parameter values × internal randomization
  – Local interactions: the specification is the program
  – Unknown shape -> non-parametric
  – Fault tolerance through robust statistics
• Search for rare events (SETI): not fault tolerance
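Robust statistics let the checker tolerate a minority of wrong results without re-executing everything: a median/MAD rule, for instance, flags results that deviate abnormally from the bulk of the sample. An illustrative sketch only (the threshold k and the rule itself are assumptions, not the papers' exact test):

```python
def mad_outliers(samples, k=3.0):
    """Flag results whose deviation from the median exceeds k times the
    median absolute deviation (MAD). Illustrative robust-statistics sketch."""
    s = sorted(samples)
    n = len(s)
    med = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    devs = sorted(abs(x - med) for x in samples)
    mad = devs[n // 2] if n % 2 else 0.5 * (devs[n // 2 - 1] + devs[n // 2])
    if mad == 0:
        # Degenerate sample: every deviation from the median is suspicious.
        return [x for x in samples if x != med]
    return [x for x in samples if abs(x - med) > k * mad]
```

Unlike a mean/standard-deviation rule, the median and MAD are themselves insensitive to the outliers they are meant to detect.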

Page 17: Contributions to the modelisation and optimisation of large scale distributed computing

Example: shower simulations

Pipeline: Shower Simulation -> Detector Simulation -> Reconstruction
Inputs: Fe 1E20; Fe 1E20; Pr 1.5E20; ’’

• Only the input parameters can be falsified
• But extracting the input parameters from the data is just the problem!

Page 18: Contributions to the modelisation and optimisation of large scale distributed computing

En masse result-checking

• Goals
  – Minimize the checking overhead through adaptive tests. The most likely situations are:
    • Normal: the majority of collaborators are OK
    • Massive attack: the majority of collaborators cheat (or err)
  – Robust to denial-of-service attacks, where the system is unable to assess the quality of its production
  – Efficient for anonymous execution: private networks, IP spoofing
• Results
  – Generic 2-phase test based on Wald's sequential test
  – Improvement for the Auger showers: pre-qualification of showers through empirical detection of outliers
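Wald's sequential test, the basis of the 2-phase test, can be sketched as follows: sampled results are re-executed one by one, and the log-likelihood ratio between the "normal" and "massive attack" hypotheses stops the sampling as early as possible. The error rates and thresholds below are illustrative, not the values used in the papers:

```python
import math

def sprt(outcomes, p0=0.05, p1=0.5, alpha=0.01, beta=0.01):
    """Wald's sequential probability ratio test (illustrative sketch).

    outcomes: booleans, True = a re-executed result disagreed.
    Decide between H0 (batch error rate p0, normal) and H1 (rate p1,
    massive attack) with error probabilities alpha and beta.
    Returns 'accept' (batch OK), 'reject' (corrupted), or 'continue'.
    """
    A = math.log((1 - beta) / alpha)   # upper threshold -> reject H0
    B = math.log(beta / (1 - alpha))   # lower threshold -> accept H0
    llr = 0.0
    for bad in outcomes:
        # Accumulate the log-likelihood ratio of this observation.
        llr += math.log(p1 / p0) if bad else math.log((1 - p1) / (1 - p0))
        if llr >= A:
            return 'reject'
        if llr <= B:
            return 'accept'
    return 'continue'
```

The appeal for en masse checking is exactly the adaptivity stated in the goals: in both likely situations (mostly honest, or massively corrupted) the test terminates after very few re-executions.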

Page 19: Contributions to the modelisation and optimisation of large scale distributed computing

Overview

[Diagram: sample qualification, batch segmentation and sample selection, then re-execution checked against an oracle]

Page 20: Contributions to the modelisation and optimisation of large scale distributed computing

Publications

• C. Germain and D. Monnier-Ragaigne. Grid Result Checking. In Procs. 2nd Computing Frontiers, Ischia, May 2005. ACM Press.
• C. Germain and N. Playez. Result-Checking in Global Computing Systems. In Procs. 17th ACM Int. Conf. on Supercomputing, pages 226-233, San Francisco, June 2003. ACM Press.

Page 21: Contributions to the modelisation and optimisation of large scale distributed computing


Summary

• Introduction

• A protocol and a model for Global Computing

• Grid result-checking

• Fault tolerant message-passing

• Grid-enabling medical imaging

• Perspectives - Towards a grid observatory

Page 22: Contributions to the modelisation and optimisation of large scale distributed computing

Grids create new contexts for message passing

• Message-passing environments, and especially MPI, are the standard for parallel computing, but have been designed with MPPs in mind: fault-free + internal network
• New use cases for message passing
  – Global Computing
    • Loosely coupled computations, very frequent faults
  – Institutional grids
    • Coupled computations, moderately frequent faults
  – Very large clusters
    • Tightly coupled computations, infrequent faults
    • Or frequent: time-sharing, cf. the Connection Machine

[Diagram: existing systems (FT-MPI, MPI-FT, FT/MPI) positioned between three concerns: fault tolerance, tunneling, fault-free performance]

Page 23: Contributions to the modelisation and optimisation of large scale distributed computing

Contributions to MPICH-V

[Architecture diagram: the User Application runs over communication virtualization and dispatch/process virtualization layers, built on a communication library (TCP sockets) and a checkpoint/restart library (Condor, libckpt.a)]

Page 24: Contributions to the modelisation and optimisation of large scale distributed computing

Pessimistic message logging on Channel Memories

• Decoupled communication
  – Dedicated reliable nodes support Channel Memory servers
  – CMs log all messages in FIFO order
  – Conceptually transactional put/get
  – A restarted process transparently replays all communications
• Consistent execution based on partial restarts
• Adaptive to heterogeneous fault behaviour: independent scheduling of process checkpoints
• Tunneling as a byproduct

[Diagram: two MPI processes communicate by put and get through a Channel Memory (CM)]
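The put/get mechanism can be sketched as a FIFO log with per-receiver cursors: because every message is stably logged before delivery, a restarted receiver just rewinds its cursor and replays the channel in the same order. A minimal illustrative sketch, not MPICH-CM's actual interface:

```python
from collections import defaultdict

class ChannelMemory:
    """Pessimistic message logging on a Channel Memory (illustrative)."""

    def __init__(self):
        self.log = defaultdict(list)    # (src, dst) -> every message ever put
        self.cursor = defaultdict(int)  # (src, dst) -> next index dst reads

    def put(self, src, dst, msg):
        # Logged before delivery: the send is stable once put() returns.
        self.log[(src, dst)].append(msg)

    def get(self, src, dst):
        # Delivery consumes the FIFO log through a per-channel cursor.
        i = self.cursor[(src, dst)]
        if i >= len(self.log[(src, dst)]):
            return None                 # would block in a real implementation
        self.cursor[(src, dst)] = i + 1
        return self.log[(src, dst)][i]

    def restart(self, dst):
        # A restarted process replays from the start (or from its last
        # checkpoint); only its cursors are reset, the log is untouched.
        for key in self.cursor:
            if key[1] == dst:
                self.cursor[key] = 0
```

Replay is deterministic because the log order, not the network, defines message delivery order, which is what makes partial restarts consistent.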

Page 25: Contributions to the modelisation and optimisation of large scale distributed computing

The MPICH-CM library

MPICH layering: MPI_Send -> ADI (MPID_SendControl / MPID_SendChannel) -> Channel Interface (Chameleon Interface, PIbsend) -> CM device interface, over blocking TCP read/write of control + data messages.

CM device interface:
• _cmbsend – blocking send
• _cmbrecv – blocking receive
• _cmprobe – check for any message available
• _cmfrom – get the source of the last message
• _cmInit – initialize the client
• _cmFinalize – finalize the client

Page 26: Contributions to the modelisation and optimisation of large scale distributed computing

Communication overhead

[Diagram: each message crosses the network twice (put to the CM, then get by the receiver) – a ×2 overhead, bounded by the CM node bandwidth]

Page 27: Contributions to the modelisation and optimisation of large scale distributed computing

Towards a hybrid approach

• Coupled applications have
  – A hierarchical structure
  – Differentiated requirements
• Asynchronous iterations [Bertsekas] also apply to faults
• A reliable tunneling infrastructure is required anyway
• Requires re-coding even the innermost loop for non-trivial applications, e.g. multi-grid

[Diagram: processes P0–P3 grouped in fault-tolerant clusters, connected across the WAN by self-stabilizing links]

Page 28: Contributions to the modelisation and optimisation of large scale distributed computing

Publications

• A. Selikhov and C. Germain. A channel memory based environment for MPI applications. Future Generation Computer Systems, 21(5):709-715, 2004.
• A. Selikhov, G. Bosilca, C. Germain, G. Fedak and F. Cappello. MPICH-CM: a Communication Library Design for a P2P MPI Implementation. In 9th Euro PVM/MPI Conf., LNCS 2474, pages 323-330, Vienna, Oct. 2002. Springer-Verlag.
• G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Hérault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri and A. Selikhov. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In IEEE/ACM Int. Conf. for High Performance Computing and Communications 2002 (SC'02 – SuperComputing'02), Baltimore, 2002.

Page 29: Contributions to the modelisation and optimisation of large scale distributed computing


Summary

• Introduction

• A protocol and a model for Global Computing

• Grid result-checking

• Fault tolerant message-passing

• Grid-enabling medical image analysis

• Perspectives

Page 30: Contributions to the modelisation and optimisation of large scale distributed computing

Medical image analysis exemplifies the need for…

Id              Owner  Submitted   ST  PRI  Class     Running On
f01n01.10873.0  qzha   5/19 07:34  R   50   fewcpu    f11n07
f01n03.6292.0   agma   5/22 14:50  R   50   standard  f12n02
f01n03.6293.0   publ   5/22 16:16  R   50   standard  f03n09
f01n03.6304.0   agma   5/22 22:46  R   50   standard  f11n05
f01n03.6309.0   agma   5/23 12:41  R   50   standard  f01n11
f01n01.10914.0  ying   5/23 14:17  R   50   fewcpu    f06n03
f01n02.4596.0   dpan   5/23 15:33  I   50   standard
f01n03.6310.0   divi   5/23 16:03  I   50   standard

• Seamless integration of grid resources with local tools: analysis, graphics, …
• Unplanned access to high-end computing power and data
• Interactivity
• But convergence with many other areas

Page 31: Contributions to the modelisation and optimisation of large scale distributed computing

gPTM3D

• Grid-enabling the PTM3D software on a production grid
• PTM3D: poste de travail médical 3D (3D medical workstation)
  – A. Osorio & team – LIMSI. Clinical use: « cum laude », RSNA 2004
  – Complex interface: optimized graphics and medically-oriented interactions
  – Expert interaction is required at and inside all steps
  – But 3D medical data may be very large – 1GB – and computations too

[Diagram: expert interaction spans the whole pipeline – Acquire, Explore, Render, Analyse, Interpret]


Page 33: Contributions to the modelisation and optimisation of large scale distributed computing

EGEE Computing Resources – April 2004

From the project status slides, 1st EGEE review

[Map: countries providing resources and countries anticipating joining EGEE/LCG]

In EGEE-0 (LCG-2): > 130 sites, > 14,000 CPUs, > 5 PB storage
70 leading institutions in 27 countries, federated in regional grids
~32 M Euros EU funding for the first 2 years

Page 34: Contributions to the modelisation and optimisation of large scale distributed computing

Interactive volume reconstruction

• gPTM3D first results
  – Optimal response time for volume reconstruction on EGEE
  – With unmodified interaction scheme
• Demonstrated at the first EGEE review

Page 35: Contributions to the modelisation and optimisation of large scale distributed computing

Figures for volume reconstruction

                      Small body         Medium body         Large body         Lungs
Dataset               87MB               210MB               346MB              87MB
Input data            3MB (18KB/slice)   9.6MB (25KB/slice)  15MB (22KB/slice)  410KB (4KB/slice)
Output data           6MB (106KB/slice)  57MB (151KB/slice)  86MB (131KB/slice) 2.3MB (24KB/slice)
Tasks                 169                378                 676                95
Standalone execution  5mn15s / 1mn54s    33mn / 11mn5s       18mn               36s
EGEE                  37s / 18s          2mn30s / 1mn15s     2mn03              24s

Page 36: Contributions to the modelisation and optimisation of large scale distributed computing

Interactive jobs on a grid: a scheduling problem

• Short Deadline Jobs (SDJ)
  – A moldable application – individual tasks are very fine-grained
  – Soft deadline
  – No reservation: should be executed immediately or rejected
• Sharing contract
  – Bounded slowdown for regular jobs
  – Do not degrade resource utilization
  – No strong preemption
  – Fair share across SDJs
• Contexts
  – (Multi)processor soft real-time scheduling
  – Network routing: Differentiated Services
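The sharing contract can be read as an admission test: an SDJ runs immediately only if it fits in capacity the regular jobs are not using and their slowdown bound is not already at risk; otherwise it is rejected rather than reserved. A deliberately simplified sketch, where the function, its parameters and the thresholds are all illustrative assumptions, not the deployed policy:

```python
def admit_sdj(sdj_demand, idle_fraction, regular_slowdown, max_slowdown=1.1):
    """Illustrative admission test for a Short Deadline Job.

    sdj_demand:       fraction of the cluster the SDJ would occupy
    idle_fraction:    fraction currently unused by regular jobs
    regular_slowdown: observed slowdown of regular jobs (1.0 = none)
    """
    if regular_slowdown > max_slowdown:
        # The bounded-slowdown clause is already tight: reject,
        # since the contract forbids strong preemption or reservation.
        return False
    # Run only on capacity the regular jobs do not use, so that
    # resource utilization is not degraded.
    return sdj_demand <= idle_fraction
```

Immediate accept-or-reject, rather than queueing, is what makes the deadline soft but the response time predictable.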

Page 37: Contributions to the modelisation and optimisation of large scale distributed computing

Scheduling SDJ

[Diagram: User Interfaces submit jobs through Brokers (matchmaking) to a Cluster Scheduler (JSS, CE) and Nodes; SDJ traffic reaches the nodes through an Interaction Bridge via job-submission proxy tunneling; a permanent reservation on virtual processors is transparent when unused]

Page 38: Contributions to the modelisation and optimisation of large scale distributed computing

Scheduling SDJ

[Diagram: same architecture, with task prioritization (TP) on the nodes in place of proxy tunneling; the permanent reservation on virtual processors remains transparent when unused]

Page 39: Contributions to the modelisation and optimisation of large scale distributed computing

Scheduling tasks

[Diagram: the Interaction Bridge hosts a Scheduling Agent that streams tasks to a Worker Agent running on the node under TP]

• Coping with the submission penalty
  – N tasks, each with small latency T
    • Potential completion bandwidth 1/T
    • Impaired by the submission protocol
  – A case for application-level scheduling
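The submission-penalty argument can be made concrete: if every fine-grained task pays the middleware submission latency L, the completion bandwidth drops from 1/T to 1/(T + L); an application-level scheduling agent pays L once per worker agent instead and then streams tasks. A toy model with illustrative function names:

```python
import math

def makespan_per_task_submission(n_tasks, task_time, submit_latency):
    # Every task goes through the middleware: bandwidth 1/(T + L).
    return n_tasks * (task_time + submit_latency)

def makespan_worker_agent(n_tasks, task_time, submit_latency, n_workers):
    # Application-level scheduling: worker agents are submitted once
    # (latency paid once, in parallel), then tasks are streamed to them.
    return submit_latency + math.ceil(n_tasks / n_workers) * task_time
```

With N = 100 tasks of 1 s each and a 5 s submission latency, per-task submission takes 600 s while 10 worker agents finish in 15 s, which is the whole case for the Scheduling Agent / Worker Agent pair.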

Page 40: Contributions to the modelisation and optimisation of large scale distributed computing

Publications

• C. Germain, R. Texier and A. Osorio. Interactive Volume Reconstruction and Measurement on the Grid. Methods of Information in Medicine, 44(2):227-232, 2005.
• C. Germain, R. Texier and A. Osorio. Interactive Exploration of Medical Images on the Grid. In Procs. 2nd European HealthGrid Conference, Clermont-Ferrand, Jan. 2004.
• C. Germain, A. Osorio and R. Texier. A Case Study in Medical Imaging and the Grid. In S. Norager, editor, Procs. 1st European HealthGrid Conference, pages 110-118, Lyon, Jan. 2003. EC-IST.
• D. Berry, C. Germain-Renaud, D. Hill, S. Pieper and J. Saltz. Report on the Workshop IMAGE'03: Images, medical analysis and grid environments. TR UKeS-2004-02, UK National e-Science Centre, Feb. 2004.

Page 41: Contributions to the modelisation and optimisation of large scale distributed computing

AGIR: Analyse Globalisée des Données d'Imagerie Radiologique

• A multidisciplinary research network funded by the ACI Masses de Données
• Advances in medical imaging algorithms and their use
  – Image processing: raw computing/data power
  – Sharing data and algorithms: evaluation is a major issue
  – From algorithmic research to clinical practice
• Identify and explore new services and mechanisms required by medical imaging

Page 42: Contributions to the modelisation and optimisation of large scale distributed computing

Partners

• AlGorille, CRAN – compression
• LPC – EGEE VO Biomed; CHRU Clermont – collaborative medicine
• CREATIS – 4D segmentation
• Rainbow – software components; Epidaure – medical imaging; Centre Antoine Lacassagne
• LRI (coll. LAL), LIMSI, St Anne, Tenon, FMP – interaction & grids

14 computer scientists, 6 physicians, 6 PhD students, 4 engineers

Collaborations: EGEE, Grid5000, CNRS-STIC, CNRS-IN2P3, INRIA, INSERM, hospitals

Page 43: Contributions to the modelisation and optimisation of large scale distributed computing

A cross-section of AGIR

[Diagram linking: QVAZM3D compression; SPIHT compression; partially reliable transport protocol; network emulation; ADOC; registration algorithms; nodule CAD; evaluation via gold standard (consensus) and bronze standard (automatic); gPTM3D volume reconstruction; PTM3D calibration; grid-enabled workflow]

Page 44: Contributions to the modelisation and optimisation of large scale distributed computing

More in

• C. Germain, V. Breton, P. Clarysse, Y. Gaudeau, T. Glatard, E. Jeannot, Y. Legré, C. Loomis, J. Montagnat, J-M Moureaux, A. Osorio, X. Pennec and R. Texier. Grid-enabling medical image analysis. In Procs. 3rd BioGrid'05, Cardiff, May 2005. IEEE Press.

http://www.aci-agir.org

Page 45: Contributions to the modelisation and optimisation of large scale distributed computing


Summary

• Introduction

• A protocol and a model for Global Computing

• Grid result-checking

• Fault tolerant message-passing

• Grid-enabling medical image analysis

• Conclusions & Perspectives

Page 46: Contributions to the modelisation and optimisation of large scale distributed computing

Conclusion

Anatomy – Physiology – Ecology

• Message-passing: fault-tolerance, performance and technology constraints point in the same direction
• Revisited result-checking: Monte-Carlo computations
• Compatibility of soft-state protocols with high throughput
• An architecture for differentiated services
• A grid testbed for algorithmic and clinical research in 3D medical imaging

Page 47: Contributions to the modelisation and optimisation of large scale distributed computing

Perspectives

• Data of interest: interactive grid access requires intelligent prefetch mechanisms to capture and anticipate the way data are explored and analyzed
  – Automatic selection
  – A model for describing the resulting requirements and propagating them to the data source
  – Optimised access schemes in relation with the structure of the raw data
  – Progressivity
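One simple instance of anticipating exploration: while the user steps through a volume slice by slice, fetch ahead in the observed direction so the data-source latency is hidden behind the interaction. An illustrative policy only, not a system described here:

```python
class SlicePrefetcher:
    """Directional prefetch of volume slices (illustrative sketch)."""

    def __init__(self, fetch, depth=2):
        self.fetch = fetch      # callback actually retrieving a slice
        self.depth = depth      # how many slices to fetch ahead
        self.cache = {}
        self.last = None        # last slice index the user accessed

    def access(self, i):
        if i not in self.cache:
            self.cache[i] = self.fetch(i)       # miss: synchronous fetch
        # Infer the exploration direction from the previous access.
        step = 1 if self.last is None or i >= self.last else -1
        for k in range(1, self.depth + 1):      # prefetch ahead
            j = i + step * k
            if j not in self.cache:
                self.cache[j] = self.fetch(j)
        self.last = i
        return self.cache[i]
```

A real mechanism would prefetch asynchronously and use a richer model of the exploration pattern; the point is that the access sequence itself carries the "data of interest" signal.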

Page 48: Contributions to the modelisation and optimisation of large scale distributed computing

Perspectives

• Towards a grid observatory: optimizing grid middleware and applications requires traces and models for
  – Intrinsic characterization of « grid traffic »: e.g. the data locality parameters at a computing element
  – The reaction of the middleware components to these requirements: e.g. hits and misses
  – Spatio-temporal correlation of users – VO
  – To what extent the latter explains the former
  – MAGIE and DEMAIN projects

Page 49: Contributions to the modelisation and optimisation of large scale distributed computing


Questions