david loureiro - presentation at hp's hpc & osl tes
DESCRIPTION
David Loureiro, SysFera CEO, talks about "Managing large-scale, heterogeneous infrastructures: from DIET to SysFera-DS" at HP's High Performance Computing and Open Source & Linux Technical Excellence Symposium that took place on the 19-23 March, 2012, in Grenoble, France.TRANSCRIPT
Managing large scale and heterogeneous infrastructures : From DIET to SysFera-DS
David Loureiro - Eddy Caron
SysFeraEcole Normale Supérieure de Lyon
GRAAL/AVALON Research Team
Distributed Interactive Engineering Toolbox
Outline
ContextFrom DIET…… to SysFera-DSConclusion
2
Why Large Scale systems?
First need: supercomputing at a national or international scale
Large size problems (grand challenge) need a collaboration between several codes/supercomputing centers
Always a need for more computing power, memory capacity, and disk storage
The power of any single resource is always small compared to the aggregation of several resources
Network connectivity increased quickly!• Many available resources– Many clusters– Supercomputers– Millions of PC and
workstations connected– Sharing or renting resources
• Increasing complexity of applications– Multi-scale– Multi-disciplinary– Huge data set produced– Heterogeneity
From DIET to SysFera-DS 3
Centralized or Decentralized ?
Centralized!
From DIET to SysFera-DS
(De)Centralized!
1946 ENIAC• 18.000 tubes, 30 tons, 170 m²• 2.000 tubes replaced every
months by 6 technicians
Decentralized!
1997 Google Cluster
Sky Computing2002 Earth Simulator• First computer to reach the Teraflops (40TF)• Homogeneous, Centralized, Expensive
2008 IBM Roadrunner• First computer to reach
the Petaflops
Cloud Computing• Amazon• Google• Microsoft• …
2001 TeraGrid / 2003 Grid’5000• Grid Computing
(Clusters of Clusters)
Centralized!Decentralize
d!
4
Research driven by applications
Data-centric applicationsVery Large data management (in, out, temporary)
Computer-centric applicationsGigaFlops
Community-centric applicationsData sharing (acquisition, results, ..)Resources
TeraFlop
s
PetaFlopsExaFlop
s
>30 TB data/night
Predicting Impacts of Massive Earthquakes (SDSC)
Large Hadron Collider (LHC)
I just need my simulation resultWithout an optimal scheduling? Without minimizing ressources consumption?Without any optimisation? …
Grid user point of view Single sign-onSingle compute spaceSingle data spaceSingle development
environmentFrom DIET to SysFera-DS 5
Which framework ? Holy Grail: Transparency and simplicity (maybe even before
performance) ! Scheduling tunability
Many incarnations of the Grid Grid computing Cluster computing Global computing
Many programming models Shared-State Models Message Passing Models, RPC and RMI models
Do not forget good ol’ time research on scheduling and distributed systems ! Most scheduling problems are very difficult to solve even in
their simplistic form … … but simple solutions often lead to better performance results
in real life
From DIET to SysFera-DS
peer-to-peer systems, Web Services, Clouds, …
Hybrids models Peer-to-peer models Web Services models Coordination models, …
6
Outline
ContextFrom DIET…… to SysFera-DSConclusion
7
DIET’s Goals
Our goals To develop a toolbox for the deployment of environments using the
Application Service Provider/Software as a Service (ASP/SaaS) paradigm with different applications
Use as much as possible public domain and standard software To obtain a high performance and scalable environment Implement and validate our more theoretical results
Scheduling for heterogeneous platforms, data (re)distribution and replication, performance evaluation, algorithmic for heterogeneous and distributed platforms, …
Based on CORBA and our own software developments FAST for performance evaluation, LogService for monitoring, VizDIET for the visualization, GoDIET for the deployment Dagda for the data management
Several applications in different fields (simulation, bioinformatics, …)
Release 2.8 available on the web since november ACI Grid ASP, RNTL GASP, ANR LEGO CIGC-05-11, ANR Gwendia,
Celtic-plus Project SEED4C
http://graal.ens-lyon.fr/DIET/
From DIET to SysFera-DS 8
RPC and Grid-Computing: Grid-RPC
• One simple idea – Implementing the RPC programming model over the grid – Using resources accessible through the network– Mixed parallelism model (data-parallel model at the server
level and task parallelism between the servers)
• Features needed– Load-balancing (resource localization and performance
evaluation, scheduling), – IDL, – Data and replica management, – Security, – Fault-tolerance, – Interoperability with other systems,– …
Design of a standard interface – within the OGF (Grid-RPC and SAGA WG)– Existing implementations: NetSolve/GridSolve, Ninf, DIET,
OmniRPC
From DIET to SysFera-DS 9
RPC and Grid Computing: Grid-RPC
AGENT(s)
S1 S2 S3 S4
A, B, C
Answer (C)
S2 !
Request
Op(C, A, B)
Client
From DIET to SysFera-DS 10
Client and server interface Client side
So easy … Multi-interface (C, C++, Fortran,
Java, Python, Scilab, Web Services, etc.)
Grid-RPC compliant Server side
Install and submit new server to agent (LA)
Problem and parameter description Client IDL transfer from server Dynamic services
new service new version security update outdated service Etc.
From DIET to SysFera-DS 11
Architecture overview
From DIET to SysFera-DS 12
Workflow Management
Workflow representation Direct Acyclic Graph (DAG)
Each vertex is a task Each directed edge represents
communication between tasks Functional workflows
Loops, if statements, automatic parallelism, fault-tolerance
Goals Build and execute workflows Use different heuristics to solve
scheduling problems Extensibility to address multi-
workflows submission and large grid platform
Manage heterogeneity and variability of environment
ANR Gwendia Language definition (MOTEUR & MADAG) Comparison on Grid’5000 vs EGI
Contribution to the management of large scale platforms: the DIET experience
Idle time Data transfert
Execution time
EGI (Glite) 32.857s 132.143 s 274.643 s
Grid’5000 (DIET)
0.214s 3.371 s 540.614 s 13
DIET Scheduling: Plug-in Schedulers SeD level
Performance estimation function Estimation Metric Vector - dynamic collection of performance
estimation values Performance measures available through DIET
FAST-NWS performance metrics Time elapsed since the last execution CoRI (Collector of Resource Information)
Developer defined values
Aggregation Methods Defining mechanism to sort SeD responses: associated with the
service and defined at SeD level Tunable comparison/aggregation routines for scheduling Priority Scheduler
Performs pairwise server estimation comparisons returning a sorted list of server responses;
Can minimize or maximize based on SeD estimations and taking into consideration the order in which the request for those performance estimations was specified at SeD level.
From DIET to SysFera-DS 14
CoRI-Easy Collector
FAST Collector
CoRI Manager
Other Collectors like
Ganglia
DIET Scheduling: Performance estimation
Collector of Resource Information (CoRI) Interface to gather performance information Currently 2 modules available
CoRI Easy FAST (Martin Quinson’s PhD) Sigar, GPU, etc to come…
From DIET to SysFera-DS
Measured Estimated
Max. error: 14,7 %Avg. error: 3,8 %
Extension for parallel program• Code analysis / FAST calls
combination• Allow the estimation of parallel
regular routines (ScaLAPACK-like)
15
Data Management
Three approaches for DIET DTM (LIFC, Besançon)
Hierarchical and distributed data manager Redistribution between servers
JuxMem (Paris, Rennes) P2P data cache
DAGDA (IN2P3, Clermont-Ferrand and LIP) Joining task scheduling and data management
From DIET to SysFera-DS
Standardized through GridRPC OGF WG.• Data Arrangement for Grid and
Distributed Applications Explicit data replication: Using the
API. Implicit data replication. Data replacement algorithm: LRU,
LFU AND FIFO Transfer optimization by selecting
the more convenient source. Storage resources usage
management. Data status backup/restoration. 16
Parallel & sequential jobstransparent for the usersystem dependent submission
SeDBatch
Many batch systemsBatch schedulers behaviour Internal scheduling process
Monitoring & Performance prediction Simulation (Simbatch)
Parallel and batch submissions
MA
LA
SeD
OAR
OGELSF
PBS
Loadleveler
SeDBatch
SeD//
NFS
6/03/12 From DIET to SysFera-DS
SLURM
DIET Cloud
Inside the CloudDIET platform is virtualized
inside the cloud. (as Xen image for example)
Very flexible and scalable as DIET nodes can be launched
Scheduling is more complexDIET as a Cloud manager
Eucalyptus interfaceEucalyptus is treated as a new Batch SystemProvide a new implementation for the BatchSystem
abstract class
From DIET to SysFera-DS 18
Building a nation wide experimental platform for Grid & P2P researches (like a particle accelerator for the computer
scientists) 9 geographically distributed sites hosting clusters with 256 CPUs to
1K CPUs) All sites are connected by RENATER (French Res. and Edu. Net.) Design and develop a system/middleware environment for safely test and
repeat experiments Use the platform for Grid experiments in real life conditions 4 main features:
A high security for Grid’5000 and the Internet, despite the deep reconfiguration feature
Single sign-on High-performance LRMS: OAR A user toolkit to reconfigure the nodes and monitor experiment: Kadeploy
From DIET to SysFera-DS
Grid’5000
Grid’5000
DIET deployment over a maximum of processors 1 MA, 8 LA, 540 SeDs 1120 clients on 140 machines DGEMM requests (2000x2000 matrices) Simple round-robin scheduling
19
Applications: 4 of them
From DIET to SysFera-DS
Cosmology Application
• Dark Mater Halos • Large Scale experiment on Grid’5K
• Experiment between Italia and France
Robotic Application
Climatology Application
• Forecasting of the world's environment and climate on regional to global scales
• Plug-in Scheduler
Bioinformatics Application
• BLAST• 40000 requests over 5 databases of different
sizes (from 1 to 5 GB)• Data management optimized
20
Conclusions
http://www.grid5000.org/
Grid-RPC Interesting approach for several applications Simple, flexible, and efficient Many interesting research issues (scheduling, data management,
resource discovery and reservation, deployment, fault-tolerance, …)
DIET Scalable, open-source, and multi-application platform Concentration on several issues like resource discovery,
scheduling (distributed scheduling and plugin schedulers), deployment (GoDIET and GRUDU), performance evaluation (CoRI), monitoring (LogService and VizDIET), data management and replication (DTM, JuxMem, and DAGDA)
Large scale validation on the Grid’5000 platform A middleware designed and tunable for different applications
From DIET to SysFera-DS 21
Results
A complete Middleware for heterogeneous infrastructureDIET is light to use and non-intrusiveDedicated to many applicationsDesigned for Grid and Cloud Efficient even in comparison to commercial toolsDIET is high tunability middlewareUsed in production
The DIET Team
SysFera Compagny (14 persons today) http://www.sysfera.com
From DIET to SysFera-DS 22
Future Prospects
From DIET to SysFera-DS 23
Do we need application specific schedulers ?Scheduling based on Economic Model for Cloud
PlatformDIET Green (Collaboration with RESO)
Increase the DIET capacity to deal with heterogeneous resourcesSingle System Image Cluster OSBox ClusterVirtual MachinesGPU architectureMulti-coreLarge scale architecture…
Outline
ContextFrom DIET…… to SysFera-DSConclusion
24
Who are we?
• 2001: Research project from the Graal team (Inria/ENS)– DIET: grid middleware
• 2007: SysFera-DS used within the Décrypthon project– Used in production– Selected by IBM to replace Univa-UD
• 2010: Creation of SysFera, INRIA spin-off• 2012: A team of 14 (R&D: 4 engineers and 5 PhD)
– Supported by two experts from INRIA and ENS– SysFera-DS
DécrypthonHPC management & mutualization
LILLE
JUSSIEU
BORDEAUX
ORSAY
Before SysFera-DS: • Local usage of
resources• No unique
submission interface
• 5 sites, 2 different batch schedulers
LoadLeveler
LoadLeveler LoadLeveler
LoadLeveler
LYON
OAR + Stockage
JUSSIEU
BORDEAUX
ORSAY
With SysFera-DS: • Resources mutualization• Web interface for
submission• Application specific
scheduling• Data management
Site Web de soumission
• Hardware failures hidden from theusers (automaticre-submission)
LoadLeveler
LoadLeveler LoadLeveler
LILLE
LoadLeveler
LYON
OAR + Stockage
DécrypthonHPC management & mutualization
Helping cure muscular distrophy« The Décrypthon Steering Commitee chose SysFera-DS starting on June 2007 for its qualities of robustness and modularity. It has been progressively implemented on the Décrypthon grid's ressources while ensuring a completely transparent and smooth transition for the users. » Thierry Toursel
Research Project Manager, AFM
EDF - Distributed platforms are complex
EDF - The solution
Working with a leading international company
Thanks to SysFera-DS, we can now provide our R&D engineers a stable, reliable and performant solution to access our supercomputers and computing clusters.
David BatemanICCOS Group Manager, EDF
SysFera-DS does it all
• Simple access to complex infrastructures• Advanced administration features
– User management and access control– Monitoring and reporting
• Consistent platform for application development• Integration to existing environments• Compatibility with many different resources• Non-intrusive, non-exclusive• Flexible, stable, reliable, performant
Keys benefits
Workflow & dataflow mangement & design
Efficient Management
Big Data
Collaborative Webboard
Hybrid Cloud
Heterogeneous applications
management
Offers • A software to optimize your computations• A licence to plug inside your software• Your applications migration • A webboard to manage your applications & infrastructures• Skilled competences to support these tools• Skilled competences to develop dedicated plugins
Your Soft
Our Software
Your infrastucture
Your applications
Our SoftwareOur
Software
CIMENT
Pool ressourcesCLOUD …
YourApplications
Offers
DIET« to optimize your computations & integrate your
infrastructures »
Vishnu« A set of dedicated plugins – infrastructure management »
Webboard « To manage
your applications »
YourApplications
Webboard « To manage
your applications »
YourApplications
Features overview• Meta-scheduling (load balancing), workflows management,
jobs management, data management• Resources and communications management• Launch and monitoring of jobs, file transfers, hardware and
software infrastructure through a scientific portal• User management with single sign-on• Cross network domain• Advanced and fine-grained data management• Automatic management of dynamic resources• Maintenance management• Easy deployment• Usable in user space: no need to be root• Cloud management
The WebBoard (Before SysFera)User and admin interface One app - one page
User rights managementStatistics
SysFera-DS WebBoard
Outline
• Context• From DIET…• … to SysFera-DS• Conclusion
39
An open source solution
10 avril 2023 ANR-SOP
40
The core of SysFera-DS is open-source software...
...which means anyone can use it, share it, and contribute to it.
LIPSysFera
MIS, CNRS, ENSI, ENSHEEIT, LIFC, IRISA,…
DIET Open Source SysFera-DS
Conclusion• An open source solution with two different kind of
collaborated support
SysFera-DSSysFera- Application support with industrial quality- Platfom development- New features- Personnal features - Research Grid to Production Grid- Hotline
DIETLIP - Avalon Team
- Proof of concept - Simulations- New features- Grid’5000 experiments- Scientific expertise- etc.
Acknowledgment Abdelkader Amar
Adrian Muresan
Alan Su
Amine Bsila Andréea Chis
Antoine Vernois
Barbara Walter
Benjamin Depardon
Benjamin Isnard
Bert Van Heukelom
Bruno DelFabro
Christophe Pera
Cyril Pontvieux
Cédric Tedeschi
Damien Reimert-Vasconcellos
Daouda Traore
David Loureiro
Eric Boix
Eugene Pamba Capochichi Emmanuel Quémener
Florent Rochette Frédéric Desprez Frédéric Lombard
Frédéric Suter
Gaël Le Mahec
Georg Hoesch Ghislain Charrier Haïkel Guemar Ibrahima Cissé Jean-Marc Nicod Jonathan Rouzaud-
Cornabas Kevin Coulomb Laurent Philippe Ludovic Bertsch Luis Rodero-Merino Marc Boury Martin Quinson Mathias Colin Mathieu Jan Maurice Djibril Faye
43
Nicolas Bard Ousmane Thiare Peter Frauenkron Philippe Combes Philippe Martinez Philippe Vicens Phuspinder Kaur
Chouhan Raphaël Bolze Romain Lacroix Stéphane Vialle Sylvain Dahan Vincent Pichon Yves Caniou
Questions ? http://graal.ens-lyon.fr/DIET
http://www.sysfera.comhttp://blog.sysfera.com
David Loureiro (SysFera CEO): - [email protected] - @DavidLoureiroFr - www.sysfera.com