Large scale HPC workflow management using PBS professional in a National Academic Computing Center Gérard GIL (CINES) 29 / 10 / 2010



29/10/2010 EHTC 2010 - Gérard GIL (CINES) 2

Outline

Presentation of CINES

Missions / activities

Organisation of French HPC

CINES’s HPC resources and services

Workflow Management

Existing environment

PBSpro implementation

Statistics

Next


Outline

Presentation of CINES

Missions / activities

Organisation of French HPC

CINES’s HPC resources and services

CINES is supervised and funded by the Ministry of Higher Education and Research.


CINES is located in Montpellier (South of France)

30 years serving national academic research

CINES provides the French public research community with computing resources and services.

50 staff (technicians, engineers and administrative personnel)

Presentation of CINES : Missions / activities


2 MISSIONS

High Performance Computing

Digital preservation


Presentation of CINES : Missions / activities


Digital preservation

Since 2004, CINES has held the mandate to provide long-term preservation capabilities for digital objects related to scientific and technical information

International certification process:

CINES is one of the three pilot sites (with UKDA and DNB) testing the European Certification Framework for long-term preservation, supported by the European Commission (ISO certification)

Presentation of CINES : Missions / activities

National accreditation process:

CINES is in the process of being accredited by the “Archive Nationale”


Electronic PhD thesis

The French Ministry of Higher Education designated CINES as the national center for the long-term preservation of digital theses

Presentation of CINES : Missions / activities

Digital preservation


Electronic PhD thesis

Digitized publications

• TGE-ADONIS : multimedia documents for the Research Center on Oral Resources (CRDO)

• Liber floridus : medieval manuscripts from university libraries (Mazarine, St Geneviève, IRH, …)

• …

Presentation of CINES : Missions / activities

Digital preservation


Electronic PhD thesis

Digitized publications

Pedagogic multimedia


• CANAL_U : videos of University channels (for CERIMES)

• …

Presentation of CINES : Missions / activities

Digital preservation


Electronic PhD thesis

Digitized publications

Pedagogic multimedia

Scientific datasets


A new domain directly linked to our mission in HPC

Presentation of CINES : Missions / activities

Digital preservation


High Performance Computing

National HPC Center since 1980

Scalar, vector and parallel processing

Accelerators : GPU, CELL, FPGA

Presentation of CINES : Organization of French HPC


French HPC coordination

Presentation of CINES : Organization of French HPC


Main missions of GENCI

Coordinate national academic supercomputing centers (civil activities)

Promote European HPC and participate in its organization

, , European Exascale Software Initiative,…

Promote simulation and HPC in fundamental and industrial research

Give access to its equipment to promote HPC :

HPC-PME initiative : GENCI - OSEO - INRIA

Presentation of CINES : Organization of French HPC

www.genci.fr

French academic equipment for researchers (pyramidal organization)

EUROPEAN: PRACE

NATIONAL: CCRT, CINES, IDRIS

National computing centers : very large equipment for solving extremely complex problems

Free-of-charge resources allocated to scientific projects after evaluation by thematic national committees (www.edari.fr)

REGIONAL: mid-range centers

Presentation of CINES : Organization of French HPC

From 2007 to 2010 : 20 to 700 Tflops

Main national equipment

CCRT: CEA

BULL : Novascale xeon system : 143 Tflops

NVIDIA : TESLA GPU system : 192 Tflops

CINES: Universities

SGI : Altix ICE xeon system : 267 Tflops

IDRIS: CNRS

IBM : SP / Power6 system : 68 Tflops

IBM : Blue Gene/P system : 139 Tflops

Presentation of CINES : Organization of French HPC

JADE : SGI Altix ICE 8200 EX

Linpack : 237 Tflops (No. 14 on the Top500, June 2010)

CINES national HPC resources

Presentation of CINES’s HPC resources and services

Linux SUSE

SGI Tempo : cluster administration

Altair PBSpro 10.1.4 (patched) : job scheduler

Lustre : scratch files

NFS / DMF : /home and archives


JADE

Presentation of CINES’s HPC resources and services


JADE Infiniband hypercube topology

Increase the bisection bandwidth within the same rack

[Figure panels: enhanced hypercube, standard hypercube]

Presentation of CINES’s HPC resources and services

4 dimensional Hypercube
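The connectivity implied by a 4-dimensional hypercube can be illustrated with a short sketch. It assumes the textbook binary hypercube, in which each of the 16 vertices is labelled with a 4-bit index and is linked to the vertices whose index differs in exactly one bit. How JADE's InfiniBand switches map onto these vertices is not described in the slides, so the function names and the labelling are illustrative assumptions only.

```python
from typing import List


def hypercube_neighbors(node: int, dimensions: int = 4) -> List[int]:
    """Return the vertices directly linked to `node` in a binary hypercube.

    Assumption: vertices are numbered 0 .. 2**dimensions - 1 (generic
    textbook labelling, not JADE's actual switch numbering).
    """
    if not 0 <= node < 2 ** dimensions:
        raise ValueError("node index out of range for this hypercube")
    # Flipping one bit of the index at a time gives one neighbor per dimension.
    return [node ^ (1 << bit) for bit in range(dimensions)]


def hop_distance(a: int, b: int) -> int:
    """Minimum number of links between two vertices (Hamming distance)."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print(hypercube_neighbors(5))   # the four vertices adjacent to vertex 5
    print(hop_distance(0, 15))      # worst-case hop count in 4 dimensions
```

In this model every vertex has exactly four links and no two vertices are more than four hops apart, which is why the diameter grows only logarithmically with the number of vertices.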

CINES HPC systems

Library

2 x IBM TS3500, 2000 cartridges, 7 Jaguar 3 drives, 7 LTO 4 drives

File servers

SGI Altix 450, 16 Montecito cores / 64 GB

500 TB LSI disk storage (DMF)

2 NAS FAS250 : /home

Bull Novascale

20 nodes R422, 2 sockets quad core / 8 GB

24 nodes R422-E1 + Tesla S1070 GPU

Infiniband DDR + Lustre

SGI ICE 8200 EX : 267 Tflops

1536 nodes bi-proc quad core (12288 cores), Xeon 3 GHz (Harpertown), 32 GB/node

1344 nodes bi-proc quad core (10752 cores), Xeon 2.8 GHz (Nehalem), 36 GB/node

Infiniband DDR/QDR, 700 TB (Lustre)

Backbone : 10 GbE

IBM P1600

5 nodes P575, 16 cores : Power5 / 32 GB, Infiniband DDR + GPFS : 0.5 Tflops

4 nodes P755, 32 cores : Power7 / 128 GB, Infiniband DDR + GPFS : 3 Tflops

Legend : pre/post processing resources, computing resources, storage resources

Presentation of CINES’s HPC resources and services
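The 267 Tflops quoted for the SGI ICE 8200 EX can be cross-checked from the core counts and clock speeds given on this slide. The sketch below assumes 4 double-precision floating-point operations per core per cycle for both the Harpertown and Nehalem partitions; that factor is not stated in the slides and is the only input added here.

```python
# Assumption (not stated in the slides): each core retires 4 double-precision
# floating-point operations per clock cycle on both Xeon generations.
FLOPS_PER_CYCLE = 4


def peak_tflops(cores: int, clock_ghz: float,
                flops_per_cycle: int = FLOPS_PER_CYCLE) -> float:
    """Theoretical peak in Tflops: cores x GHz x flops per cycle / 1000."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# Figures taken from the slide: nodes x 8 cores (bi-processor, quad core).
harpertown = peak_tflops(cores=1536 * 8, clock_ghz=3.0)
nehalem = peak_tflops(cores=1344 * 8, clock_ghz=2.8)
total = harpertown + nehalem

print(f"Harpertown partition: {harpertown:.1f} Tflops")
print(f"Nehalem partition:    {nehalem:.1f} Tflops")
print(f"Whole machine:        {total:.1f} Tflops")
```

Under that assumption the two partitions contribute roughly 147 and 120 Tflops, and their sum is consistent with the 267 Tflops peak figure, with the 237 Tflops Linpack result sitting below it as expected.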


1 - ENVIRONMENT

2 - COMPUTATIONAL FLUID DYNAMICS

3 - BIOMEDICAL SIMULATION & HEALTH APPLICATIONS

4 - ASTROPHYSICS & GEOPHYSICS

5 - THEORETICAL PHYSICS & PLASMA PHYSICS

6 - COMPUTER SCIENCES & MATHEMATICS

7 - MOLECULAR SYSTEMS & BIOLOGY

8 - QUANTUM CHEMISTRY & MOLECULAR DYNAMICS

9 - PHYSICS & MATERIAL SCIENCE

10 - NEW APPLICATIONS & TRANSVERSE APPLICATIONS

Annual national resource allocation campaign : eDARI

Presentation of CINES’s HPC resources and services

2010


Services for users

User support and HPC expertise : CINES offers expertise in profiling, optimisation and parallelization

Data visualization : hardware, software and expertise

Training and workshops : programming, MPI, OpenMP, …

Participation in projects : European projects, national and international collaborations

Presentation of CINES’s HPC resources and services


Outline

Presentation of CINES

Missions

Organisation of French HPC

CINES HPC resources and services

Workflow Management

Existing environment

PBSpro implementation

Statistics

Next


Workflow Management : Existing environment


Existing multi platform tools

Project management

Workload monitoring

Jobs monitoring

Jobs statistics

Accounting process

Today, only PBSpro is installed from the « PBS GridWorks suite »

Project Management : eDARI

Workflow Management : Existing environment


Machines workload and availability (admin)

Workflow Management : Existing environment


Machines workload and availability (users)

Workflow Management : Existing environment

Available on http://www.cines.fr


Graphical job monitoring: llview (Juelich Supercomputing Center) (ADMIN)

Workflow Management : Existing environment


Workflow Management : PBSpro implementation


Selected as « Job Scheduler » within the JADE procurement

Installed since 2008 (v 9.1 → v 10.1)

Several adjustments, RFEs and bug fixes to fit our needs :

node selection based on topology

accounting feature

large-job management

PBS is configured to make access to resources easy for users :

no need to know the « queue » configuration

only « walltime » and « number of nodes/cores » are required

more can be specified to enforce a specific selection or placement

« PBS professional » on JADE
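A minimal sketch of what such a submission might look like, assuming the standard PBS Professional `qsub -l select=...:ncpus=...`, `-l walltime=...` and `-l place=...` resource syntax. The helper name `build_qsub_args`, the script name `run.sh` and the example values are hypothetical; the slides do not show CINES's actual submission lines. The helper only assembles the argument list and does not invoke the scheduler.

```python
from typing import List, Optional


def build_qsub_args(nodes: int, cores_per_node: int, walltime_hours: int,
                    script: str, place: Optional[str] = None) -> List[str]:
    """Assemble a qsub command line from the two mandatory requests.

    Only the number of nodes/cores and the walltime are needed; no queue
    name is given, mirroring the configuration described on the slide.
    `place` is optional and stands for "specify more to enforce placement".
    """
    if nodes < 1 or cores_per_node < 1 or walltime_hours < 1:
        raise ValueError("nodes, cores_per_node and walltime_hours must be >= 1")
    args = [
        "qsub",
        "-l", f"select={nodes}:ncpus={cores_per_node}",
        "-l", f"walltime={walltime_hours:02d}:00:00",
    ]
    if place is not None:
        # Example values accepted by PBS Professional include "scatter" and "pack".
        args += ["-l", f"place={place}"]
    args.append(script)
    return args


if __name__ == "__main__":
    # Hypothetical job: 64 nodes of 8 cores for 24 hours.
    print(" ".join(build_qsub_args(64, 8, 24, "run.sh")))
    # Same job with an explicit placement request.
    print(" ".join(build_qsub_args(64, 8, 24, "run.sh", place="scatter")))
```

Leaving the queue out of the request is the design point: routing is decided on the server side, so users cannot pick a partition that does not suit their job.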


Workflow Management : PBSpro implementation


2 logical partitions to manage the machine's heterogeneity

Each application is assigned to a partition individually, based on its efficiency (profiling)

Largest and most efficient jobs routed to « Nehalem »

All others routed to « Harpertown »

One PBSpro « HOOK » to adapt PBS to our needs

Automatic allocation of job resources and priority

Batch access control for users

Machine selection

user authorization control

parameter checks

Dynamic node pools

User/job priority : Backfill (Top1), Fairshare, Formula

Large jobs are favored

« PBS professional » on JADE
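The routing decision described on this slide can be sketched in plain Python. A production PBS Professional hook is also written in Python, but against the scheduler's own hook module; the sketch below deliberately avoids that API and models only the decision logic. The `JobRequest` fields, the 2048-core threshold and the 0.8 efficiency cut-off are hypothetical values chosen for illustration, since the slides do not give the actual criteria.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class JobRequest:
    user: str
    cores: int
    # Hypothetical profiling result in [0, 1]; the slides only say that
    # routing is "based on efficiency (profiling)".
    parallel_efficiency: float


def select_partition(job: JobRequest, authorized_users: Set[str],
                     large_job_cores: int = 2048,
                     min_efficiency: float = 0.8) -> str:
    """Return the partition a job would be routed to.

    Both thresholds are illustrative assumptions, not CINES settings.
    """
    # User authorization control.
    if job.user not in authorized_users:
        raise PermissionError(f"user {job.user!r} is not allowed to submit")
    # Parameter check.
    if job.cores < 1:
        raise ValueError("a job must request at least one core")
    # Machine selection: largest and most efficient jobs go to Nehalem.
    if job.cores >= large_job_cores and job.parallel_efficiency >= min_efficiency:
        return "nehalem"
    return "harpertown"


if __name__ == "__main__":
    users = {"alice"}
    print(select_partition(JobRequest("alice", 4096, 0.9), users))
    print(select_partition(JobRequest("alice", 256, 0.95), users))
```

Doing this at submission time means authorization, parameter checks and partition choice happen once, before the job is queued.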


Workflow Management : Statistics

Almost 1 million jobs processed since September 2008

Up to 8000 cores per job routinely managed by PBSpro

Job walltime up to :

24 hours (standard)

120 hours (only if checkpointed)

Most CPU hours are consumed by :

jobs of more than 512 cores on the « Harpertown » partition

jobs of more than 2048 cores on the « Nehalem » partition

Cores requested by queued jobs = 10 times the machine size

Machine workload close to 90% every month

PBSpro usage: some key numbers
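Two of these figures lend themselves to a small worked example: the walltime policy and the size of the backlog. The sketch assumes that "machine size" means the 12288 + 10752 cores of the SGI ICE 8200 EX listed earlier in the deck; the function and constant names are illustrative and do not come from the CINES configuration.

```python
STANDARD_LIMIT_HOURS = 24      # standard jobs
CHECKPOINT_LIMIT_HOURS = 120   # only for checkpointed jobs


def walltime_allowed(hours: float, checkpointed: bool) -> bool:
    """Check a requested walltime against the limits quoted on the slide."""
    limit = CHECKPOINT_LIMIT_HOURS if checkpointed else STANDARD_LIMIT_HOURS
    return 0 < hours <= limit


# Assumption: "machine size" is the total core count of both JADE partitions.
MACHINE_CORES = 12288 + 10752
WAITING_CORES = 10 * MACHINE_CORES

if __name__ == "__main__":
    print(walltime_allowed(48, checkpointed=False))  # exceeds the 24 h limit
    print(walltime_allowed(48, checkpointed=True))   # within the 120 h limit
    print(f"machine size: {MACHINE_CORES} cores")
    print(f"cores requested by queued jobs: about {WAITING_CORES}")
```

With that reading, the queue holds requests for about 230,400 cores at any time, which helps explain how the workload can stay close to 90% every month.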


Workflow Management

Upgrade to V 11

Improve job priority management : FAIRSHARE + FORMULA

Improve workload with « scheduled machine maintenance »

Manage « Service Level Agreements » for users

PBSpro : next


Questions ?
