Using MPI to run parallel jobs on the Grid
Marcello Iacono Manno, Consorzio COMETA (marcello.iacono@ct.infn.it)
FESR Consorzio COMETA - Progetto PI2S2, www.consorzio-cometa.it


  • Slide 1
  • Using MPI to run parallel jobs on the Grid. Marcello Iacono Manno, Consorzio COMETA (marcello.iacono@ct.infn.it). Tutorial Grid per i Laboratori Nazionali del Sud, Catania, 26 February 2008
  • Slide 2
  • Outline
    - Overview
    - Requirements & Settings
    - How to create an MPI job
    - How to submit an MPI job to the Grid
  • Slide 3
  • Overview
    - Today, parallel applications typically require special HW/SW; on a Grid, parallel applications are the norm
    - Many applications are trivially parallelizable, and the Grid middleware already offers several parallel job types (DAG, collection)
    - A common solution for non-trivial parallelism is the Message Passing Interface (MPI):
      - based on send() and receive() primitives
      - a master node starts the slave processes by establishing SSH sessions
      - all processes can share a common workspace and/or exchange data (a minimal sketch follows below)
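    A minimal sketch (not from the original slides) of an MPI program built on the send() and
    receive() primitives just mentioned: the master (rank 0) collects one message from each
    slave process. Compile with mpicc and launch via mpirun:

        #include <mpi.h>
        #include <stdio.h>
        #include <string.h>

        int main(int argc, char *argv[]) {
            int rank, size;
            char msg[64];
            MPI_Status status;

            MPI_Init(&argc, &argv);               /* start the MPI runtime     */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* id of this process        */
            MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */

            if (rank == 0) {
                /* master: receive a greeting from every slave */
                int i;
                for (i = 1; i < size; i++) {
                    MPI_Recv(msg, sizeof(msg), MPI_CHAR, i, 0,
                             MPI_COMM_WORLD, &status);
                    printf("master received: %s\n", msg);
                }
            } else {
                /* slave: send a greeting to the master */
                snprintf(msg, sizeof(msg), "hello from rank %d of %d", rank, size);
                MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }

            MPI_Finalize();
            return 0;
        }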
  • Slide 4
  • MPI & Grid
    - There are several MPI implementations, but only two of them are currently supported by the Grid middleware: MPICH and MPICH2
    - Both the older GigaBit Ethernet and the newer low-latency InfiniBand networks are supported
    - The COMETA infrastructure will run MPI jobs on either GigaBit (MPICH, MPICH2) or InfiniBand (MVAPICH, MVAPICH2)
    - Currently, MPI parallel jobs can run inside a single Computing Element (CE) only; several projects are studying the possibility of executing parallel jobs on Worker Nodes (WNs) belonging to different CEs
  • Slide 5
  • JDL (1/3)
    - From the user's point of view, an MPI job is specified by setting the JDL JobType attribute to MPICH, MPICH2, MVAPICH or MVAPICH2, and by specifying the NodeNumber attribute as well:
        JobType = "MPICH";
        NodeNumber = 2;
    - NodeNumber defines the required number of CPU cores (PEs)
    - Matchmaking: the Resource Broker (RB) chooses a CE (if any!) with enough free Processing Elements (PE = CPU cores), i.e. free PE# >= NodeNumber (otherwise the job waits)
  • Slide 6
  • JDL (2/3)
    - When these two attributes are included in a JDL script, the following expression is automatically added to the JDL Requirements expression in order to find the best resource where the job can be executed:
        (other.GlueCEInfoTotalCPUs >= NodeNumber) &&
        Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment)
  • Slide 7
  • JDL (3/3)
    - Executable specifies the MPI executable; NodeNumber specifies the number of cores; Arguments specifies the WN command line (Executable + Arguments form the command line on the WN)
    - mpi.pre.sh is a special script file that is sourced before launching the MPI executable; warning: it runs only on the master node
    - The actual mpirun command is issued by the middleware (what if a proprietary script/binary is needed?)
    - mpi.post.sh is a special script file that is sourced after the MPI executable terminates; warning: it runs only on the master node (a sketch of both scripts follows below)
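    A minimal sketch of what the two hook scripts could contain; the echo lines match the
    "preprocessing script" / "postprocessing script" markers visible in the CPI test output
    later on, while the exported variable is a hypothetical placeholder:

        # mpi.pre.sh - sourced on the master node before the middleware issues mpirun
        echo "preprocessing script"
        # hypothetical: a variable needed only on the master node
        export MY_WORK_DIR=$PWD/work
        mkdir -p "$MY_WORK_DIR"

        # mpi.post.sh - sourced on the master node after the MPI executable terminates
        echo "postprocessing script"
        # hypothetical: list the results produced by the run
        ls -l "$MY_WORK_DIR"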
  • Slide 8
  • Requirements (1/2)
    - In order to ensure that an MPI job can run, the following requirements MUST BE satisfied:
    - The MPICH/MPICH2/MVAPICH/MVAPICH2 software must be installed, and included in the PATH environment variable, on all the WNs of the CE (see the query sketch below)
    - Some MPI applications require a file system shared among the WNs: no shared area is currently available for writing user data; the application may access the area of the master node (this requires modifications to the application); middleware solutions are also possible (as soon as required/designed/tested/deployed)
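    A sketch of how a user might check which CEs publish the MPICH tag (and therefore match
    the requirement added on Slide 6) with the standard lcg-info command; the exact attribute
    names accepted may vary with the middleware release:

        lcg-info --vo cometa --list-ce --query 'Tag=MPICH' --attrs 'CE,FreeCPUs'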
  • Slide 9
  • Requirements (2/2)
    - The job wrapper copies all the files indicated in the InputSandbox onto ALL of the slave nodes
    - Host-based SSH authentication MUST BE properly configured between all the WNs (see the sketch below)
    - If some environment variables are needed ONLY on the master node, they can be set by mpi.pre.sh
    - If some environment variables are needed ON ALL THE NODES, a static installation is currently required (a middleware extension is under consideration)
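    Host-based SSH authentication is configured by the site administrators, not by the user;
    for reference, a minimal sketch of the standard OpenSSH settings involved (actual site
    setups may differ):

        # on every WN, in /etc/ssh/sshd_config: accept host-based logins
        HostbasedAuthentication yes

        # on every WN, in /etc/ssh/ssh_config: offer host-based authentication
        HostbasedAuthentication yes
        EnableSSHKeysign yes

        # /etc/ssh/shosts.equiv lists the trusted WN host names, e.g.
        infn-wn-01.ct.pi2s2.it
        infn-wn-02.ct.pi2s2.it

        # each WN's host key must also appear in /etc/ssh/ssh_known_hosts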
  • Slide 10
  • mpi.jdl

        [
          Type = "Job";
          JobType = "MPICH";
          Executable = "MPIparallel_exec";
          NodeNumber = 2;
          Arguments = "arg1 arg2 arg3";
          StdOutput = "test.out";
          StdError = "test.err";
          InputSandbox = {"mpi.pre.sh", "mpi.post.sh", "MPIparallel_exec"};
          OutputSandbox = {"test.err", "test.out", "executable.out"};
          Requirements = other.GlueCEInfoLRMSType == "PBS" ||
                         other.GlueCEInfoLRMSType == "LSF";
        ]

    - Callouts from the slide: the Requirements line restricts the Local Resource Manager (LRMS) to PBS or LSF only; mpi.pre.sh and mpi.post.sh are the pre- and post-processing scripts; MPIparallel_exec is the executable
  • Slide 11
  • GigaBit vs InfiniBand: the advantage of using a low-latency network becomes more evident as the number of nodes grows (the performance plot from the original slide is not reproduced here; a sketch of the cpi program used in the test follows below)
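    The CPI test on the following slides runs the classic cpi example distributed with MPICH,
    which estimates pi by numerically integrating 4/(1+x^2) over [0,1] with the midpoint rule;
    a minimal sketch of that computation (the real cpi.c differs in detail):

        #include <mpi.h>
        #include <math.h>
        #include <stdio.h>

        int main(int argc, char *argv[]) {
            const double PI25DT = 3.141592653589793238462643;
            const int n = 100000000;      /* number of intervals */
            int rank, size, i;
            double h, x, sum = 0.0, mypi, pi;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            h = 1.0 / (double)n;
            /* each process sums every size-th rectangle */
            for (i = rank; i < n; i += size) {
                x = h * ((double)i + 0.5);
                sum += 4.0 / (1.0 + x * x);
            }
            mypi = h * sum;

            /* combine the partial sums on the master */
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));

            MPI_Finalize();
            return 0;
        }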
  • Slide 12
  • CPI Test (1/4)

        [marcello@infn-ui-01 mpi-0.13]$ edg-job-submit mpi.jdl
        Selected Virtual Organisation name (from proxy certificate extension): cometa
        Connecting to host infn-rb-01.ct.pi2s2.it, port 7772
        Logging to host infn-rb-01.ct.trigrid.it, port 9002

        *************************************************************
        JOB SUBMIT OUTCOME
        The job has been successfully submitted to the Network Server.
        Use edg-job-status command to check job current status.
        Your job identifier (edg_jobId) is:
        - https://infn-rb-01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw
        *************************************************************
  • Slide 13
  • CPI Test (2/4)

        [marcello@infn-ui-01 mpi-0.13]$ edg-job-status https://infn-rb-01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw

        *************************************************************
        BOOKKEEPING INFORMATION:

        Status info for the Job : https://infn-rb-01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw
        Current Status:   Done (Success)
        Exit code:        0
        Status Reason:    Job terminated successfully
        Destination:      infn-ce-01.ct.pi2s2.it:2119/jobmanager-lcglsf-short
        reached on:       Sun Jul 1 15:08:11 2007
        *************************************************************
  • Slide 14
  • CPI Test (3/4)

        [marcello@infn-ui-01 mpi-0.13]$ edg-job-get-output --dir /home/marcello/JobOutput/ https://infn-rb-01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw

        Retrieving files from host: infn-rb-01.ct.pi2s2.it ( for https://infn-rb-01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw )

        *************************************************************
        JOB GET OUTPUT OUTCOME

        Output sandbox files for the job:
        - https://infn-rb-01.ct.pi2s2.it:9000/vYGU1UUfRnSktGODcwEjMw
        have been successfully retrieved and stored in the directory:
        /home/marcello/JobOutput/marcello_vYGU1UUfRnSktGODcwEjMw
        *************************************************************
  • Slide 15
  • CPI Test (4/4)

        [marcello@infn-ui-01 mpi-0.13]$ cat /home/marcello/JobOutput/marcello_vYGU1UUfRnSktGODcwEjMw/test.out
        preprocessing script
        -------------------------
        infn-wn-01.ct.pi2s2.it
        Process 0 of 4 on infn-wn-01.ct.pi2s2.it
        pi is approximately 3.1415926544231239, Error is 0.0000000008333307
        wall clock time = 10.002570
        Process 1 of 4 on infn-wn-01.ct.pi2s2.it
        Process 3 of 4 on infn-wn-02.ct.pi2s2.it
        Process 2 of 4 on infn-wn-02.ct.pi2s2.it
        TID  HOST_NAME   COMMAND_LINE      STATUS   TERMINATION_TIME
        ==== =========== ================= ======== ===================
        0001 infn-wn-01  /opt/lsf/6.1/lin  Done     07/01/2007 17:04:23
        0002 infn-wn-01  /opt/lsf/6.1/lin  Done     07/01/2007 17:04:23
        0003 infn-wn-02  /opt/lsf/6.1/lin  Done     07/01/2007 17:04:23
        0004 infn-wn-02  /opt/lsf/6.1/lin  Done     07/01/2007 17:04:23
        P4 procgroup file is /home/cometa005/.lsf_6826_genmpi_pifile.
        postprocessing script
        temporary
        [marcello@infn-ui-01 mpi-0.13]$
  • Slide 16
  • MPI on the web
    - https://edms.cern.ch/file/454439/LCG-2-UserGuide.pdf
    - http://oscinfo.osc.edu/training/
    - http://www.netlib.org/mpi/index.html
    - http://www-unix.mcs.anl.gov/mpi/learning.html
    - http://www.ncsa.uiuc.edu/UserInfo/Training
  • Slide 17
  • Questions