
Page 1: Lattice QCD Clusters

Lattice QCD Clusters

Amitoj Singh

Fermi National Accelerator Laboratory

Page 2: Lattice QCD Clusters

Introduction

- The LQCD Clusters
- Cluster monitoring and response
- Cluster job types
  - submission, scheduling and allocation
  - execution
- Wish List
- Questions and Answers

Page 3: Lattice QCD Clusters

The LQCD Clusters

Cluster  Processor                                                                      Nodes  MILC performance
qcd      2.8 GHz P4E, Intel E7210 chipset, 1 GB main memory, Myrinet                    127    1017 MFlops/node (0.1 TFlops)
pion     3.2 GHz Pentium 640, Intel E7221 chipset, 1 GB main memory, Infiniband SDR     518    1594 MFlops/node (0.8 TFlops)
kaon     2.0 GHz dual Opteron, nVidia CK804 chipset, 4 GB main memory, Infiniband DDR   600    3832 MFlops/node (2.2 TFlops)

Page 4: Lattice QCD Clusters

pion and qcd cluster

[Photos: pion cluster front, pion cluster back, qcd cluster back]

Page 5: Lattice QCD Clusters

kaon cluster

[Photos: kaon cluster front, kaon cluster back, kaon head-nodes & Infiniband spine]

Page 6: Lattice QCD Clusters

Cluster monitoring

Worker nodes: nannies monitor critical components and processes such as:

- health (CPU/system temperature, CPU/system fan speeds)
- batch queue clients (PBS mom) *
- disk space
- NFS mount points
- high-speed interconnects

Except for the item marked *, the nannies report any anomalies via email. For the * item a corrective action is defined; a corrective action needs to be well defined, with sufficient decision paths to fully automate the error diagnosis and recovery process. Users are sophisticated enough to report any performance-related issues.
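As a rough illustration of this pattern, here is a minimal nanny sketch in Python. The admin address, temperature threshold, sensor path, and restart command are all assumptions, not Fermilab's actual scripts; only the * item (PBS mom) gets an automated corrective action, matching the convention above.

```python
import shutil
import smtplib
import subprocess
import time
from email.message import EmailMessage

ADMIN = "lqcd-admin@example.gov"    # hypothetical admin address
TEMP_LIMIT_C = 70.0                 # hypothetical threshold

def cpu_temp_c():
    # Assumes an lm_sensors-style hwmon interface; the path varies per node.
    with open("/sys/class/hwmon/hwmon0/temp1_input") as f:
        return int(f.read()) / 1000.0

def report(subject, body):
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = "nanny@worker", ADMIN, subject
    msg.set_content(body)
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)

def pbs_mom_alive():
    return subprocess.run(["pgrep", "-x", "pbs_mom"],
                          capture_output=True).returncode == 0

while True:
    if cpu_temp_c() > TEMP_LIMIT_C:
        report("temperature anomaly", "CPU temperature over limit")   # report only
    if shutil.disk_usage("/scratch").free < 1 << 30:
        report("disk space low", "less than 1 GB free on /scratch")   # report only
    if not pbs_mom_alive():
        # PBS mom is the * item: the defined corrective action is a restart.
        subprocess.run(["service", "pbs_mom", "start"])
        report("pbs_mom restarted", "pbs_mom had exited and was restarted")
    time.sleep(60)
```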

The head-node nanny monitors critical processes such as:

- mrtg graph plotting scripts *
- automated scripts to generate the cluster status pages *
- batch queue server (PBS server)
- NFS server *

Except for the items marked *, the nanny restarts processes that have exited abnormally. All unhealthy nodes are shown blinking on the cluster status pages; cluster administrators can then analyze the mrtg plots to isolate the problem.

Network fabric: for the high-speed network interconnects:

- Nannies monitor and plot the health of critical components (switch blade temperature, chassis fan speeds) on the 128-port Myrinet spine switch. No automated corrective action has been defined for any anomalies that may occur.
- Cluster administrators can run Infiniband cluster administration tools to locate bad Infiniband cables, failing spine or leaf switch ports, and failing Infiniband HCAs. The Infiniband hardware has been reliable.

Page 7: Lattice QCD Clusters

Cluster job types

A large fraction of the jobs run on the LQCD clusters are limited by:

- memory bandwidth
- network bandwidth

[Plots: performance of memory-bandwidth-bound and network-bandwidth-bound jobs]
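Not from the slides, but a quick way to see what "memory-bandwidth bound" means in practice: in a STREAM-style triad the runtime is set by the bytes moved through memory rather than by the arithmetic. A NumPy sketch (the array size and the temporary-free byte accounting are approximations):

```python
import time
import numpy as np

N = 10_000_000                       # ~80 MB per float64 array
a = np.zeros(N)
b = np.random.rand(N)
c = np.random.rand(N)

t0 = time.perf_counter()
a[:] = b + 2.0 * c                   # STREAM-style triad kernel
dt = time.perf_counter() - t0

# Two arrays read, one written (ignoring NumPy temporaries).
bytes_moved = 3 * N * 8
print(f"effective memory bandwidth: {bytes_moved / dt / 1e9:.1f} GB/s")
```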

Page 8: Lattice QCD Clusters

Cluster job execution

OpenPBS (Torque) and the Maui scheduler schedule jobs using a "FIFO" algorithm, as follows:

- Jobs are queued in the order of submission.
- Maui runs the highest-priority (i.e. oldest) jobs in the queue in order, except that it will not start a job if any of the following are true:
  a) the job would put the number of running jobs for a particular user over the limit
  b) the job would put the total number of nodes used by a particular user over the limit
  c) the job specifies resources that cannot currently be fulfilled (e.g. a specific set of nodes requested by the user)
- If jobs are blocked by any of the above, Maui runs the next eligible job.

Under certain conditions, Maui may run a lower-priority job even when only limit (c) blocks the job at the head of the queue. This is called backfilling. Maui looks at the state of the queue and the running jobs and, based on the requested and used wall-clock times, predicts when the job blocked by (c) will be able to run. If jobs lower in the queue can run without pushing back that predicted start time, Maui runs them.
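A minimal sketch of the backfill test just described, not Maui's actual implementation; for simplicity the blocked resource is modeled as a node count, and the job names and sizes are invented:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int        # nodes requested
    walltime: float   # requested wall-clock time, in hours

def backfill_ok(candidate, free_nodes, blocked, running):
    """Can `candidate` start now without delaying the predicted start
    of `blocked`?  `running` is a list of (hours_left, nodes) pairs
    for the jobs currently on the cluster."""
    # Predict when enough nodes free up for the blocked job, using the
    # requested wall-clock times of the running jobs.
    avail, blocked_start = free_nodes, 0.0
    for hours_left, nodes in sorted(running):
        if avail >= blocked.nodes:
            break
        avail += nodes
        blocked_start = hours_left
    # Backfill only if the candidate fits in the idle nodes now and is
    # predicted to finish before the blocked job could start anyway.
    return candidate.nodes <= free_nodes and candidate.walltime <= blocked_start

# Example: 16 idle nodes, a blocked 32-node job, and a running 24-node
# job with 3 hours left: an 8-node, 2-hour job can be backfilled.
blocked = Job("big", nodes=32, walltime=24.0)
small = Job("small", nodes=8, walltime=2.0)
print(backfill_ok(small, free_nodes=16, blocked=blocked, running=[(3.0, 24)]))  # True
```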

Once a job is ready to run, a set of nodes is allocated to it exclusively for the requested wall time. Almost all jobs run on the LQCD clusters are MPI jobs. Users can refer to the PBS_NODEFILE environment variable explicitly, or it is handled inside the mpirun launch script.
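For illustration, a sketch of a launch wrapper that consumes PBS_NODEFILE; the MILC binary name is a placeholder, and the -np/-machinefile flags are the classic MPICH-style options rather than any site-specific script:

```python
import os
import subprocess

# PBS writes one hostname per allocated processor slot to this file.
nodefile = os.environ["PBS_NODEFILE"]
with open(nodefile) as f:
    hosts = [line.strip() for line in f if line.strip()]

print(f"allocated {len(hosts)} slots on {len(set(hosts))} distinct nodes")

# Launch one MPI process per slot; './milc_binary' is a placeholder.
subprocess.run(["mpirun", "-np", str(len(hosts)),
                "-machinefile", nodefile, "./milc_binary"], check=True)
```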

Page 9: Lattice QCD Clusters

Cluster job execution (cont'd)

Typical user jobs use 8, 16, or 32 nodes and run for a maximum wall time of 24 hours.

A user nanny job running on the head node executes job streams. Each job stream is a PBS job which:

- on the job head node (MPI node 0), copies a lattice (the problem) stored in dCache to the local scratch disk;
- divides the lattice among the nodes and copies the sub-lattices to each node's local scratch disk;
- launches an MPI process on each node, each of which computes its own sub-lattice;
- via the main process (MPI process 0), gathers the results from each node onto the job head node (MPI node 0) and copies the output into dCache;
- marks checkpoints at regular intervals for error recovery.

The output of one job stream is the input lattice for the next. If a job stream fails, the nanny job restarts the stream from the most recent saved checkpoint, as in the sketch below.
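The chaining and restart logic might look like the following sketch; the stream_job driver, the checkpoint layout, and the dCache (/pnfs) paths are hypothetical stand-ins for the real nanny scripts:

```python
import glob
import subprocess

def run_stream(lattice_in, lattice_out, checkpoint=None):
    """Run one job stream to completion; './stream_job' is a
    hypothetical driver that submits the PBS job and waits for it."""
    cmd = ["./stream_job", lattice_in, lattice_out]
    if checkpoint:
        cmd += ["--restart-from", checkpoint]
    return subprocess.run(cmd).returncode == 0

def latest_checkpoint(stream_id):
    """Most recent saved checkpoint for a stream, or None; the
    checkpoint layout here is purely illustrative."""
    ckpts = sorted(glob.glob(f"/scratch/ckpt/stream{stream_id}.*"))
    return ckpts[-1] if ckpts else None

# Chain the streams: each stream's output lattice is the next input.
lattice = "/pnfs/lqcd/lattices/run0.in"      # placeholder dCache path
for i in range(10):                          # number of streams, illustrative
    out = f"/pnfs/lqcd/lattices/run{i}.out"
    while not run_stream(lattice, out, latest_checkpoint(i)):
        pass  # on failure, retry from the most recent checkpoint
    lattice = out
```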

Page 10: Lattice QCD Clusters

Wish List

- The link between the monitoring process and the scheduler is missing; the scheduler could do better by being node- and network-aware.
- The ability to monitor factors that are critical to application performance (e.g. thermal instabilities can throttle the CPU clock, which ultimately hurts performance).
- More automated corrective actions: very few are defined for the components and processes that are currently being monitored.
- The ability to use current health data to predict node failures, rather than just updating mrtg plots (see the sketch below).
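As one possible shape for that last item, a hedged sketch: fit a linear trend to a node's recent temperature history and flag the node if the projection crosses a limit. The sampling window, threshold, and horizon are invented, not operational values:

```python
import statistics

def failure_risk(temps_c, limit_c=70.0, horizon_h=24):
    """Flag a node whose temperature trend, fitted by least squares to
    hourly samples (mrtg-style), is projected to cross `limit_c`
    within `horizon_h` hours.  Thresholds are illustrative."""
    n = len(temps_c)
    if n < 2:
        return False
    xs = range(n)
    x_mean = statistics.mean(xs)
    y_mean = statistics.mean(temps_c)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, temps_c))
             / sum((x - x_mean) ** 2 for x in xs))
    projected = temps_c[-1] + slope * horizon_h
    return projected > limit_c

# A node creeping up 0.5 C per hour from 55 C projects to 72.5 C
# within a day and gets flagged.
history = [55 + 0.5 * h for h in range(12)]
print(failure_risk(history))  # True
```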