Lenovo Intelligent Computing Orchestration
5.1.0 Installation Guide
Publication Date: 05/04/2018
Version: 1.0
Contents
1 Overview
1.1 Introduction to LiCO
1.2 Operating Environment
1.3 Prerequisites
1.4 Instructions
2 Deploying cluster environment
2.1 Installing an OS on the Management Node
2.2 Deploying the OS on Other Nodes in the Cluster
2.2.1 Configuring Environmental Variables
2.2.2 Get Local Repository
2.2.3 Installing Lenovo xCAT
2.2.4 Prepare OS Mirrors for Other Nodes
2.2.5 Set xCAT node information
2.2.6 Add Host Resolution
2.2.7 Configuring DHCP and DNS Services
2.2.8 Installing a Node OS through the Network
2.2.9 Checkpoint A
2.3 Installing Infrastructure Software for Node
2.3.1 List of Infrastructure Software to be installed
2.3.2 Set Local Yum Repository for Management Node
2.3.3 Configuring Local Yum Repository for Compute and Login Nodes
2.3.4 Configuring LiCO Dependencies Repository
2.3.5 Installing Slurm
2.3.6 Configuring NFS
2.3.7 Configuring NTP
2.3.8 Installing CUDA
2.3.9 Configuring Slurm
2.3.10 Installing Ganglia
2.3.11 Installing MPI
2.3.12 Installing Singularity
2.3.13 Checkpoint B
3 Installing LiCO Dependencies
3.1 List of LiCO Dependencies to be installed
3.2 Installing RabbitMQ
3.3 Installing PostgreSQL
3.4 Installing InfluxDB
3.5 Installing Confluent
3.6 Configuring user authentication
3.6.1 Installing OpenLDAP-server
3.6.2 Installing libuser
3.6.3 Installing openldap-client
3.6.4 Installing nss-pam-ldapd
3.7 Installing Gmond GPU Plug-In
4 Installing LiCO
4.1 List of LiCO Components to be installed
4.2 Getting the LiCO Installation Package
4.3 Configuring the Local Yum Depository for LiCO
4.4 Installing the Management Node
4.5 Installing the Login Node
4.6 Installing the Compute Node
5 Configuring LiCO
5.1 Configuring Service Account
5.2 Configuring Cluster Nodes
5.2.1 Room Information
5.2.2 Logic Group Information
5.2.3 Room Row Information
5.2.4 Rack Information
5.2.5 Chassis Information
5.2.6 Node Information
5.3 Configuring LiCO Services
5.3.1 Infrastructure Configuration
5.3.2 Database Configuration
5.3.3 Login Configuration
5.3.4 Storage Configuration
5.3.5 Scheduler Configuration
5.3.6 Alert Configuration
5.3.7 Cluster Configuration
5.3.8 Functional Configuration
5.4 Configuring LiCO Components
5.4.1 lico-vnc-mond
5.4.2 lico-env
5.4.3 lico-portal
5.4.4 lico-ganglia-mond
5.4.5 lico-confluent-proxy
5.4.6 lico-confluent-mond
5.4.7 lico-wechat-agent
5.5 Initializing the System
5.6 Initializing Users
5.7 Importing System Images
6 Starting LiCO
7 Appendix
7.1 Configuring VNC
7.2 Configuring Confluent web console
7.2.1 RHEL
7.2.2 CentOS
7.3 LiCO commands
7.3.1 Set the LDAP administrator password
7.3.2 Change user's role
7.3.3 Resume user
7.3.4 Import user
7.3.5 Import AI image
7.4 Cluster Service Summary
7.5 Security improvement
7.5.1 Binding setting
7.5.2 Firewall setting
7.6 slurm.conf
7.7 gres.conf
7.8 Chassis Model List
7.9 Product List
7.10 Import system image
7.10.1 Create image
7.10.2 Import images into LiCO as system level image
7.11 Troubleshooting Slurm issues
7.12 Update OS packages
7.13 Using a newer kernel with RETPOLINE support
1 Overview
1.1 Introduction to LiCO
Lenovo Intelligent Computing Orchestration (LiCO) is infrastructure management software for high performance computing (HPC) and artificial intelligence (AI). It provides cluster management and monitoring, job scheduling and management, cluster user management, account management, and file system management. With LiCO, users can centralize resource allocation in one supercomputing cluster, and the software can run HPC and AI jobs simultaneously. Users can operate LiCO either through the management system interface in a browser, or through the command line after logging into a cluster login node with a Linux shell.
1.2 Operating Environment
Servers:
Lenovo ThinkSystem servers.
Operating System:
Red Hat Enterprise Linux (abbr. RHEL) 7.4/CentOS 7.4
Client:
Browser: Chrome (v. 62.0 or higher) or Firefox (v. 56.0 or higher) is recommended. Display Resolution: 1280 x 800 is recommended
1.3 Prerequisites
Before installation, refer to the LiCO best recipe to make sure the cluster hardware uses the proper firmware levels, drivers, and settings. You can get the best recipe document from the link below: https://support.lenovo.com/us/en/solutions/ht506408
Before installation, refer to the OSes part of the LeSI 18A_SI best recipe to install the OS security patches. You can get that best recipe document from the link below: https://datacentersupport.lenovo.com/us/en/solutions/HT506335
The installation described in this Guide is based on CentOS 7.4 and is intended as a quick overview of LiCO. For RHEL 7.4, you can follow similar steps.
Set up a CentOS/RedHat base repository (online or local) on the management node. Unless stated otherwise in this Guide, all commands run on the management node. If you must open the firewall, refer to Cluster Service Summary to modify the firewall rules.
You are responsible for regularly updating the components and the OS; regular patching and updating prevents security vulnerabilities. For how to update OS packages, refer to chapter 7.12 Update OS packages.
This document covers the typical cluster that contains management, login, and compute nodes, as shown in the figure below. However, LiCO also supports clusters that contain only management and compute nodes. For that kind of cluster, all the LiCO modules normally installed on the login node must be installed on the management node instead.
Management node: the core of the HPC/AI cluster, undertaking primary functions such as cluster management, monitoring, scheduling, strategy management, and user and account management.
Compute node: as the name implies, the compute node completes computing tasks.
Login node: connects the cluster to the external network or cluster. Users log into the login node to upload application data, develop and compile programs, and submit scheduled tasks.
Parallel file system: provides shared storage, connected to the cluster nodes via a high-speed network. Parallel file system setup is outside the scope of this Guide; a simple NFS setup is used instead.
Node BMC interface: used to access the node's BMC system.
Node eth interface: the Ethernet interface used to manage the nodes in the cluster; it can also be used to transfer computing data.
High-speed network interface: optional. It is usually used to support the parallel file system, and can also be used to transfer computing data.
1.4 Instructions
This guide is a PDF document. To make sure that command lines copy and paste correctly, open this document with Adobe Acrobat Reader. Adobe Acrobat Reader is a free PDF viewer available from the official website: https://get.adobe.com/reader/
Replace the <*_USERNAME> and <*_PASSWORD> parts in this document with your actual usernames and passwords.
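As an illustration (not part of the official procedure), a placeholder in a command copied from this guide can be filled in with sed before running it. The password below is an example value only:

```shell
# Example only: substitute a <ROOT_PASSWORD> placeholder in a copied command
# (the command is taken from this guide; the password is a placeholder).
cmd='chtab key=system passwd.username=root passwd.password=<ROOT_PASSWORD>'
echo "$cmd" | sed 's/<ROOT_PASSWORD>/MySecret123/'
```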
2 Deploying cluster environment
If the cluster environment already exists, you may skip this chapter. (Check the infrastructure software list to confirm that the software is already installed and that the cluster can pass Checkpoint A and Checkpoint B.)
2.1 Installing an OS on the Management Node
Install an official version of CentOS 7.4 on the management node; the minimum installation is sufficient.
2.2 Deploying the OS on Other Nodes in the Cluster
2.2.1 Configuring Environmental Variables
After logging into the management node, run the commands below to configure environmental variables for the entire installation process:
su root
cd ~
vi lico_env.local
Based on the following prompts, edit lico_env.local and save it. (In the final file, you may omit all comments starting with #.) Note: This guide assumes that all nodes share the same BMC username and password. If they differ, modify the values when you reach: Set xCAT node information.
# Management node hostname
sms_name="head"
# Set the domain name
domain_name="hpc.com"
# Set OpenLDAP domain name
lico_ldap_domain_name="dc=hpc,dc=com"
# IP address of management node in the cluster intranet
sms_ip="192.168.0.1"
# Network interface corresponding to the management node IP address
sms_eth_internal="eth0"
# Subnet mask in the cluster intranet. If all nodes in the cluster already have OS
# installed, retain the default configurations.
internal_netmask="255.255.0.0"
# BMC username and password
bmc_username="<BMC_USERNAME>"
bmc_password="<BMC_PASSWORD>"
# OS mirror pathway for xCAT
iso_path="/isos"
# Local Yum repository directory for OS
os_repo_dir="/install/custom/server"
sdk_repo_dir="/install/custom/sdk"
# Local Yum repository directory for xCAT
xcat_repo_dir="/install/custom/xcat"
# Local Yum repository directory for Lenovo OpenHPC
ohpc_repo_dir="/install/custom/ohpc"
# Local Yum repository directory for LiCO-dep
lico_dep_repo_dir="/install/custom/lico-dep"
# Local Yum repository directory for LiCO
lico_repo_dir="/install/custom/lico"
# Total compute nodes
num_computes="2"
# Prefix of compute node hostname. If OS has already been installed on all the
# nodes of the cluster, change the configuration according to actual conditions.
compute_prefix="c"
# Compute node hostname list. If OS has already been installed on all the
# nodes of the cluster, change the configuration according to actual conditions.
c_name[0]=c1
c_name[1]=c2
# Compute node IP list. If OS has already been installed on all the
# nodes of the cluster, change the configuration according to actual conditions.
c_ip[0]=192.168.0.6
c_ip[1]=192.168.0.16
# Network interface card MAC address corresponding to the compute node IP. If OS
# has already been installed on all the nodes of the cluster, change the
# configuration according to actual conditions.
c_mac[0]=fa:16:3e:73:ec:50
c_mac[1]=fa:16:3e:27:32:c6
# Compute node BMC address list.
c_bmc[0]=192.168.1.6
c_bmc[1]=192.168.1.16
# Total login nodes
num_logins="1"
# Login node hostname list. If OS has already been installed on all the nodes
# of the cluster, change the configuration according to actual conditions.
l_name[0]=l1
# Login node IP list. If OS has already been installed on all the nodes
# of the cluster, change the configuration according to actual conditions.
l_ip[0]=192.168.0.15
#Network interface card MAC address corresponding to the login node IP.
# If OS has already been installed on all the nodes of the cluster, change
# the configuration according to actual conditions.
l_mac[0]=fa:16:3e:2c:7a:47
# Login node BMC address list.
l_bmc[0]=192.168.1.15
Run the following commands to make the configuration file take effect:
chmod 600 lico_env.local
source lico_env.local
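Before continuing, it can help to verify that the variables were loaded and that the node lists agree with the declared totals. The helper below is a sketch, not part of the official procedure; it assumes bash and a lico_env.local in the current directory:

```shell
# Hypothetical sanity check (not from the guide): confirm that the node lists
# in lico_env.local match num_computes and num_logins before proceeding.
check_lico_env() {
    source ./lico_env.local
    [ "${#c_name[@]}" -eq "$num_computes" ] || { echo "compute hostname list mismatch" >&2; return 1; }
    [ "${#c_ip[@]}"   -eq "$num_computes" ] || { echo "compute IP list mismatch" >&2; return 1; }
    [ "${#l_name[@]}" -eq "$num_logins"   ] || { echo "login hostname list mismatch" >&2; return 1; }
    echo "lico_env.local looks consistent"
}
if [ -f ./lico_env.local ]; then
    check_lico_env
else
    echo "lico_env.local not found; skipping check"
fi
```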
After the cluster environment is set up, configure a public network IP on the login or management node so that the LiCO web portal can be reached from the external network.
2.2.2 Get Local Repository
Create a directory to store the ISO images:
mkdir -p ${iso_path}
CentOS: Download CentOS-7-x86_64-Everything-1708.iso from the official website, copy it to the path ${iso_path}, and run the commands below:
# Run the command below to get the verification code of the iso file. Get the
# official verification code from the download site and make sure they match.
cd ${iso_path}
sha256sum CentOS-7-x86_64-Everything-1708.iso
cd ~
#mount image
mkdir -p ${os_repo_dir}
mount -o loop ${iso_path}/CentOS-7-x86_64-Everything-1708.iso ${os_repo_dir}
#configuration local repository
cat << eof > ${iso_path}/EL7-OS.repo
[EL7-OS]
name=el7-centos
enabled=1
gpgcheck=0
type=rpm-md
baseurl=file://${os_repo_dir}
eof
cp -a ${iso_path}/EL7-OS.repo /etc/yum.repos.d/
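The manual checksum comparison above can also be scripted. The helper below is a sketch (the expected value passed in is a placeholder; take the real checksum from the official download page):

```shell
# Sketch: compute the SHA-256 of a file and compare it with an expected value
# taken from the official download page.
verify_iso() {
    local iso="$1" expected="$2" actual
    actual=$(sha256sum "$iso" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "checksum OK: $iso"
    else
        echo "checksum MISMATCH: $iso" >&2
        return 1
    fi
}
# Example (placeholder checksum, not the real one for the CentOS iso):
# verify_iso ${iso_path}/CentOS-7-x86_64-Everything-1708.iso <EXPECTED_SHA256>
```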
RHELS: Copy the RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso and RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso.MD5SUM files to the ${iso_path} directory and run the following commands:
#Check the validity of the iso file:
cd ${iso_path}
md5sum -c RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso.MD5SUM
cd ~
#mount image
mkdir -p ${os_repo_dir}
mount -o loop ${iso_path}/RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso ${os_repo_dir}
#configuration local repository
cat << eof > ${iso_path}/RHELS74-OS.repo
[RHELS7-OS]
name=RHELS7-OS
enabled=1
gpgcheck=0
type=rpm-md
baseurl=file://${os_repo_dir}
eof
cp -a ${iso_path}/RHELS74-OS.repo /etc/yum.repos.d/
2.2.3 Installing Lenovo xCAT
Download the package: https://hpc.lenovo.com/downloads/18a/xcat-2.13.8.lenovo3_confluent-1.8.2_lenovo_confluent-0.8.1-el7.tar.bz2
Upload the package to the management node, and then run the commands below to install xCAT:
# Create xcat local repository
yum install -y bzip2
mkdir -p $xcat_repo_dir
tar -xvf xcat-2.13.8.lenovo3_confluent-1.8.2_lenovo_confluent-0.8.1-el7.tar.bz2 -C $xcat_repo_dir
cd $xcat_repo_dir/lenovo-hpc-el7
./mklocalrepo.sh
cd ~
# install xcat
yum install -y xCAT
systemctl start xcatd
source /etc/profile.d/xcat.sh
2.2.4 Prepare OS Mirrors for Other Nodes
If all nodes in the cluster have an OS installed, you may skip this step. CentOS: Please perform the following command to prepare the OS image for the other nodes:
copycds ${iso_path}/CentOS-7-x86_64-Everything-1708.iso
RHELS: Please perform the following command to prepare the OS image for the other nodes:
copycds ${iso_path}/RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso
CentOS: Run the commands below to confirm that the OS image has been copied.
lsdef -t osimage
#Output should be
centos7.4-x86_64-install-compute (osimage)
centos7.4-x86_64-netboot-compute (osimage)
centos7.4-x86_64-statelite-compute (osimage)
RHELS: Run the commands below to confirm that the OS image has been copied.
lsdef -t osimage
#Output should be
rhels7.4-x86_64-install-compute (osimage)
rhels7.4-x86_64-netboot-compute (osimage)
rhels7.4-x86_64-statelite-compute (osimage)
CentOS: The Nouveau module is an accelerated open-source driver for NVIDIA cards. Following the NVIDIA official installation guide, this module should be disabled before installing the CUDA driver:
chdef -t osimage centos7.4-x86_64-install-compute addkcmdline="rdblacklist=nouveau nouveau.modeset=0 R::modprobe.blacklist=nouveau"
RHELS: The Nouveau module is an accelerated open-source driver for NVIDIA cards. Following the NVIDIA official installation guide, this module should be disabled before installing the CUDA driver:
chdef -t osimage rhels7.4-x86_64-install-compute addkcmdline="rdblacklist=nouveau nouveau.modeset=0 R::modprobe.blacklist=nouveau"
2.2.5 Set xCAT node information
Run the commands below to import the compute node configuration in the lico_env.local file to xCAT:
for ((i=0; i<$num_computes; i++)); do
  mkdef -t node ${c_name[$i]} groups=compute,all arch=x86_64 netboot=xnba mgt=ipmi bmcusername=${bmc_username} bmcpassword=${bmc_password} ip=${c_ip[$i]} mac=${c_mac[$i]} bmc=${c_bmc[$i]} serialport=0 serialspeed=115200
done
Run the commands below to import the login node configuration in the lico_env.local file to xCAT:
for ((i=0; i<$num_logins; i++)); do
  mkdef -t node ${l_name[$i]} groups=login,all arch=x86_64 netboot=xnba mgt=ipmi bmcusername=${bmc_username} bmcpassword=${bmc_password} ip=${l_ip[$i]} mac=${l_mac[$i]} bmc=${l_bmc[$i]} serialport=0 serialspeed=115200
done
Note: If the BMC username and password differ between nodes, run the following command to modify them:
tabedit ipmi
Run the command below to configure the root account password for the nodes (replace <ROOT_PASSWORD> with the password you want to set):
chtab key=system passwd.username=root passwd.password=<ROOT_PASSWORD>
2.2.6 Add Host Resolution
If the cluster already has an OS installed and can resolve IP addresses through hostnames, skip this step. Run the commands below to add host resolution:
chdef -t site domain=${domain_name}
chdef -t site master=${sms_ip}
chdef -t site nameservers=${sms_ip}
sed -i "/^\s*${sms_ip}\s*.*$/d" /etc/hosts
sed -i "/\s*${sms_name}\s*/d" /etc/hosts
echo "${sms_ip} ${sms_name} ${sms_name}.${domain_name} " >> /etc/hosts
makehosts
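For reference, the echo above writes a hosts entry of the form `<ip> <hostname> <hostname>.<domain>`. That format can be sketched as a small helper (the values below are the example settings from lico_env.local):

```shell
# Sketch: build the management node's /etc/hosts line from the lico_env.local
# variables (IP address, short hostname, domain name).
make_hosts_line() {
    printf '%s %s %s.%s\n' "$1" "$2" "$2" "$3"
}
make_hosts_line 192.168.0.1 head hpc.com
# -> 192.168.0.1 head head.hpc.com
```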
2.2.7 Configuring DHCP and DNS Services
If all nodes in the cluster have OS installed, skip this step.
makenetworks
makedhcp -n
makedns -n
2.2.8 Installing a Node OS through the Network
If all nodes in the cluster have OS installed, skip this step. CentOS: Run the commands below to set and install the necessary OS mirror.
nodeset all osimage=centos7.4-x86_64-install-compute
rsetboot all net -u
rpower all reset
RHELS: Run the commands below to set and install the necessary OS mirror.
nodeset all osimage=rhels7.4-x86_64-install-compute
rsetboot all net -u
rpower all reset
It takes several minutes to finish the OS installation; you can use the command below to check the progress:
nodestat all
2.2.9 Checkpoint A
Run the commands below to check if installation is complete:
psh all uptime
#Output should be
c1: 05:03am up 0:02, 0 users, load average: 0.20, 0.13, 0.05
c2: 05:03am up 0:02, 0 users, load average: 0.20, 0.14, 0.06
l1: 05:03am up 0:02, 0 users, load average: 0.17, 0.13, 0.05
……
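As a rough check (an assumption, not from the guide), the number of responding nodes in the psh output can be compared with the expected total:

```shell
# Sketch: count nodes that reported an uptime line (each such line contains
# "load average") and compare with the expected total from lico_env.local.
count_up_nodes() {
    grep -c 'load average'
}
# Usage on the management node (assumes lico_env.local has been sourced):
#   up=$(psh all uptime | count_up_nodes)
#   echo "responding: ${up} of $(( num_computes + num_logins )) nodes"
```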
2.3 Installing Infrastructure Software for Node
2.3.1 List of Infrastructure Software to be installed
The installation node fields are expressed as follows: H: Management node, L: Login node, C: Compute node

Software      Component name        Version   Service name        Node    Notes
nfs           nfs-utils             1.3.0     nfs-server          H
ntp           ntp                   4.2.6     ntpd                H
slurm         ohpc-slurm-server     1.3.3     munge, slurmctld    H
              ohpc-slurm-client     1.3.3     munge, slurmd       C,L
ganglia       ganglia-gmond-ohpc    3.7.2     gmond               H,C,L
singularity   singularity-ohpc      2.4                           H
cuda          cudnn                 7                             C       Only needs to be installed on the GPU nodes
              cuda                  9.1                           C
mpi           openmpi3-gnu7-ohpc    3.0.0                         H       Install at least one of the three types of MPI
              mpich-gnu7-ohpc       3.2                           H
              mvapich2-gnu7-ohpc    2.2                           H
2.3.2 Set Local Yum Repository for Management Node
Download the package: https://hpc.lenovo.com/lico/downloads/5.1/Lenovo-OpenHPC-1.3.3.CentOS_7.x86_64.tar and upload it to the management node. Run the commands below to configure the local Lenovo OpenHPC repository:
mkdir -p $ohpc_repo_dir
tar xvf Lenovo-OpenHPC-1.3.3.CentOS_7.x86_64.tar -C $ohpc_repo_dir
$ohpc_repo_dir/make_repo.sh
2.3.3 Configuring Local Yum Repository for Compute and Login Nodes
Run the commands below to install the Yum toolkit:
psh all yum --setopt=\*.skip_if_unavailable=1 install -y yum-utils
Run the commands below to add a local repository:
cp /etc/yum.repos.d/Lenovo.OpenHPC.local.repo /var/tmp
sed -i '/^baseurl=/d' /var/tmp/Lenovo.OpenHPC.local.repo
sed -i '/^gpgkey=/d' /var/tmp/Lenovo.OpenHPC.local.repo
echo "baseurl=http://${sms_name}/${ohpc_repo_dir}/CentOS_7" >> /var/tmp/Lenovo.OpenHPC.local.repo
echo "gpgkey=http://${sms_name}/${ohpc_repo_dir}/CentOS_7/repodata/repomd.xml.key" >> /var/tmp/Lenovo.OpenHPC.local.repo
# Distribute files
xdcp all /var/tmp/Lenovo.OpenHPC.local.repo /etc/yum.repos.d/
psh all echo -e %_excludedocs 1 \>\> ~/.rpmmacros
Run the following command to disable access to external yum repositories. Perform this step according to your actual situation: if the operating system installation does not include enough packages, subsequent installation steps may fail with the external repositories disabled:
psh all yum-config-manager --disable CentOS\*
2.3.4 Configuring LiCO Dependencies Repository
Download the package: https://hpc.lenovo.com/lico/downloads/5.1/lico-dep-5.1.el7.x86_64.tgz and upload it to the management node. Run the commands below to configure the Yum repository for the management node. Make sure the management node has a local operating system Yum repository configured before performing the following actions:
mkdir -p $lico_dep_repo_dir
tar xvf lico-dep-5.1.el7.x86_64.tgz -C $lico_dep_repo_dir
$lico_dep_repo_dir/mklocalrepo.sh
If the cluster already exists, check whether your installed versions are consistent with the list in chapter 3.1 List of LiCO Dependencies to be installed. Run the commands below to configure the Yum repository for the other nodes:
cp /etc/yum.repos.d/lico-dep.repo /var/tmp
sed -i '/^baseurl=/d' /var/tmp/lico-dep.repo
sed -i '/^gpgkey=/d' /var/tmp/lico-dep.repo
echo "baseurl=http://${sms_name}/${lico_dep_repo_dir}" >> /var/tmp/lico-dep.repo
echo "gpgkey=http://${sms_name}/${lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-EL7" >> /var/tmp/lico-dep.repo
# Distribute files
xdcp all /var/tmp/lico-dep.repo /etc/yum.repos.d
2.3.5 Installing Slurm
Run the commands below to install the base package:
yum install -y lenovo-ohpc-base
Run the commands below to install Slurm:
yum install -y ohpc-slurm-server
Run the commands below to install the Slurm client:
psh all yum install -y ohpc-base-compute ohpc-slurm-client lmod-ohpc
The following optional command prevents non-root logins to the compute nodes unless the user logging in already has a Slurm job running on that node:
psh all echo "\""account required pam_slurm.so"\"" \>\> /etc/pam.d/sshd
2.3.6 Configuring NFS
Run the following commands to create the share directory of /opt/ohpc/pub. This directory is necessary. If you have already shared this directory from the management node and mounted it on all of the other nodes, you can skip this step.
#Management node shares /opt/ohpc/pub for OpenHPC
yum install -y nfs-utils
echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports
exportfs -a
# Installing NFS for Cluster Nodes
psh all yum install -y nfs-utils
#Configure shared directory for cluster nodes
psh all mkdir -p /opt/ohpc/pub
psh all echo "\""${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0"\"" \>\> /etc/fstab
#Mount shared directory
psh all mount /opt/ohpc/pub
Run the following commands to create the user share directory. This document takes /home as an example; you can choose another directory.
#Management node sharing /home
echo "/home *(rw,no_subtree_check,fsid=10,no_root_squash)" >> /etc/exports
exportfs -a
# if /home already mounted, unmount it first
psh all "sed -i '/ \/home /d' /etc/fstab"
psh all umount /home
#Configure shared directory for cluster nodes
psh all echo "\""${sms_ip}:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0"\"" \>\> /etc/fstab
#Mount shared directory
psh all mount /home
2.3.7 Configuring NTP
If the NTP service has already been configured for the nodes in the cluster, skip this step. Otherwise, run the commands below:
echo "server 127.127.1.0" >> /etc/ntp.conf
echo "fudge 127.127.1.0 stratum 10" >> /etc/ntp.conf
systemctl enable ntpd
systemctl start ntpd
psh all yum install -y ntp
psh all echo "\""server ${sms_ip}"\"" \>\> /etc/ntp.conf
psh all systemctl enable ntpd
psh all systemctl start ntpd
# check service
psh all "ntpq -p | tail -n 1"
2.3.8 Installing CUDA
Run the commands below to install CUDA and cuDNN on all the GPU compute nodes (if only a subset of nodes have GPUs, replace the "compute" argument in the psh commands with the node range corresponding to the GPU nodes).
1. Install CUDA: Download cuda_9.1.85_387.26_linux.run from https://developer.nvidia.com/cuda-downloads and copy it to the share directory /home. If the operating system is configured to boot to a graphical desktop, run the commands below to configure it to boot to the text console, and then restart the system:
psh compute systemctl set-default multi-user.target
psh compute reboot
Download the NVIDIA driver from http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/tesla/390.46/nvidia-diag-driver-local-repo-rhel7-390.46-1.0-1.x86_64.rpm&lang=us&type=Tesla and copy it to the share directory /home.
Run the commands below to install CUDA:
psh compute yum install -y kernel-devel gcc gcc-c++
psh compute /home/cuda_9.1.85_387.26_linux.run --silent --toolkit --samples --no-opengl-libs --verbose --override
2. Install cuDNN: Download cuDNN 7.0.5 (the downloaded package is cudnn-9.1-linux-x64-v7.tgz) from https://developer.nvidia.com/cudnn to the directory /root. Run the commands below to install cuDNN:
cd ~
tar -xvf cudnn-9.1-linux-x64-v7.tgz
xdcp compute cuda/include/cudnn.h /usr/local/cuda/include
xdcp compute cuda/lib64/libcudnn_static.a /usr/local/cuda/lib64
xdcp compute cuda/lib64/libcudnn.so.7.0.5 /usr/local/cuda/lib64
psh compute "ln -s /usr/local/cuda/lib64/libcudnn.so.7.0.5 /usr/local/cuda/lib64/libcudnn.so.7"
psh compute "ln -s /usr/local/cuda/lib64/libcudnn.so.7 /usr/local/cuda/lib64/libcudnn.so"
psh compute chmod a+r /usr/local/cuda/include/cudnn.h
psh compute chmod a+r /usr/local/cuda/lib64/libcudnn*
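The two ln -s commands above recreate the symbolic-link chain the dynamic linker expects (libcudnn.so → libcudnn.so.7 → libcudnn.so.7.0.5). A local sketch with a dummy file in /tmp:

```shell
mkdir -p /tmp/cudnn-demo
cd /tmp/cudnn-demo
rm -f libcudnn.so.7.0.5 libcudnn.so.7 libcudnn.so
touch libcudnn.so.7.0.5                 # stands in for the real library file
ln -s libcudnn.so.7.0.5 libcudnn.so.7   # versioned soname link used at run time
ln -s libcudnn.so.7 libcudnn.so         # unversioned link used at link time
readlink -f libcudnn.so                 # resolves through .so.7 to libcudnn.so.7.0.5
```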
3. Configure environment variables: Certain environment variables need to be set to ensure proper operation of the CUDA package. This is done by creating the configuration files shown in the commands below. Run these commands on the management node, even though CUDA is not installed there, to facilitate distributing the files, and the CUDA environment variables, to all of the GPU compute nodes in the cluster:
echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/cuda.conf
echo "export CUDA_HOME=/usr/local/cuda" >> /etc/profile.d/cuda.sh
echo "export PATH=/usr/local/cuda/bin:\$PATH" >> /etc/profile.d/cuda.sh
Distribute the configuration files:
xdcp compute /etc/ld.so.conf.d/cuda.conf /etc/ld.so.conf.d/cuda.conf
xdcp compute /etc/profile.d/cuda.sh /etc/profile.d/cuda.sh
Run the commands below on the GPU nodes to determine whether the GPUs can be identified:
psh compute ldconfig
psh compute nvidia-smi
psh compute "cd /root/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery; make; ./deviceQuery" | xcoll
4. Configure the NVIDIA driver and persistence daemon to start automatically:
psh compute rpm -ivh /home/nvidia-diag-driver-local-repo-rhel7-390.46-1.0-1.x86_64.rpm
psh compute yum install -y cuda-drivers
#configuration
psh compute sed -i '/Wants=syslog.target/a\Before=slurmd.service' /usr/lib/systemd/system/nvidia-persistenced.service
psh compute systemctl daemon-reload
psh compute systemctl enable nvidia-persistenced
psh compute systemctl start nvidia-persistenced
2.3.9 Configuring Slurm
Download slurm.conf from https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ to /etc/slurm on the management node, and modify this file according to the instructions in section 7.6. Run the commands below to distribute the configuration:
xdcp all /etc/slurm/slurm.conf /etc/slurm/slurm.conf
xdcp all /etc/munge/munge.key /etc/munge/munge.key
Download gres.conf from https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ to /etc/slurm on the GPU nodes, and follow the instructions in section 7.7 to modify this file as needed. Non-GPU nodes do not need this file. Run the commands below to start the services:
#Start management node service
systemctl enable munge
systemctl enable slurmctld
systemctl restart munge
systemctl restart slurmctld
#Start other node service
psh all systemctl enable munge
psh all systemctl restart munge
psh all systemctl enable slurmd
psh all systemctl restart slurmd
2.3.10 Installing Ganglia
Install Ganglia on the management node by running the commands below:
# install Ganglia
yum install -y ganglia-gmond-ohpc
# Download gmond.conf from
#https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ganglia/management/, and copy it
#to the /etc/ganglia/ directory on the management node, then modify
# the hostname in the /etc/ganglia/gmond.conf file to the management
# node's hostname for the udp_send_channel setting.
echo net.core.rmem_max=10485760 > /usr/lib/sysctl.d/gmond.conf
/usr/lib/systemd/systemd-sysctl gmond.conf
sysctl -w net.core.rmem_max=10485760
# Install Ganglia on compute node
psh all yum install -y ganglia-gmond-ohpc
# Download gmond.conf from https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ganglia/,
# and copy it to the /var/tmp/ directory of the management node, then modify
# the hostname in the /var/tmp/gmond.conf file to the management
# node's hostname for the udp_send_channel setting.
#Distribute configuration
xdcp all /var/tmp/gmond.conf /etc/ganglia/gmond.conf
#Start management node service
systemctl enable gmond
systemctl start gmond
# start other nodes service
psh all systemctl enable gmond
psh all systemctl start gmond
#Run the command to see if all the nodes are listed
gstat -a
2.3.11 Installing MPI
Run the commands below:
yum install -y openmpi3-gnu7-ohpc mpich-gnu7-ohpc mvapich2-gnu7-ohpc
The above commands will install three modules (OpenMPI, MPICH, and MVAPICH) to the system, and the user can use lmod to choose the specific MPI module to be used. OpenHPC provides a module package to set the default module. The following command will set the OpenMPI module as the default:
yum install -y lmod-defaults-gnu7-openmpi3-ohpc
To set the MPICH module as the default, run:
yum install -y lmod-defaults-gnu7-mpich-ohpc
To set the MVAPICH module as the default, run:
yum install -y lmod-defaults-gnu7-mvapich2-ohpc
Here is the table of interconnect support for each MPI type from OpenHPC (x means supported):
                Ethernet(TCP)  InfiniBand  Omni-Path
MPICH           x
MVAPICH2                       x
MVAPICH2(psm2)                             x
OpenMPI         x              x           x
OpenMPI(PMIx)   x              x           x
Note: If you want to use MVAPICH2 (psm2), install mvapich2-psm2-gnu7-ohpc. If you want to use OpenMPI (PMIx), install openmpi3-pmix-slurm-gnu7-ohpc. However, openmpi3-gnu7-ohpc and openmpi3-pmix-slurm-gnu7-ohpc are incompatible with each other, as are mvapich2-psm2-gnu7-ohpc and mvapich2-gnu7-ohpc.
2.3.12 Installing Singularity
Singularity is an HPC-facing lightweight container framework. Run the commands below to install Singularity:
yum install -y singularity-ohpc
Edit the /opt/ohpc/pub/modulefiles/ohpc file. In the "module try-add" block, add the line below as the last line:
module try-add singularity
In the "module del" block, add the line below as the first line:
module del singularity
Run the following command:
source /etc/profile.d/lmod.sh
Note: Changes to /opt/ohpc/pub/modulefiles/ohpc may be lost when the default modules are changed by installing an lmod-defaults-* package. In that case, modify the /opt/ohpc/pub/modulefiles/ohpc file again, or alternatively add "module try-add singularity" to the bottom of /etc/profile.d/lmod.sh.
2.3.13 Checkpoint B
Run the commands below to test whether Slurm is installed correctly:
sinfo
#Output should be
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 1-00:00:00 2 idle c[1-2]
……
The status of all nodes should be 'idle'; 'idle*' (which indicates the node is not responding to the Slurm controller) is not acceptable. Run the commands below to add a test account:
useradd -m test
echo "MERGE:" > syncusers
echo "/etc/passwd -> /etc/passwd" >> syncusers
echo "/etc/group -> /etc/group" >> syncusers
echo "/etc/shadow -> /etc/shadow" >> syncusers
xdcp all -F syncusers
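The four echo lines above assemble an xdcp synchronization file; the resulting syncusers file reads:

```
MERGE:
/etc/passwd -> /etc/passwd
/etc/group -> /etc/group
/etc/shadow -> /etc/shadow
```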
Log in to the test account and run the Slurm distributed test program:
su - test
mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c
srun -n 8 -N 2 --pty /bin/bash
prun ./a.out
#Output should be
Master compute host = c1
Resource manager = slurm
Launch cmd = mpiexec.hydra -bootstrap slurm ./a.out
Hello, world (8 procs total)
--> Process # 0 of 8 is alive. -> c1
--> Process # 4 of 8 is alive. -> c2
--> Process # 1 of 8 is alive. -> c1
--> Process # 5 of 8 is alive. -> c2
--> Process # 2 of 8 is alive. -> c1
--> Process # 6 of 8 is alive. -> c2
--> Process # 3 of 8 is alive. -> c1
--> Process # 7 of 8 is alive. -> c2
Note: After finishing the test, make sure you exit back to the root user of the management node.
3 Installing LiCO Dependencies
3.1 List of LiCO Dependencies to be installed
The installation node fields are expressed as follows: H: Management node, L: Login node, C: Compute node
Software Name     Component Name         Version  Service Name     Installation Node  Notes
rabbitmq          rabbitmq-server        3.6.15   rabbitmq-server  H
postgresql        postgresql-server      9.2.23   postgresql       H
influxdb          influxdb               1.4.2    influxdb         H
confluent         confluent              1.8.1    confluent        H
openldap          slapd-ssl-config       1.0.0    slapd            H
openldap          nss-pam-ldapd          0.8.13   nslcd            H,C,L
openldap          libuser                0.60                      H
openldap          libuser-python         0.60                      H
gmond gpu plugin  gmond-ohpc-gpu-module  1.0.0                     C                  Only needs to be installed on the GPU node
3.2 Installing RabbitMQ
LiCO uses RabbitMQ as a message broker. Run the commands below to install:
#Install RabbitMQ on the management node
yum install -y rabbitmq-server
#Start RabbitMQ service
systemctl enable rabbitmq-server
systemctl start rabbitmq-server
3.3 Installing PostgreSQL
LiCO uses PostgreSQL as an object-relational database for data storage. Run the commands below to install:
#Install PostgreSQL on the management node
yum install -y postgresql-server
#Initialize the database; the password can be changed as needed.
su - postgres
echo <PG_PASSWORD> > /var/tmp/pwfile
initdb -U postgres --pwfile /var/tmp/pwfile /var/lib/pgsql/data
rm /var/tmp/pwfile
exit
#Starting PostgreSQL
systemctl enable postgresql
systemctl start postgresql
#Create LiCO database
export PGPASSWORD=<PG_PASSWORD>
psql -U postgres -c "CREATE DATABASE lico;"
3.4 Installing InfluxDB
LiCO uses InfluxDB as a time series database for storing monitoring data. Run the commands below to install and start it:
#Install InfluxDB
yum install -y influxdb
#Start InfluxDB
systemctl enable influxdb
systemctl start influxdb
Run the following commands to create the InfluxDB database and administrator user:
#Enter the InfluxDB shell
influx
#Create database
create database lico
#Use database
use lico
# Create an administrator user. Note that the password must be a quoted string,
# otherwise an error is reported.
create user <INFLUX_USERNAME> with password '<INFLUX_PASSWORD>' with all privileges
# Exit the InfluxDB shell
exit
#Enable authentication
sed -i '/auth-enabled = false/a\ auth-enabled = true' /etc/influxdb/config.toml
#Restart InfluxDB
systemctl restart influxdb
3.5 Installing Confluent
Run the commands below to install:
yum install -y python2-crypto
yum install -y confluent
# Start confluent
systemctl enable confluent
systemctl start confluent
# Create confluent account
confetty create /users/<CONFLUENT_USERNAME> password=<CONFLUENT_PASSWORD>
If you need to use the web console, refer to the appendix Configuring Confluent web console.
3.6 Configuring user authentication
3.6.1 Installing OpenLDAP-server
OpenLDAP is an open-source version of the lightweight directory access protocol. LiCO recommends using OpenLDAP to manage users; however, it also supports other authentication services compatible with Linux-PAM. If you have already configured OpenLDAP for the cluster, or another authentication service is being used, skip this step. Run the commands below:
#Install OpenLDAP
yum install -y slapd-ssl-config
slapadd -v -l /usr/share/openldap-servers/lico.ldif -f /etc/openldap/slapd.conf -b ${lico_ldap_domain_name}
# set password
# Get the key using the following command and enter <LDAP_PASSWORD> when prompted.
slappasswd
# Edit the file /etc/openldap/slapd.conf, replacing the rootpw value with the hash obtained:
rootpw <ENCTYPT_PASSWORD>
chown -R ldap:ldap /var/lib/ldap
chown ldap:ldap /etc/openldap/slapd.conf
#Edit configuration files
vi /etc/sysconfig/slapd
# Please make sure the next two lines are uncommented
SLAPD_URLS="ldapi:/// ldap:/// ldaps:///"
SLAPD_OPTIONS="-f /etc/openldap/slapd.conf"
#Start OpenLDAP service
systemctl enable slapd
systemctl start slapd
# check service
systemctl status slapd
3.6.2 Installing libuser
The libuser module is a useful toolkit for OpenLDAP. Installing it is optional; however, some commands in this document, such as 'luseradd', are provided by libuser, so installing it is recommended. Run the commands below to install libuser:
yum install -y libuser libuser-python
Configure libuser:
vi /etc/libuser.conf
[import]
login_defs = /etc/login.defs
default_useradd = /etc/default/useradd
[defaults]
crypt_style = sha512
modules = ldap
create_modules = ldap
[userdefaults]
LU_USERNAME = %n
LU_GIDNUMBER = %u
LU_GECOS = %n
# Be sure to modify the option below
LU_HOMEDIRECTORY = /home/%n
LU_SHADOWNAME = %n
LU_SHADOWMIN = 0
LU_SHADOWMAX = 99999
[groupdefaults]
LU_GROUPNAME = %n
[files]
[shadow]
[ldap]
# Change <LDAP_ADDRESS> to the management node IP
server = ldap://<LDAP_ADDRESS>
# Be sure to modify the options below
# <DOMAIN> must be the same as ${lico_ldap_domain_name} defined in lico_env.local
basedn = <DOMAIN>
userBranch = ou=People
groupBranch = ou=Group
binddn = uid=admin,<DOMAIN>
bindtype = simple
[sasl]
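As a concrete illustration, assuming a management node at 192.168.0.1 and ${lico_ldap_domain_name} set to dc=hpc,dc=com (both hypothetical values), the [ldap] section would read:

```ini
[ldap]
server = ldap://192.168.0.1
basedn = dc=hpc,dc=com
userBranch = ou=People
groupBranch = ou=Group
binddn = uid=admin,dc=hpc,dc=com
bindtype = simple
```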
3.6.3 Installing openldap-client
Run the commands below:
echo "TLS_REQCERT never" >> /etc/openldap/ldap.conf
xdcp all /etc/openldap/ldap.conf /etc/openldap/ldap.conf
3.6.4 Installing nss-pam-ldapd
nss-pam-ldapd is a name service switch module and pluggable authentication module. LiCO uses nss-pam-ldapd for user authentication. Install nss-pam-ldapd on the management node by running the commands below:
yum install -y nss-pam-ldapd authconfig
authconfig --useshadow --usemd5 --enablemkhomedir --disablecache --enablelocauthorize \
  --disablesssd --disablesssdauth --enableforcelegacy --enableldap --enableldapauth \
  --disableldaptls --ldapbasedn=${lico_ldap_domain_name} --ldapserver="ldap://${sms_name}" --updateall
echo "rootpwmoddn uid=admin,${lico_ldap_domain_name}" >> /etc/nslcd.conf
#Start management node service
systemctl enable nslcd
systemctl start nslcd
Install nss-pam-ldapd on other nodes, run the commands below:
psh all yum install -y nss-pam-ldapd authconfig
psh all authconfig --useshadow --usemd5 --enablemkhomedir --disablecache --enablelocauthorize \
  --disablesssd --disablesssdauth --enableforcelegacy --enableldap --enableldapauth \
  --disableldaptls --ldapbasedn="${lico_ldap_domain_name}" --ldapserver="ldap://${sms_name}" --updateall
psh all echo "\""rootpwmoddn uid=admin,${lico_ldap_domain_name}"\"" \>\> /etc/nslcd.conf
#Start other node services
psh all systemctl enable nslcd
psh all systemctl start nslcd
3.7 Installing Gmond GPU Plug-In
On all GPU nodes, run the commands below to install:
psh compute yum install -y gmond-ohpc-gpu-module
psh compute "ls /etc/ganglia/conf.d/*.pyconf|grep -v nvidia|xargs rm"
# Start gmond
psh compute systemctl restart gmond
4 Installing LiCO
4.1 List of LiCO Components to be installed
The installation node fields are described as follows: H: Management node, L: Login node, C: Compute node
Software Name            Component Name        Version  Service Name         Installation Node  Notes
lico-core                lico-core             5.1.0    lico                 H
lico-portal              lico-portal           5.1.0                         H,L
lico-core-extend         lico-confluent-proxy  1.0.0                         H
lico-core-extend         lico-vnc-proxy        1.0.0                         H
lico-core-extend         lico-ai-image         1.1.0                         H
lico-core-extend         lico-env              1.0.0    lico-env             H,C,L
lico-core-extend         lico-ai-expert        1.1.0                         C                  Only for AI functions
lico monitor             lico-ganglia-mond     1.0.0    lico-ganglia-mond    H
lico monitor             lico-confluent-mond   1.0.0    lico-confluent-mond  H
lico monitor             lico-vnc-mond         1.0.0    lico-vnc-mond        C                  Install if you need to run VNC
lico alarm notification  lico-sms-agent        1.1.0    lico-sms-agent       L                  Install if you need to send alerts via SMS
lico alarm notification  lico-wechat-agent     1.1.0    lico-wechat-agent    L                  Install if you need to send alerts via WeChat
lico alarm notification  lico-mail-agent       1.2.0    lico-mail-agent      L                  Install if you need to send alerts via email
4.2 Getting the LiCO Installation Package
Please obtain the LiCO release package from the Lenovo ESD website (https://lenovoesd.flexnetoperations.com/control/lnvo/login). Please contact your Lenovo salesperson about how to subscribe and obtain ESD authentication.
The LiCO 5.1.0 release package for EL7 is lico-release-5.1.0.el7.tar.gz. Upload the release package to the management node.
4.3 Configuring the Local Yum Repository for LiCO
Run the commands below to configure the local Yum repository for the management node:
mkdir -p $lico_repo_dir
tar zxvf lico-release-5.1.0.el7.tar.gz -C $lico_repo_dir --strip-components 1
cd $lico_repo_dir
./Makerepo
Run the commands below to configure the local Yum repository for the other nodes:
cp /etc/yum.repos.d/lico-release.repo /var/tmp
sed -i '/baseurl=/d' /var/tmp/lico-release.repo
echo "baseurl=http://${sms_name}/${lico_repo_dir}/RPMS" >> /var/tmp/lico-release.repo
#Distribute repo files
xdcp all /var/tmp/lico-release.repo /etc/yum.repos.d/
4.4 Installing the Management Node
Run the commands below to install the LiCO module on the management node:
yum install -y lico-core lico-mond lico-confluent-proxy lico-ai-expert lico-env lico-ai-image
If you need to provide web service on the management node, run the commands below:
yum install -y lico-portal
If you need to provide email, SMS, and WeChat service on the management node, run the commands below:
#Install email module
yum install -y lico-mail-agent
#Install SMS module
yum install -y lico-sms-agent
#Install WeChat module
yum install -y lico-wechat-agent
If you need to use the VNC component, run the following command:
yum install -y lico-vnc-proxy
4.5 Installing the Login Node
Run the commands below to install the LiCO module on the login node:
psh login yum install -y lico-env
If you need to provide web service on the login node, run the commands below:
psh login yum install -y lico-portal
If you need to provide email, SMS, and WeChat services on the login node, run the commands below:
#Install email module
psh login yum install -y lico-mail-agent
#Install SMS module
psh login yum install -y lico-sms-agent
#Install WeChat module
psh login yum install -y lico-wechat-agent
If you want to provide a basic compiling environment on the login node, it is recommended to run the commands below:
psh login yum groupinstall -y "Development Tools"
psh login yum install -y glibc-devel
Note: This is an optional step. To install these packages successfully, an internet-based repository may be needed. Setting up a compiling environment is out of this document's scope; configure the repository according to your network conditions.
4.6 Installing the Compute Node
Run the commands below to install the LiCO module on the compute node:
psh compute yum install -y lico-env lico-ai-expert
If you need to use the VNC component, please refer to the appendix Configuring VNC.
5 Configuring LiCO
5.1 Configuring Service Account
In the management node, use the tool lico-passwd-tool. Follow the prompts to enter username and password for PostgreSQL, InfluxDB and Confluent to complete the configuration.
lico-passwd-tool
# Please fill in the following input according to the actual configuration
Please enter the postgres username:
Please enter the postgres password:
Please confirm the postgres password:
Please enter the influxdb username:
Please enter the influxdb password:
Please confirm the influxdb password:
Please enter the confluent username:
Please enter the confluent password:
Please confirm the confluent password:
5.2 Configuring Cluster Nodes
Before using LiCO, follow these steps to import cluster information to the system. Run the commands below:
cp /etc/lico/nodes.csv.example /etc/lico/nodes.csv
Edit cluster information file:
/etc/lico/nodes.csv
We recommend downloading this file to a local computer and editing it using Excel or other table-editing software. After you are finished, upload it to the management node and overwrite the original file. The cluster information file comprises the following six parts:
5.2.1 Room Information
Room Information Table:
Enter only one piece of server room information in the fields below:
name Room Name
location_description Room Description
5.2.2 Logic Group Information
Managers can use logic groups to divide the nodes in the cluster into groups. Logic groups do not impact the use of computing resources or permission configurations. Logic Group Information Table:
Enter at least one logic group in the fields below:
name Logic Group Name
5.2.3 Room Row Information
Room row refers to the order of racks in the room; enter information for the rack rows in which the cluster nodes are located. Row Information Table:
Enter at least one piece of row information in the fields below:
name Row Name (Cannot be repeated in the same room)
index Row Order (Must be a positive integer and cannot be repeated in the same room)
belonging_room Room Location (Add the configuration name to the room information table)
5.2.4 Rack Information
Input rack information for the cluster node location. The rack information table is below:
Enter at least the information of one rack in the fields below:
name Rack Name (Cannot be repeated in the same room)
column Rack Location Column (Must be a positive integer and cannot be repeated in the same row)
belonging_row Rack Location Row Name (Add the configuration name to the row information table)
5.2.5 Chassis Information
If there is a chassis in the cluster, enter the chassis information. The chassis information table is below:
The fields are described as follows:
name Chassis Name (Cannot be repeated in the same room)
belonging_rack Rack Location Name (Use the name of the configuration in the rack information table.)
location_u_in_rack The location of the chassis base in the rack (Unit: u). In a standard cabinet, the value should be between 1 and 42.
machine_type Chassis Type (Can use model number. See appendix of Chassis Model List).
5.2.6 Node Information
Enter information for all nodes in the cluster into the node information table.
The fields are described as follows:
name Node hostname (a domain name is not needed)
nodetype Node type, one of: head (management node), login (login node), compute (compute node)
immip IP address of the node’s BMC system.
hostip IP address of the node on the host network.
machine_type Product name for the node. (For available product names, see appendix Product List).
ipmi_user XCC (BMC) Account for the Node
ipmi_pwd XCC (BMC) Password for the Node
belonging_service_node Large clusters require setting up a service node to which the node belongs. If there is no service node, leave the field blank.
belonging_rack Node Location Rack Name (Add the configuration name to the rack information table)
belonging_chassis Node Location Chassis Name (Leave blank if the node is not located in any chassis.) Configure the chassis name in the chassis information table.
location_u Node Location: If the node is located in the chassis, enter the slot in the chassis in which the node is located; If the node is located in a rack, enter the location of the node base in the rack (Unit: u).
width Node Width (Full: 1, Half: 0.5)
height Node Height (Unit: u)
groups Node Location Logic Group Name (A node can belong to multiple logic groups. Group names should be separated by “;”.) Configure the logic group name in the logic group information table.
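As a hedged illustration of the node-information fields only (the values below are hypothetical; the authoritative layout, including the other five sections, is the /etc/lico/nodes.csv.example template):

```csv
name,nodetype,immip,hostip,machine_type,ipmi_user,ipmi_pwd,belonging_service_node,belonging_rack,belonging_chassis,location_u,width,height,groups
c1,compute,10.240.0.11,192.168.0.11,<PRODUCT_NAME>,<XCC_USER>,<XCC_PASSWORD>,,rack1,,5,1,1,group1;group2
```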
5.3 Configuring LiCO Services
The LiCO service configuration file is located in:
/etc/lico/lico.ini
This configuration file controls the operating parameters for the various LiCO background service components. Modify it based on your needs with reference to the instructions below. If you change the configuration while LiCO is running, restart LiCO for the configuration to take effect:
systemctl restart lico
Settings not covered in the configuration instructions below should only be modified after consulting with service staff. Modifications made without such consultation could result in the system failing to run normally.
5.3.1 Infrastructure Configuration
The following parts of the infrastructure configuration are modifiable:
#Cluster domain settings
domain = hpc.com
5.3.2 Database Configuration
The following parts of the database configuration are modifiable:
#PostgreSQL address
db_host = 127.0.0.1
#PostgreSQL port
db_port = 5432
#PostgreSQL database name
db_name = lico
#InfluxDB address
influx_host = 127.0.0.1
#InfluxDB port
influx_port = 8086
#InfluxDB database name
influx_database = lico
5.3.3 Login Configuration
The following parts of the login configuration are modifiable:
#Maximum number of login password error attempts
login_fail_max_chance = 3
If a user fails to log in more than login_fail_max_chance times, the system suspends the user for 45 minutes; a suspended user cannot log in even with valid credentials. An administrator can resume a suspended user from the command line or the web portal; refer to Resume user or the LiCO Administrator Guide.
5.3.4 Storage Configuration
The following parts of the storage configuration are modifiable:
#Shared storage directory
#If strictly adhering to the shared directory configurations in this document, keep
#share_dir = /home
share_dir = /home
5.3.5 Scheduler Configuration
The following parts of the scheduler configuration are modifiable:
#The scheduler configuration currently supports Slurm, LSF, and Torque. Slurm is the default.
scheduler_software = slurm
5.3.6 Alert Configuration
The following parts of the alert configuration are modifiable:
#WeChat proxy server address
wechat_agent_url = http://127.0.0.1:18090
#WeChat notification template ID
wechat_template_id = <WECHAT_TEMPLATE_ID>
#SMS proxy server address
sms_agent_url = http://127.0.0.1:18092
#Email proxy server address
mail_agent_url = http://127.0.0.1:18091
The above settings only need to be configured if the WeChat, SMS, and email proxy modules are installed for the cluster. Obtain the <WECHAT_TEMPLATE_ID> from the following website: https://mp.weixin.qq.com/wiki?t=resource/res_main&id=mp1445241432
5.3.7 Cluster Configuration
The following parts of the cluster configuration are modifiable:
#Confluent port
confluent_port = 4005
5.3.8 Functional Configuration
The following parts of the functional configuration are modifiable:
[app:django]
#For the functional module used, modify based on the actual module purchased.
#If only using the HPC module, change to: use = hpc
#If only using the AI module, change to: use = ai
#After changing the configuration, you must run lico init to refresh the data tables.
use = hpc+ai
5.4 Configuring LiCO Components
5.4.1 lico-vnc-mond
Create file /var/tmp/vnc-mond.ini and add following configuration:
[vnc]
url=http://127.0.0.1:18083/session
timeout=30
Note: Change 127.0.0.1 to the actual IP of the management node.
Distribute the configuration file:
xdcp compute /var/tmp/vnc-mond.ini /etc/lico/vnc-mond.ini
5.4.2 lico-env
Configure the SSH PAM entries by running the following commands:
psh compute echo "\""auth required pam_python.so pam_lico.py --url=http://${sms_name}:18080 --timeout=40 --ignore_conn_error"\"" \>\> /etc/pam.d/sshd
psh compute echo "\""account required pam_python.so pam_lico.py --url=http://${sms_name}:18080 --timeout=40 --ignore_conn_error"\"" \>\> /etc/pam.d/sshd
Configure the su PAM entries by running the following commands:
psh compute echo "\""auth required pam_python.so pam_lico.py --url=http://${sms_name}:18080 --timeout=40 --ignore_conn_error"\"" \>\> /etc/pam.d/su
psh compute echo "\""account required pam_python.so pam_lico.py --url=http://${sms_name}:18080 --timeout=40 --ignore_conn_error"\"" \>\> /etc/pam.d/su
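After the commands above run, each compute node's /etc/pam.d/sshd and /etc/pam.d/su should end with lines like the following (head01 stands in for whatever ${sms_name} expands to in your environment):

```
auth required pam_python.so pam_lico.py --url=http://head01:18080 --timeout=40 --ignore_conn_error
account required pam_python.so pam_lico.py --url=http://head01:18080 --timeout=40 --ignore_conn_error
```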
5.4.3 lico-portal
On nodes where the lico-portal module is installed, modify the files below if the external web services need to use different ports to prevent conflicts. Edit the file /etc/nginx/nginx.conf and change the port to 8080:
listen 8080 default_server;
listen [::]:8080 default_server;
You can also change the default HTTPS port 443 to another port by modifying the file /etc/nginx/conf.d/https.conf:
listen <port> ssl http2;
Note: make sure the port is not used by other application and it is not blocked by the firewall.
Edit the file /etc/nginx/conf.d/sites-available/antilles.conf and change the first line:
set $lico_host 127.0.0.1;
Change 127.0.0.1 to the management node IP if lico-portal does not run on the management node.
Editing the file /etc/lico/portal.conf allows you to add custom shortcut links; for the configuration format, refer to the file /etc/lico/portal.conf.example.
If you want to hide the server version information, edit the file /etc/nginx/nginx.conf and add server_tokens off in the http block, as in the following example:
http{
......
sendfile on;
server_tokens off;
……
}
5.4.4 lico-ganglia-mond
Edit the file /etc/lico/ganglia-mond.conf: change cfg_db_host 127.0.0.1 and cfg_db_port 5432 to the actual PostgreSQL service, and change host 127.0.0.1 and port 8086 to the actual InfluxDB service. If you followed this document, the configuration file on the management node with default ports should read as follows:
influxdb {
cfg_db_host 127.0.0.1
cfg_db_port 5432
cfg_db_name lico
host 127.0.0.1
port 8086
database lico
timeout 10
}
5.4.5 lico-confluent-proxy
Edit /etc/lico/confluent-proxy.ini and change the database section as follows:
[DEFAULT]
# database
db_host = 127.0.0.1
db_port = 5432
db_name = lico
Change db_host = 127.0.0.1 and db_port = 5432 to the actual PostgreSQL service. If you followed this document, PostgreSQL is installed on the management node with the default port. If there are multiple Confluent instances in the cluster, you also need to configure the [app:main] section as follows:
[app:main]
use = cluster-confluent-proxy
If you need to change information about the Confluent user, refer to Installing Confluent, create or change the user information, and update the information according to the steps shown in Configuring Service Account.
5.4.6 lico-confluent-mond
Edit file /etc/lico/confluent-mond.ini: Change db_host = 127.0.0.1 and db_port = 5432 to the actual PostgreSQL service. Change host = 127.0.0.1 and port = 8086 to the actual InfluxDB service. If you follow this document, they should be installed on management node with default port.
[database]
db_host = 127.0.0.1
db_port = 5432
db_name = lico
[influxdb]
host = 127.0.0.1
port = 8086
database = lico
timeout = 10
5.4.7 lico-wechat-agent
Edit file /etc/lico/wechat-agent as follows:
#The configurations below should be changed based on the specific environment
appid = <APPID>
secret = <SECRET>
Refer to the following page to obtain <APPID> and <SECRET>: https://mp.weixin.qq.com/wiki?t=resource/res_main&id=mp1445241432
5.5 Initializing the System
Run the commands below to initialize LiCO:
lico init
5.6 Initializing Users
Run the commands below to initialize the LiCO admin user. This adds an LDAP user with the given username and password. You can change the username and password as needed; if you do not want to use LDAP, you can skip this step.
luseradd <HPC_ADMIN_USERNAME> -P <HPC_ADMIN_PASSWORD>
psh all "su - <HPC_ADMIN_USERNAME> -c whoami" | xcoll
The luseradd command prompts you for the LDAP administrator password; enter the <LDAP_PASSWORD> you configured in section 3.6.1. Then run the following command to import the user into LiCO:
#Import user into LiCO and as admin
lico user_import -u <HPC_ADMIN_USERNAME> -r admin
5.7 Importing System Images
Obtain images from your salesperson and refer to the appendix Import images into LiCO as system level image. Alternatively, you can create images yourself: refer to the appendix Create image, and then to the appendix Import images into LiCO as system level image.
6 Starting LiCO
Run the commands below to start LiCO:
#If the management node has to provide web service, start Nginx.
systemctl enable nginx
systemctl start nginx
#If the login node has to provide web service, start Nginx.
psh login systemctl enable nginx
psh login systemctl start nginx
#Start LiCO-related services
systemctl start lico-ganglia-mond
systemctl start lico-confluent-mond
#Start LiCO
systemctl start lico
After LiCO starts, delete the file lico_env.local by running the following command:
rm -rf /root/lico_env.local
After the LiCO service has started, you can access LiCO through a web browser at https://<ip of login node>:<port>/ (the port is what you set in /etc/nginx/conf.d/https.conf in section 5.4.3). If the installation is correct, you will see the LiCO login page. Use the LDAP account created in section 5.6 to log in to LiCO.
7 Appendix
7.1 Configuring VNC
This module only needs to be installed on compute nodes that require VNC functionality. Run the following commands on each compute node on which you want to install VNC:
yum install -y gdm tigervnc tigervnc-server
yum install -y lico-vnc-mond
Edit /etc/gdm/custom.conf on these compute nodes, and make the following changes:
[xdmcp]
Enable=true
Run these commands on the compute nodes to start VNC:
systemctl start lico-vnc-mond
vncserver -query localhost -securitytypes=none
If you need to install VNC on all compute nodes, you can use the batch commands below.
# install
psh compute yum install -y lico-vnc-mond
psh compute yum install -y gdm tigervnc tigervnc-server
# Distribution profile
xdcp compute /etc/gdm/custom.conf /etc/gdm/custom.conf
# start
psh compute systemctl start lico-vnc-mond
psh compute vncserver -query localhost -securitytypes=none
7.2 Configuring Confluent web console
If you want to open a node's console from the LiCO web portal, configure that node using the steps below. After the configuration is complete, restart the node for it to take effect.
7.2.1 RHEL
Edit the file /etc/default/grub, and append the following field to the end of the GRUB_CMDLINE_LINUX line:
console=ttyS0,115200
For UEFI boot mode, run the following command to regenerate the GRUB configuration:
grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
For legacy boot mode, run the following command:
grub2-mkconfig -o /boot/grub2/grub.cfg
7.2.2 CentOS
Edit the file /etc/default/grub, and append the following field to the end of the GRUB_CMDLINE_LINUX line:
console=ttyS0,115200
For UEFI boot mode, run the following command to regenerate the GRUB configuration:
grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
For legacy boot mode, run the following command:
grub2-mkconfig -o /boot/grub2/grub.cfg
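The GRUB edit above (for either distribution) can be done non-interactively. This is a sketch run against a sample file; grub.sample and the sample contents are placeholders, and on a real node the target would be /etc/default/grub.

```shell
# Sketch: append the serial-console field to GRUB_CMDLINE_LINUX in a
# sample grub file. grub.sample is a placeholder for /etc/default/grub.
GRUB_FILE=grub.sample
cat << 'eof' > "$GRUB_FILE"
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"
eof
# Add console=ttyS0,115200 inside the existing quotes, once only
grep -q 'console=ttyS0' "$GRUB_FILE" || \
  sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"$/\1 console=ttyS0,115200"/' "$GRUB_FILE"
```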
7.3 LiCO commands
7.3.1 Set the LDAP administrator password
Note: this command works only when use_libuser = true is set in the file lico.ini.
lico setldappasswd
Please input your ldap password:
Please confirm the ldap password:
7.3.2 Change user’s role
lico user_changerole -u <ROLE_USERNAME> -r admin
Parameter interpretation:
-u Specify the username to modify
-r Specify the role to be set (admin/operator/user)
7.3.3 Resume user
lico user_resume <SUSPENDED_USERNAME>
Parameter interpretation:
The suspended username is specified directly as an argument.
7.3.4 Import user
Refer to Initializing Users.
7.3.5 Import AI image
Refer to Importing System Images.
7.4 Cluster Service Summary
Software Name   Component Name         Service Name          Default Port          Installation Node
lico            lico-core              lico                  18080/tcp             H
lico            lico-ganglia-mond      lico-ganglia-mond     8661/tcp,8662/tcp     H
lico            lico-confluent-proxy   -                     18081/tcp             H
lico            lico-confluent-mond    lico-confluent-mond   -                     H
lico            lico-vnc-proxy         -                     18082/tcp,18083/tcp   C
lico            lico-vnc-mond          lico-vnc-mond         -                     C
lico            lico-sms-agent         lico-sms-agent        18092/tcp             L
lico            lico-wechat-agent      lico-wechat-agent     18090/tcp             L
lico            lico-mail-agent        lico-mail-agent       18091/tcp             L
lico dependent  nginx                  nginx                 80/tcp,443/tcp        L|H
lico dependent  rabbitmq               rabbitmq-server       5672/tcp              H
lico dependent  postgresql             postgresql            5432/tcp              H
lico dependent  confluent              confluent             4005/tcp,13001/tcp    H
lico dependent  influxdb               influxdb              8086/tcp,8088/tcp     H
cluster         ldap                   slapd                 389/tcp,636/tcp       H
cluster         ldap                   nslcd                 -                     H,C,L
cluster         nfs                    nfs                   2049/tcp              H
cluster         ntp                    ntpd                  -                     H
cluster         slurm                  munge                 -                     H,C
cluster         slurm                  slurmctld             6817/tcp              H
cluster         slurm                  slurmd                6818/tcp              C
cluster         ganglia                gmond                 8649/tcp,8649/udp     H,C,L
Installation node: H = management node, C = compute nodes, L = login nodes.
7.5 Security improvement
7.5.1 Binding setting
If you installed the system following this document, some components listen on all addresses by default. To improve the system security level, we recommend that you change the default settings as described below.
RabbitMQ
We recommend binding to the loopback address (127.0.0.1). Edit /etc/rabbitmq/rabbitmq.config and remove {"::1", 5672}, for example:
[
{
rabbit,
[
{
tcp_listeners, [{"127.0.0.1", 5672}]
}
]
}
]
PostgreSQL
PostgreSQL binds to the loopback address (127.0.0.1) by default. We do not recommend changing the default setting.
Confluent
Confluent binds to the loopback address (127.0.0.1) by default. We do not recommend changing the default setting.
InfluxDB
We recommend binding to the loopback address (127.0.0.1). Edit /etc/influxdb/config.toml, uncomment the line #bind-address = ":8086" in the [http] section, and change it to bind-address = "127.0.0.1:8086", for example:
[http]
#Determines whether HTTP endpoint is enabled.
#enabled = true
#The bind address used by the HTTP service.
bind-address = "127.0.0.1:8086"
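The uncomment-and-rewrite step above can be done with a single sed command. The sketch below uses a sample file; config.sample.toml is a placeholder for /etc/influxdb/config.toml on the management node.

```shell
# Sketch: uncomment and rewrite the bind-address line in a sample copy
# of the InfluxDB config. config.sample.toml is a placeholder.
TOML=config.sample.toml
cat << 'eof' > "$TOML"
[http]
  # bind-address = ":8086"
eof
# Drop the comment marker and bind to the loopback address
sed -i 's|^\([[:space:]]*\)#[[:space:]]*bind-address = ":8086"|\1bind-address = "127.0.0.1:8086"|' "$TOML"
```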
lico-core
We recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-core, we recommend binding to the loopback address. Edit /etc/lico/supervisor.d/antilles.ini and, in the command parameter of the antilles program, change "--bind :18080" to "--bind <INTERNAL_IP>:18080", for example:
[program:antilles]
command=/usr/bin/gunicorn --paste /etc/lico/antilles.ini --bind 172.20.0.14:18080 --
log-config /etc/lico/antilles.ini --workers 1 --threads 50 --timeout 3600 --worker-class
gevent --keep-alive 65 --log-level info --access-logfile - --error-logfile - --capture-output
lico-ganglia-mond
The default setting trusts only the loopback address (127.0.0.1). We do not recommend changing the default setting.
lico-confluent-proxy
We recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-confluent-proxy, we recommend binding to the loopback address. Edit /etc/lico/supervisor.d/confluent-proxy.ini and, in the command parameter of the confluent_proxy program, change "--bind :18081" to "--bind <INTERNAL_IP>:18081", for example:
[program:confluent_proxy]
command=/usr/bin/gunicorn --paste /etc/lico/confluent-proxy.ini --bind 172.20.0.14:18081 --
log-config /etc/lico/confluent-proxy.ini --workers 1 --threads 50 --timeout 3600 --worker-class
gevent --keep-alive 65 --log-level info --access-logfile - --error-logfile - --capture-output
lico-vnc-proxy
We recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-vnc-proxy, we recommend binding to the loopback address. Edit /etc/lico/supervisor.d/vncproxy.ini and, in the command parameter of the vncproxy program, change "--bind :18083" to "--bind <INTERNAL_IP>:18083". The IP in the websockify parameter --token-source also needs to be changed to <INTERNAL_IP>, for example:
[program:vncproxy]
command=/usr/bin/gunicorn --paste /etc/lico/vnc-proxy.ini --bind 172.20.0.14:18083 --log-config
/etc/lico/vnc-proxy.ini --workers 1 --timeout 3600 --worker-class gevent --keep-alive 65 --log-
level info --access-logfile - --error-logfile - --capture-output
……
[program:websockify]
command=/usr/bin/websockify 18082 --token-plugin=JSONTokenApi --token-
source='http://172.20.0.14:18083/lookup?token=%s'
lico-wechat-agent
We recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-wechat-agent, we recommend binding to the loopback address. Edit /etc/sysconfig/lico-wechat-agent and, in GUNICORN_CMD_ARGS, change "--bind :18090" to "--bind <INTERNAL_IP>:18090", for example:
# lico-wechat-agent environment file
GUNICORN_CMD_ARGS= \
--bind 172.20.0.14:18090 \
--log-config /etc/lico/wechat-agent.ini \
--workers 1 \
--threads 4 \
--worker-class gevent \
--timeout 3600 \
--keep-alive 65 \
--log-level info \
--access-logfile - \
--error-logfile - \
--capture-output True
lico-mail-agent
We recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-mail-agent, we recommend binding to the loopback address. Edit /etc/sysconfig/lico-mail-agent and, in GUNICORN_CMD_ARGS, change "--bind :18091" to "--bind <INTERNAL_IP>:18091", for example:
# lico-mail-agent environment file
GUNICORN_CMD_ARGS= \
--bind 172.20.0.14:18091 \
--log-config /etc/lico/mail-agent.ini \
--workers 1 \
--threads 4 \
--worker-class gevent \
--timeout 3600 \
--keep-alive 65 \
--log-level info \
--access-logfile - \
--error-logfile - \
--capture-output True
lico-sms-agent
We recommend binding to an internal address; if there are no login nodes in the cluster and lico-portal is installed on the same node as lico-sms-agent, we recommend binding to the loopback address. Edit /etc/sysconfig/lico-sms-agent and, in GUNICORN_CMD_ARGS, change "--bind :18092" to "--bind <INTERNAL_IP>:18092", for example:
# lico-sms-agent environment file
GUNICORN_CMD_ARGS= \
--bind 172.20.0.14:18092 \
--log-config /etc/lico/sms-agent.ini \
--workers 1 \
--timeout 3600 \
--keep-alive 65 \
--log-level info \
--access-logfile - \
--error-logfile - \
--capture-output True
7.5.2 Firewall setting
For system security, we recommend that you enable the firewall on the management node and login nodes. If you set up the cluster and installed LiCO following this document, you can follow the steps below to set up your firewall. We also recommend referring to the official firewall setup documentation: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/sec-configuring_firewalld
Run the commands below to install and enable the firewall:
yum install -y firewalld
systemctl enable firewalld
systemctl start firewalld
Management node
Run the commands below to add rules to the public zone:
#Add SSH service port
firewall-cmd --zone=public --add-port=22/tcp --permanent
#Add httpd service port
firewall-cmd --zone=public --add-port=80/tcp --permanent
#Add NFS service port
firewall-cmd --zone=public --add-port=2049/tcp --permanent
#Add Ganglia gmond port
firewall-cmd --zone=public --add-port=8649/udp --permanent
#Add Slurm slurmctld port
firewall-cmd --zone=public --add-port=6817/tcp --permanent
#Add OpenLDAP slapd port
firewall-cmd --zone=public --add-port=636/tcp --permanent
firewall-cmd --zone=public --add-port=389/tcp --permanent
#Add lico-confluent-proxy port
firewall-cmd --zone=public --add-port=18081/tcp --permanent
#Add lico-core port
firewall-cmd --zone=public --add-port=18080/tcp --permanent
#Add TensorBoard random binding port range
firewall-cmd --zone=public --add-port=20000-25000/tcp --permanent
Run the commands below to add the internal and external network interfaces to the public zone:
firewall-cmd --zone=public --add-interface=eth0 --permanent
firewall-cmd --zone=public --add-interface=eth1 --permanent
Note: eth0 and eth1 should be your internal and external network interfaces. Run the command below to apply the rules:
firewall-cmd --complete-reload
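The per-port commands above all follow one pattern, so they can be generated from a list. The sketch below is a dry run that only prints the commands; remove the echo to apply them for real, and adjust the port list to your deployment.

```shell
# Dry run: print the firewall-cmd invocations for the management node.
# Remove 'echo' to actually apply the rules (requires firewalld).
for port in 22/tcp 80/tcp 2049/tcp 8649/udp 6817/tcp 636/tcp 389/tcp \
            18081/tcp 18080/tcp 20000-25000/tcp; do
  echo firewall-cmd --zone=public --add-port=$port --permanent
done > fw_mgmt_rules.txt
cat fw_mgmt_rules.txt
```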
Login node
Run the commands below to add rules to the public zone:
#Add SSH service port
firewall-cmd --zone=public --add-port=22/tcp --permanent
#Add Nginx service port, you can adjust 8443 to your setting
firewall-cmd --zone=public --add-port=8443/tcp --permanent
Run the commands below to add the internal and external network interfaces to the public zone:
firewall-cmd --zone=public --add-interface=eth0 --permanent
firewall-cmd --zone=public --add-interface=eth1 --permanent
Note: eth0 and eth1 should be your internal and external network interfaces. Run the command below to apply the rules:
firewall-cmd --complete-reload
7.6 slurm.conf
This section describes common modifications to slurm.conf. Cluster name:
ClusterName=mycluster
Management Node Name:
ControlMachine=c031
GPU Scheduling: this entry is used when there are GPU nodes in the cluster. If there are no GPU nodes, delete this entry.
GresTypes=gpu,gpu_mem
Cluster Node Definitions: NodeName shows the node name. Gres shows the number of GPUs and GPU graphics memory size in every node (if this is not a GPU node, then delete the Gres content). CPUs shows the number of CPUs in a node. RealMemory shows the node memory size (Unit: M).
NodeName=c031 Gres=gpu:4,gpu_mem:10000 CPUs=28 RealMemory=200000 State=UNKNOWN
NodeName=c032 Gres=gpu:4,gpu_mem:10000 CPUs=28 RealMemory=200000 State=UNKNOWN
Partition Definitions: PartitionName shows the name of the partition. Nodes shows the nodes in the partition. Default shows whether this partition is the default partition; when a user submits a job without designating a partition, the default partition is used.
PartitionName=compute Nodes=c0[31-32] Default=YES MaxTime=INFINITE State=UP
PartitionName=compute1 Nodes=c0[31-32] Default=NO MaxTime=INFINITE State=UP
EnforcePartLimits Definitions: if you want a job that requests more resources than the cluster can provide to fail immediately instead of remaining in the queue, use the configuration below:
EnforcePartLimits=ALL
For more detail about how to configure slurm.conf, refer to the Slurm official site:
https://slurm.schedmd.com/slurm.conf.html
7.7 gres.conf
This file describes the GPUs installed and the GPU memory on the GPU nodes. The content of the gres.conf file may vary based on the GPU node. The Count attribute of the gpu_mem setting shows the amount of GPU memory per GPU (unit: MB). For example:
Name=gpu File=/dev/nvidia[0-3]
Name=gpu_mem Count=10000
7.8 Chassis Model List
Model Code   Model          Number of Slots
d2           D2 Enclosure   4
7.9 Product List
Product Name   Corresponding Machine   Appearance
sd530          SD530                   0.5U rack form factor
sr630          SR630                   1U
sr650          SR650                   2U
7.10 Import system image
System-level container images can be used by all users in the cluster. The following steps show how to create and import system-level container images.
7.10.1 Create image
LiCO is released with image bootstrap files for commonly used AI frameworks. A bootstrap file is similar to a Dockerfile; you can use these files to create images. They are located under /opt/lico/examples/image/ on the management node. The available bootstrap files are:
Framework     Version   CPU/GPU    Comments
Caffe         1.0       CPU
Caffe         1.0       CUDA 9.1   Supports P100 and V100. Caffe does not officially support CUDA 9, so we changed the Caffe makefile.
TensorFlow    1.6       CPU
TensorFlow    1.6       CUDA 9.0   Supports P100 and V100. TensorFlow does not officially support CUDA 9.1, so we use CUDA 9.0.
Neon          2.4       CPU
Intel-Caffe   1.0.4     CPU
MXNet         1.1       CPU
MXNet         1.1       CUDA 9.0   Supports P100 and V100. MXNet does not officially support CUDA 9.1, so we use CUDA 9.0.
Note: if there are no GPU nodes in the cluster, you can only create CPU images.
Note: the GPU driver version on the cluster nodes should be 390.46.
Prepare one build node with at least 100 GB of free storage. This node should be able to access the internet, and should have the same OS version and the same Singularity version (2.4, https://github.com/singularityware/singularity/releases/tag/2.4) as the nodes in the cluster. If you want to create GPU images, this node should also have the same GPU and GPU driver as the nodes in the cluster. Copy the bootstrap files from the management node to this build node, for example to a new directory /opt/images (note: this directory and /var/tmp cannot be on an NFS mount), and build the images. Ensure that squashfs-tools is installed.
cd /opt/images/
singularity build caffe-1.0-cpu.image caffe/caffe-1.0-cpu
singularity build caffe-1.0-gpu-cuda91.image caffe/caffe-1.0-gpu-cuda91
singularity build tensorflow-1.6-cpu.image tensorflow/tensorflow-1.6-cpu
singularity build tensorflow-1.6-gpu-cuda90.image tensorflow/tensorflow-1.6-gpu-cuda90
singularity build mxnet-1.1-cpu.image mxnet/mxnet-1.1-cpu
singularity build mxnet-1.1-gpu-cuda90.image mxnet/mxnet-1.1-gpu-cuda90
singularity build intel-caffe-1.0.4-cpu.image intel-caffe/intel-caffe-1.0.4-cpu
singularity build neon-2.4-cpu.image neon/neon-2.4-cpu
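Since each build command maps a bootstrap file <dir>/<name> to an image <name>.image, the list above can be scripted. This dry-run sketch only prints each build command; remove the echo (and run it from /opt/images on the build node) to build for real.

```shell
# Dry run: print the singularity build command for each bootstrap file.
# Remove 'echo' and run from /opt/images on the build node to build.
for spec in caffe/caffe-1.0-cpu caffe/caffe-1.0-gpu-cuda91 \
            tensorflow/tensorflow-1.6-cpu tensorflow/tensorflow-1.6-gpu-cuda90 \
            mxnet/mxnet-1.1-cpu mxnet/mxnet-1.1-gpu-cuda90 \
            intel-caffe/intel-caffe-1.0.4-cpu neon/neon-2.4-cpu; do
  name=$(basename "$spec")
  echo singularity build "$name.image" "$spec"
done > build_cmds.txt
cat build_cmds.txt
```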
7.10.2 Import images into LiCO as system level image
Copy the created images to the management node, for example to the directory /opt/images (note: this directory and /var/tmp cannot be on an NFS mount). Then, as the root user, run the following commands to import the images into LiCO:
cd /opt/images
lico import_system_image caffe-cpu $PWD/caffe-1.0-cpu.image singularity caffe
lico import_system_image caffe-gpu $PWD/caffe-1.0-gpu-cuda91.image singularity caffe
lico import_system_image tensorflow-cpu $PWD/tensorflow-1.6-cpu.image singularity tensorflow
lico import_system_image tensorflow-gpu $PWD/tensorflow-1.6-gpu-cuda90.image singularity tensorflow
lico import_system_image mxnet-cpu $PWD/mxnet-1.1-cpu.image singularity mxnet
lico import_system_image mxnet-gpu $PWD/mxnet-1.1-gpu-cuda90.image singularity mxnet
lico import_system_image intel-caffe $PWD/intel-caffe-1.0.4-cpu.image singularity intel-caffe
lico import_system_image neon $PWD/neon-2.4-cpu.image singularity neon
7.11 Troubleshooting Slurm issues
Use the Slurm command sinfo to check the node status.
If the node status is drain, you can change the node status back to normal with the command scontrol update NodeName=host1 State=RESUME.
If the node status is down:
--Use the Slurm command scontrol show nodes to see detailed node information; the reason is given in the output of this command.
--Check whether all the nodes have the same slurm.conf file under /etc/slurm.
--Check whether the slurmd and munge services are active on all the nodes, and whether the slurmctld service is active on the management node.
--Check whether all the nodes have the same date and whether the ntpd service is active on all the nodes.
If you see the following warning text when using srun/prun to run an MPI program:
Failed to create a completion queue (CQ):
……
Error: Cannot allocate memory
Check whether soft memlock and hard memlock are set to unlimited in the file /etc/security/limits.conf on the management node and compute nodes. If not, set them to unlimited and restart the nodes for the change to take effect:
* soft memlock unlimited
* hard memlock unlimited
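A quick way to verify the setting is to grep for both lines. This sketch checks a sample file; limits.sample and its contents are placeholders for /etc/security/limits.conf on a real node.

```shell
# Sketch: verify that both memlock lines are present in a limits file.
LIMITS=limits.sample   # use /etc/security/limits.conf for real
cat << 'eof' > "$LIMITS"
* soft memlock unlimited
* hard memlock unlimited
eof
# Report each limit as ok or missing
for kind in soft hard; do
  if grep -q "^\* $kind memlock unlimited" "$LIMITS"; then
    echo "$kind memlock: ok"
  else
    echo "$kind memlock: missing"
  fi
done
```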
7.12 Update OS packages
Check the latest version for CentOS/RHEL 7.4 on the web site http://mirror.centos.org/centos-7/. The steps below assume the latest version is 7.4.1708.
1. Prepare packages: for Red Hat Enterprise Linux, if you have a Red Hat subscription, update the package repository from Red Hat. For CentOS, prepare one CentOS 7.4 node that can access the internet, then run the commands below to create the update package repository:
centos7_4_latest_version=7.4.1708
cat << eof > /etc/yum.repos.d/centos7_4_update.repo
[centos7_4_update]
name=centos7_4_update
baseurl=http://mirror.centos.org/centos/$centos7_4_latest_version/updates/x86_64/
mirrorlist=http://mirrorlist.centos.org/?release=$centos7_4_latest_version&arch=x86_64&repo=updates
gpgcheck=0
enabled=1
eof
yum install -y createrepo
yum install -y yum-utils
mkdir -p /opt/update
cd /opt/update
reposync --download-metadata -r centos7_4_update -e ./cache -n -a x86_64 -d
createrepo .
rm -rf cache
tar -zcf update.tgz centos7_4_update repodata
2. Update packages
Run this command on the management node:
mkdir -p /install/custom/update
Upload the created update.tgz file to /install/custom/update on the management node. Then run the following commands on the management node:
tar -xf update.tgz -C /install/custom/update
cat << eof > /etc/yum.repos.d/centos7_4_update.repo
[centos7_4_update]
name=centos7_4_update
baseurl=http://${sms_name}/install/custom/update
gpgcheck=0
enabled=1
eof
xdcp all /etc/yum.repos.d/centos7_4_update.repo /etc/yum.repos.d/centos7_4_update.repo
Run the commands below on the management node to update packages:
yum -y update --skip-broken
psh all yum -y update --skip-broken
7.13 Using a newer kernel with RETPOLINE support
If an updated kernel that has RETPOLINE support enabled is to be used on the system (for example, a kernel with mitigations for the Spectre/Meltdown security vulnerabilities), then in addition to the kernel update, the toolchain has to be updated as well in order for the NVIDIA driver to build against this kernel. Additionally, glibc should be updated. The following minimum update levels for the kernel, toolchain, and glibc include this support:
1. For RHEL:
https://access.redhat.com/errata/RHSA-2018:0395
https://access.redhat.com/errata/RHBA-2018:0408
https://access.redhat.com/errata/RHBA-2017:3296
2. For CentOS:
https://lists.centos.org/pipermail/centos-announce/2018-March/022768.html
https://lists.centos.org/pipermail/centos-announce/2018-March/022789.html
https://lists.centos.org/pipermail/centos-announce/2017-December/022650.html
Make sure to set up and enable a yum repository for these packages before the steps in section 2.2 of this document for the management node, and before the steps in section 2.3.4 for the compute and managed nodes. This can be done as follows:
mkdir -p /install/custom/retpoline/RPMS
Place the RPMs from all of the above links for RHEL or CentOS in /install/custom/retpoline/RPMS, then run:
cd /install/custom/retpoline/
yum install -y createrepo
createrepo RPMS
cat << eof > /etc/yum.repos.d/retpoline.repo
[retpoline]
name=retpoline
baseurl=file:///install/custom/retpoline/RPMS
gpgcheck=0
enabled=1
eof
cat << eof > /var/tmp/retpoline.repo
[retpoline]
name=retpoline
baseurl=http://${sms_name}/install/custom/retpoline/RPMS
gpgcheck=0
enabled=1
eof
xdcp all /var/tmp/retpoline.repo /etc/yum.repos.d/retpoline.repo
After this, the new kernel should be installed on the management, compute, and login nodes, and those nodes rebooted. Management node, before section 2.2:
yum update kernel
reboot
Compute and Login Nodes, before section 2.3.4:
psh all yum update -y kernel
psh all reboot