High End Computing at SDSC
CSM Cluster Management
Eva Hocks
San Diego Supercomputer Center
2007
Managing the HPC systems: DataStar
System Software:
  AIX 5.2 ML3
  CSM 1.3.3.1
  RSCT 2.3.3.3
System Management with CSM:
  Node setup
  Node Groups
    Per frame
    Per function (NPACI, TG, POE, login, batch)
CSM setup nodes
Configure Nodes
  lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
  vi /tmp/fr8_9 : replace noname with cec_name
    no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF
    ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519
  definenode -f /tmp/fr8_9 InstallOSName=AIX
  systemid -p hmc hscroot
  getadapters -n ds100 -z /tmp/ds100_adapters
    write to CSM database, include Federation_switch adapters
  csm2nimnodes -n 'ds100' type='standalone' network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
  netboot -n ds100
  updatenode -n ds100
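The manual vi edit could also be scripted. The sketch below is one possible approach and is not part of the original procedure: it assumes the placeholder is the literal no_hostname in the first "::"-separated field, and that the fourth field (fr9-cg13 in the sample) identifies the partition to rename. The function name set_hostname and the hostname ds101 are hypothetical.

```shell
# set_hostname FILE NAME HOSTNAME
# Print FILE with the no_hostname placeholder replaced by HOSTNAME on the
# line whose fourth "::"-separated field equals NAME (assumed layout).
# All other lines are passed through unchanged.
set_hostname() {
    awk -F'::' -v OFS='::' -v name="$2" -v host="$3" '
        $1 == "no_hostname" && $4 == name { $1 = host }
        { print }
    ' "$1"
}

# Example: write the result to a new file and review it before definenode.
#   set_hostname /tmp/fr8_9 fr9-cg13 ds101 > /tmp/fr8_9.new
```

Writing to a separate file keeps the raw lshwinfo output available for comparison before definenode -f is run.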
CSM_ADAPTERS_STANZA_FILE
ds100:
    MAC_address=00096B34E093
    adapter_duplex=full
    adapter_speed=100
    cable_type=N/A
    install_server=192.168.236.31
    interface_name=en0
    location=U1.32-P1-H1/E1
    machine_type=install
    netaddr=
    network_type=en
    subnet_mask=
ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q2
ds100:
    machine_type=secondary
    interface_name=sn0
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q1
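Before the adapter data is written to the CSM database, a quick summary of the stanza file can help confirm that the install adapter and both switch adapters were collected. The following is a rough sketch, assuming the file holds one attribute=value pair per line under each "node:" header; the function name list_adapters is hypothetical.

```shell
# list_adapters FILE
# Print one line per stanza: node, interface_name, network_type, machine_type.
# Assumes "node:" header lines followed by attribute=value lines.
list_adapters() {
    awk '
        function emit() {
            if (node != "")
                print node, attr["interface_name"], attr["network_type"], attr["machine_type"]
        }
        # A stanza header: a name followed by a colon, nothing else on the line.
        /^[^ \t=]+:[ \t]*$/ {
            emit()
            node = $1
            sub(/:$/, "", node)
            split("", attr)      # clear attributes left over from the previous stanza
            next
        }
        # An attribute line: split at the first "=" so empty values are kept.
        /=/ {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            attr[substr(line, 1, eq - 1)] = substr(line, eq + 1)
        }
        END { emit() }
    ' "$1"
}

# Example:
#   list_adapters /tmp/ds100_adapters
```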
Managing the HPC systems: DataStar
System Management with CSM:
  Management through Command line
    rpower
      Power on/off, query node status
      Install node: netboot -n ds100
    dsh
      Install updates on nodes (installp, rpm, emgr)
      Monitor processes on nodes
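As an illustration of the command-line usage listed above, the sketch below wraps rpower and dsh in a dry-run helper so the commands can be inspected before they touch a node. The helper names (run, node_status, node_procs) and the DRY_RUN variable are assumptions made for this example; ds100 and LoadL_startd are taken from elsewhere in these slides.

```shell
# With DRY_RUN=1 (the default here) commands are only printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        # Printed form drops the original quoting; it is for inspection only.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Query the power status of one node.
node_status() {
    run rpower -n "$1" query
}

# Look for a process on one node through dsh.
node_procs() {
    run dsh -n "$1" "ps -ef | grep $2"
}

# Example:
#   node_status ds100
#   node_procs ds100 LoadL_startd
```

Setting DRY_RUN=0 on the management server would execute the same commands for real.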
Managing the HPC systems: DataStar continued…
System Configuration
  cfmupdatenode
    Synchronize system configuration modifications with nodes and system admins
    Run pre/post scripts to capture security risks and send notification
System monitoring:
  Distributed Monitoring responses (GUI configured)
    Event driven email notification for on-call personnel
    GUI monitoring for operations personnel
CSM monitoring
CSM Event Monitoring
GUI Event Monitoring
  Critical Conditions:
    AnyNodeTmpFull
    AnyNodeVarSpace
    AnyNodeSwitchResponds
    LoadLeverProcess
    hostResponds (see setting up ERRM Condition)
  Warning Conditions:
    Processor State
CSM Event Monitoring GUI
CSM Event Monitoring
setting up ERRM Conditions
hostResponds ERRM condition (redbook SG24-6953 page 193)
  mkcondition -r IBM.ManagedNode \
      -e "Status!=1" -E "Status==1" \
      -d "Node hostResponds down" \
      -D "Node hostResponds up" \
      -m l hostResponds
  mkresponse -n LogStatustoFIFO \
      -s /usr/local/bin/LogStatusData \
      -E "STATUS_FILE=/var/adm/spmondata" LogStatusData
  mkcondresp "hostResponds" "LogStatusData"
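The slides do not show the contents of /usr/local/bin/LogStatusData. As a purely hypothetical sketch of what such a response script might do, the function below appends one line per event to the file named by STATUS_FILE (set through -E above), using the ERRM_* environment variables that ERRM exports to response scripts. The record layout and the function name log_status are assumptions, not the author's script.

```shell
# log_status
# Append "time|node|condition|event type" to the status file.
# STATUS_FILE comes from the mkresponse -E setting; the ERRM_* variables are
# set by ERRM when it runs a response script. Missing values become "unknown".
log_status() {
    status_file=${STATUS_FILE:-/var/adm/spmondata}
    printf '%s|%s|%s|%s\n' \
        "${ERRM_TIME:-unknown}" \
        "${ERRM_NODE_NAME:-unknown}" \
        "${ERRM_COND_NAME:-unknown}" \
        "${ERRM_TYPE:-unknown}" >> "$status_file"
}

# A real response script would end by calling:
#   log_status
```

Because the same response fires for both the event and the rearm event, recording ERRM_TYPE lets a reader of the file tell a node going down from a node coming back.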
Event notification
Warning Event email
=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [0,1,{},{282654}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================

Rearm email:
=====================================
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [1,0,{270492},{270492}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
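For on-call triage it can be handy to reduce a notification to a single line. The sketch below is an illustration only: it assumes each "Label: value" field sits on its own line in the delivered message and that the file holds a single message; the function name summarize_event is hypothetical.

```shell
# summarize_event FILE
# Print "node condition event-type" for one saved notification message.
summarize_event() {
    awk -F': ' '
        /^Condition Name: / { cond = $2 }
        /^Event Type: /     { type = $2 }
        /^Node Name: /      { node = $2 }   # does not match "Node NameList:"
        END { print node, cond, type }
    ' "$1"
}

# Example:
#   summarize_event /tmp/event_mail.txt
```

A summary ending in "Rearm event" indicates the condition has cleared, as in the second message above where LoadL_startd is running again on ds243.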
CSM Information
  CSM Guide for the PSSP Systems Administrator, SG24-6953
    Useful scripts for ERRM conditions
    Command cross reference
  IBM CSM for AIX 5L Administration Guide, SA22-7918
    CSM error messages
  Web Sites
    http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm