high end computing at sdsc csm cluster management eva hocks san diego supercomputer center 2007

Post on 18-Jan-2018



High End Computing at SDSC

CSM Cluster Management
Eva Hocks
San Diego Supercomputer Center
2007

Managing the HPC systems: DataStar

System Software:
    AIX 5.2 ML3
    CSM 1.3.3.1
    RSCT 2.3.3.3

System Management with CSM:
    Node setup
    Node Groups
        Per frame
        Per function (NPACI, TG, POE, login, batch)

CSM setup nodes

Configure Nodes
    lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
    vi /tmp/fr8_9 : replace noname with cec_name
        no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF
        ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519
    definenode -f /tmp/fr8_9 InstallOSName=AIX
    systemid -p hmc hscroot
    getadapters -n ds100 -z /tmp/ds100_adapters
        write to CSM database, include Federation_switch adapters
    csm2nimnodes -n 'ds100' type='standalone' network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
    netboot -n ds100
    updatenode -n ds100
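The manual vi step above can also be scripted. The following is a minimal sketch, not part of the original slides, assuming the lshwinfo record layout is Hostname::PowerMethod::HWControlPoint::HWControlNodeId::LParID::HWType::HWModel::HWSerialNum and that the CEC name in the fourth field identifies the node. The function name `assign_hostnames`, the mapping, and the hostname "ds-example" are hypothetical.

```python
# Sketch: fill in placeholder hostnames in lshwinfo output before definenode.
# Assumed field order (an assumption, check your lshwinfo output header):
# Hostname::PowerMethod::HWControlPoint::HWControlNodeId::LParID::HWType::HWModel::HWSerialNum

PLACEHOLDER = "no_hostname"


def assign_hostnames(lines, cec_to_host):
    """Replace the placeholder hostname using the CEC name in field 4."""
    result = []
    for line in lines:
        stripped = line.strip()
        # Pass blank lines and comment lines through unchanged.
        if not stripped or stripped.startswith("#"):
            result.append(stripped)
            continue
        fields = stripped.split("::")
        if fields[0] == PLACEHOLDER and len(fields) > 3 and fields[3] in cec_to_host:
            fields[0] = cec_to_host[fields[3]]
        result.append("::".join(fields))
    return result


sample = [
    "no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF",
    "ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519",
]
# "ds-example" is a hypothetical hostname used only for illustration.
mapping = {"fr9-cg13": "ds-example"}

fixed = assign_hostnames(sample, mapping)
for line in fixed:
    print(line)
```

Records that already carry a hostname, such as the ds100 line, are left untouched, so the output can be reviewed and then passed to definenode -f as on the slide.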

CSM_ADAPTERS_STANZA_FILE

ds100:
    MAC_address=00096B34E093
    adapter_duplex=full
    adapter_speed=100
    cable_type=N/A
    install_server=192.168.236.31
    interface_name=en0
    location=U1.32-P1-H1/E1
    machine_type=install
    netaddr=
    network_type=en
    subnet_mask=

ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q2

ds100:
    machine_type=secondary
    interface_name=sn0
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q1
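Before writing adapter data to the CSM database it can help to check the stanza file for empty attributes such as netaddr and subnet_mask. The sketch below is an illustration only: it assumes each stanza starts with a "node:" header line followed by attribute=value lines, and the helper name `parse_stanzas` is hypothetical.

```python
# Sketch: parse an adapter stanza file into a list of per-adapter records.
# Assumption: a stanza starts with "<node>:" on its own line, followed by
# "attribute=value" lines; values may be empty.

def parse_stanzas(text):
    """Return a list of {"node": name, "attrs": {...}} dictionaries."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:
            # New stanza header, e.g. "ds100:"
            current = {"node": line[:-1], "attrs": {}}
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current["attrs"][key.strip()] = value.strip()
    return stanzas


sample = """
ds100:
    machine_type=install
    interface_name=en0
    network_type=en
    netaddr=

ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
"""

stanzas = parse_stanzas(sample)
# List adapters whose netaddr still has to be filled in.
missing = [s["attrs"]["interface_name"] for s in stanzas
           if s["attrs"].get("netaddr", "") == ""]
print(missing)
```

Each node can appear in several stanzas (one per adapter), which is why the result is a list rather than a dictionary keyed by node name.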

Managing the HPC systems: DataStar

System Management with CSM:
    Management through command line
        rpower
            Power on/off, query node status
        Install node: netboot -n ds100
        dsh
            Install updates on nodes (installp, rpm, emgr)
            Monitor processes on nodes

Managing the HPC systems: DataStar continued…

System Configuration
    cfmupdatenode
        Synchronize system configuration modifications with nodes and system admins
        Run pre/post scripts to capture security risks and send notification

System monitoring:
    Distributed Monitoring responds (GUI configured)
    Event-driven email notification for on-call personnel
    GUI monitoring for operations personnel

CSM monitoring


CSM Event Monitoring

GUI Event Monitoring
    Critical Conditions:
        AnyNodeTmpFull
        AnyNodeVarSpace
        AnyNodeSwitchResponds
        LoadLeverProcess
        hostResponds (see setting up ERRM Condition)
    Warning Conditions:
        Processor State

CSM Event Monitoring GUI

CSM Event Monitoring

Setting up ERRM Conditions
hostResponds ERRM condition (redbook SG24-6953 page 193)

    mkcondition -r IBM.ManagedNode \
        -e "Status!=1" -E "Status==1" \
        -d "Node hostResponds down" \
        -D "Node hostResponds up" \
        -m l hostResponds

    mkresponse -n LogStatustoFIFO \
        -s /usr/local/bin/LogStatusData \
        -E "STATUS_FILE=/var/adm/spmondata" LogStatusData

    mkcondresp "hostResponds" "LogStatusData"
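The condition above pairs an event expression (-e "Status!=1") with a rearm expression (-E "Status==1"). As a rough illustration of how such a pair behaves, the sketch below simulates the alternation: once the event expression fires, only the rearm expression is watched, and vice versa. This is a simplified model, not RSCT code; `track_condition` and the sample status values are hypothetical.

```python
# Sketch: simulate event/rearm alternation for a condition such as
# hostResponds (event: Status != 1, rearm: Status == 1).

def track_condition(samples, event_expr, rearm_expr):
    """Return a list of (sample_index, "event" | "rearm") notifications."""
    armed = True  # True: watching the event expression; False: watching rearm
    notifications = []
    for index, value in enumerate(samples):
        if armed and event_expr(value):
            notifications.append((index, "event"))
            armed = False
        elif not armed and rearm_expr(value):
            notifications.append((index, "rearm"))
            armed = True
    return notifications


# Hypothetical sequence of Status values observed for one node.
statuses = [1, 1, 0, 0, 1, 1, 0]
events = track_condition(statuses,
                         event_expr=lambda s: s != 1,
                         rearm_expr=lambda s: s == 1)
print(events)  # [(2, 'event'), (4, 'rearm'), (6, 'event')]
```

Note that the second consecutive 0 produces no new notification: a repeated "down" reading is suppressed until the rearm expression has been satisfied, which keeps a node that stays down from generating a stream of duplicate events.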

Event notification

Warning Event email
=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [0,1,{},{282654}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================

Rearm email:
=====================================
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [1,0,{270492},{270492}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
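Notification emails in this layout are easy to process mechanically, for example to pair each event with its rearm. The following sketch assumes one "Key: value" field per line, a timestamp line without a "Key: " prefix, and separator lines made of "=" characters; the helper name `parse_notification` is hypothetical.

```python
# Sketch: turn an ERRM notification email body into a dictionary.
# Assumptions: one "Key: value" pair per line, "=" separator lines,
# and a single timestamp line that has no "Key: " prefix.

def parse_notification(body):
    """Return the notification fields plus a "timestamp" entry."""
    fields = {}
    for raw in body.splitlines():
        line = raw.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and separator rules
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
        elif "timestamp" not in fields:
            fields["timestamp"] = line
    return fields


email_body = """
=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Class: IBM.Program
Node Name: ds243
=====================================
"""

event = parse_notification(email_body)
print(event["Node Name"], event["Condition Name"], event["Event Type"])
```

Splitting on the first ": " (colon followed by a space) keeps the time of day in the timestamp line intact, since "19:12:34" contains colons but no colon-space sequence.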

CSM Information

CSM Guide for the PSSP Systems Administrator, SG24-6953
    Useful scripts for ERRM conditions
    Command cross reference

IBM CSM for AIX 5L Administration Guide, SA22-7918
    CSM error messages

Web Sites
    http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm
