High End Computing at SDSC CSM Cluster Management Eva Hocks San Diego Supercomputer Center 2007


Page 1:

High End Computing at SDSC

CSM Cluster Management
Eva Hocks
San Diego Supercomputer Center
2007

Page 2:

Managing the HPC systems: DataStar

System Software:
    AIX 5.2 ML3
    CSM 1.3.3.1
    RSCT 2.3.3.3

System Management with CSM:
    Node setup
    Node Groups
        Per frame
        Per function (NPACI, TG, POE, login, batch)

Page 3:

CSM setup nodes
Configure Nodes

    lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
    vi /tmp/fr8_9 : replace noname with cec_name
        no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF
        ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519
    definenode -f /tmp/fr8_9 InstallOSName=AIX
    systemid -p hmc hscroot
    getadapters -n ds100 -z /tmp/ds100_adapters
        write to CSM database, include Federation_switch adapters
    csm2nimnodes -n 'ds100' type='standalone' network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
    netboot -n ds100
    updatenode -n ds100
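The hand edit of the lshwinfo output could also be scripted. The following is a minimal sketch, not part of the original slides: it assumes each record is a single line with eight fields separated by "::", the field names are assumed labels for illustration (the slide does not name the columns), and the hostname "ds101" is hypothetical.

```python
# Assumed column labels for illustration; the slide does not name the fields.
FIELDS = [
    "hostname", "power_method", "hw_control_point", "hw_control_node_id",
    "lpar_id", "hw_type", "hw_model", "hw_serial",
]


def parse_hwinfo_line(line):
    """Split one '::'-delimited record into a dict keyed by FIELDS."""
    values = line.strip().split("::")
    # Pad short records so every field is present (a trailing field may be empty).
    values += [""] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def assign_hostname(line, hostname, placeholder="no_hostname"):
    """Replace the placeholder in the first field with the given hostname."""
    first, sep, rest = line.partition("::")
    if first != placeholder:
        return line  # already named; leave the record untouched
    return hostname + sep + rest


record = "no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF"
# "ds101" is a hypothetical node name, not taken from the slides.
print(assign_hostname(record, "ds101"))
print(parse_hwinfo_line(record)["hw_control_node_id"])
```

Records that already carry a hostname, such as the ds100 line, pass through unchanged, so the function can be applied to every line of the file before running definenode.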

Page 4:

CSM_ADAPTERS_STANZA_FILE

ds100:
    MAC_address=00096B34E093
    adapter_duplex=full
    adapter_speed=100
    cable_type=N/A
    install_server=192.168.236.31
    interface_name=en0
    location=U1.32-P1-H1/E1
    machine_type=install
    netaddr=
    network_type=en
    subnet_mask=

ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q2

ds100:
    machine_type=secondary
    interface_name=sn0
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q1
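To check which adapters getadapters recorded for a node, the stanza file can be read back programmatically. This is a rough sketch under the assumption that each stanza starts with a "nodename:" line followed by one key=value attribute per line; the parser and the shortened sample below are illustrative and not part of CSM.

```python
def parse_stanzas(text):
    """Parse 'node:' headers followed by key=value lines.

    Returns a list of (node, attributes) pairs, because the same node
    name appears once per adapter.
    """
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith(":") and "=" not in line:
            current = {}
            stanzas.append((line[:-1], current))
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return stanzas


# Shortened sample based on the stanzas shown on the slide.
SAMPLE = """\
ds100:
    MAC_address=00096B34E093
    interface_name=en0
    machine_type=install
    network_type=en

ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    location=U1.5-P1-H1/Q2
"""

stanzas = parse_stanzas(SAMPLE)
# List the switch adapters (network_type "sn") found for each node.
switch_adapters = [
    (node, attrs["interface_name"])
    for node, attrs in stanzas
    if attrs.get("network_type") == "sn"
]
print(switch_adapters)
```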

Page 5:

Managing the HPC systems: DataStar

System Management with CSM:
    Management through command line
    rpower
        Power on/off, query node status
    Install node: netboot -n ds100
    dsh
        Install updates on nodes (installp, rpm, emgr)
        Monitor processes on nodes

Page 6:

Managing the HPC systems: DataStar continued…

System Configuration
    cfmupdatenode
        Synchronize system configuration modifications with nodes and system admins
        Run pre/post scripts to capture security risks and send notification

System monitoring:
    Distributed monitoring responses (GUI configured)
    Event-driven email notification for on-call personnel
    GUI monitoring for operations personnel

Page 7:

CSM monitoring

Page 8:

CSM monitoring

Page 9:
Page 10:

CSM Event Monitoring
GUI Event Monitoring

Critical Conditions:
    AnyNodeTmpFull
    AnyNodeVarSpace
    AnyNodeSwitchResponds
    LoadLeverProcess
    hostResponds (see setting up ERRM Condition)

Warning Conditions:
    Processor State

Page 11:

CSM Event Monitoring GUI

Page 12:

CSM Event Monitoring
Setting up ERRM Conditions

hostResponds ERRM condition (redbook SG24-6953, page 193)

    mkcondition -r IBM.ManagedNode \
        -e "Status!=1" -E "Status==1" \
        -d "Node hostResponds down" \
        -D "Node hostResponds up" \
        -m l hostResponds

    mkresponse -n LogStatustoFIFO \
        -s /usr/local/bin/LogStatusData \
        -E "STATUS_FILE=/var/adm/spmondata" LogStatusData

    mkcondresp "hostResponds" "LogStatusData"
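The slides do not show what /usr/local/bin/LogStatusData contains. As a hedged illustration of how a response script of this kind could work, the sketch below reads the ERRM_* environment variables that RSCT exports to response scripts, plus the STATUS_FILE variable set with -E above, and appends one line per event. The log line format and function names are assumptions, not the SDSC script.

```python
import os
import time


def format_status_line(env, now=None):
    """Build one log line from the ERRM_* variables passed to the script."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "{} {} {} type={} value={}".format(
        stamp,
        env.get("ERRM_NODE_NAME", "unknown"),
        env.get("ERRM_COND_NAME", "unknown"),
        env.get("ERRM_TYPE", "unknown"),
        env.get("ERRM_VALUE", ""),
    )


def log_status(env=os.environ):
    """Append the status line to STATUS_FILE; do nothing if it is unset."""
    path = env.get("STATUS_FILE")
    if not path:
        return None
    line = format_status_line(env)
    with open(path, "a") as handle:
        handle.write(line + "\n")
    return line


if __name__ == "__main__":
    log_status()
```

Keeping the formatting separate from the file write makes it possible to check the output without an ERRM event actually firing.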

Page 13:
Page 14:
Page 15:

Event notification

Warning Event email:

=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [0,1,{},{282654}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================

Rearm email:

=====================================
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [1,0,{270492},{270492}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
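The two emails show the event and rearm expressions working as a pair: once the event expression fires, monitoring switches to the rearm expression, and only after the rearm fires is the event expression watched again. The toy model below sketches that toggle. The class and the sample process counts are illustrative assumptions; only the condition name and the two expressions come from the emails.

```python
class Condition:
    """Minimal event/rearm toggle for a single monitored value."""

    def __init__(self, name, event_expr, rearm_expr):
        self.name = name
        self.event_expr = event_expr
        self.rearm_expr = rearm_expr
        self.armed = True  # True: waiting for event; False: waiting for rearm

    def observe(self, value):
        """Return the notification type for this sample, or None."""
        if self.armed and self.event_expr(value):
            self.armed = False
            return "Event"
        if not self.armed and self.rearm_expr(value):
            self.armed = True
            return "Rearm event"
        return None


# Expressions mirror the emails: CurPidCount <= 0 (event), > 0 (rearm).
cond = Condition("LoadLProcess", lambda n: n <= 0, lambda n: n > 0)

# Hypothetical CurPidCount samples: the daemon stops, then is restarted.
samples = [1, 0, 0, 1, 1]
notifications = [cond.observe(n) for n in samples]
print(notifications)  # [None, 'Event', None, 'Rearm event', None]
```

The repeated 0 produces no second event, which matches the purpose of a rearm expression: one notification when the daemon goes down and one when it comes back.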

Page 16:

CSM Information

CSM Guide for the PSSP Systems Administrator, SG24-6953
    Useful scripts for ERRM conditions
    Command cross reference

IBM CSM for AIX 5L Administration Guide, SA22-7918
    CSM error messages

Web Sites
    http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm