Page 1: High End Computing at SDSC

High End Computing at SDSC

CSM Cluster Management

Eva Hocks

San Diego Supercomputer Center

2007

Page 2: High End Computing at SDSC

Managing the HPC systems: DataStar

System Software:
    AIX 5.2 ML3
    CSM 1.3.3.1
    RSCT 2.3.3.3

System Management with CSM:
    Node setup
    Node Groups
        Per frame
        Per function (NPACI, TG, POE, login, batch)

Page 3: High End Computing at SDSC

CSM setup nodes

Configure Nodes

    lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
    vi /tmp/fr8_9

In the editor, replace no_hostname with the cec_name.

An entry that still needs a name:

    no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF

An entry that already has one:

    ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519

Define the nodes and install:

    definenode -f /tmp/fr8_9 InstallOSName=AIX
    systemid -p hmc hscroot
    getadapters -n ds100 -z /tmp/ds100_adapters
        (write to CSM database, include Federation_switch adapters)
    csm2nimnodes -n 'ds100' type='standalone' \
        network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
    netboot -n ds100
    updatenode -n ds100
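The hand edit of the lshwinfo output can also be scripted when many entries come back unnamed. The following is a minimal sketch, not part of the original slides: the field labels are illustrative names for the eight `::`-separated columns seen in the sample records, and the host name `ds113` is a hypothetical value chosen only for the example.

```python
# Illustrative labels for the eight "::"-separated columns in the sample
# records; these names are assumptions, not taken from the slides.
FIELDS = ("hostname", "power_method", "hw_control_point", "hw_control_node_id",
          "lpar_id", "hw_type", "hw_model", "hw_serial")


def parse_record(line):
    """Split one record into a dict keyed by the labels in FIELDS."""
    parts = line.strip().split("::")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    return dict(zip(FIELDS, parts))


def assign_hostnames(lines, names):
    """Replace 'no_hostname' using a mapping keyed by hw_control_node_id.

    Blank lines and comment lines are passed through unchanged.
    """
    out = []
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            out.append(line)
            continue
        rec = parse_record(line)
        if rec["hostname"] == "no_hostname" and rec["hw_control_node_id"] in names:
            rec["hostname"] = names[rec["hw_control_node_id"]]
        out.append("::".join(rec[f] for f in FIELDS))
    return out


SAMPLE = [
    "no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF",
    "ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519",
]

if __name__ == "__main__":
    # "ds113" is a hypothetical host name used only for this example.
    for row in assign_hostnames(SAMPLE, {"fr9-cg13": "ds113"}):
        print(row)
```

Entries whose identifier is missing from the mapping are left as no_hostname, so they remain easy to spot before the file is passed to definenode.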

Page 4: High End Computing at SDSC

CSM_ADAPTERS_STANZA_FILE

ds100:
    MAC_address=00096B34E093
    adapter_duplex=full
    adapter_speed=100
    cable_type=N/A
    install_server=192.168.236.31
    interface_name=en0
    location=U1.32-P1-H1/E1
    machine_type=install
    netaddr=
    network_type=en
    subnet_mask=

ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q2

ds100:
    machine_type=secondary
    interface_name=sn0
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q1
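Before the adapter data is used, it can help to check which interfaces the stanza file describes for a node. The sketch below is an illustration only and rests on the layout shown on this slide: a `name:` line opens a stanza, and `key=value` lines follow it. The function name `parse_stanzas` is hypothetical, and the sample is shortened to two stanzas.

```python
def parse_stanzas(text):
    """Return a list of (node, attributes) pairs, one per stanza.

    A line ending in ':' (and containing no '=') starts a new stanza;
    'key=value' lines are added to the most recent stanza. Anything else,
    such as the CSM_ADAPTERS_STANZA_FILE header line, is ignored.
    """
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:
            current = {}
            stanzas.append((line[:-1], current))
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return stanzas


SAMPLE = """\
CSM_ADAPTERS_STANZA_FILE
ds100:
    MAC_address=00096B34E093
    interface_name=en0
    machine_type=install
    network_type=en
    netaddr=
ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    netaddr=
"""

if __name__ == "__main__":
    for node, attrs in parse_stanzas(SAMPLE):
        print(node, attrs["interface_name"], attrs["machine_type"])
```

Because one node appears in several stanzas (one per adapter), the result is a list of pairs rather than a dictionary keyed by node name.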

Page 5: High End Computing at SDSC

Managing the HPC systems: DataStar

System Management with CSM:

Management through command line
    rpower
        Power on/off, query node status
        Install node: netboot -n ds100
    dsh
        Install updates on nodes (installp, rpm, emgr)
        Monitor processes on nodes

Page 6: High End Computing at SDSC

Managing the HPC systems: DataStar continued…

System Configuration
    cfmupdatenode
        Synchronize system configuration modifications with nodes and system admins
        Run pre/post scripts to capture security risks and send notification

System monitoring:
    Distributed Monitoring responses (GUI configured)
    Event-driven email notification for on-call personnel
    GUI monitoring for operations personnel

Page 7: High End Computing at SDSC

CSM monitoring

Page 8: High End Computing at SDSC

CSM monitoring

Page 9: High End Computing at SDSC
Page 10: High End Computing at SDSC

CSM Event Monitoring

GUI Event Monitoring

Critical Conditions:
    AnyNodeTmpFull
    AnyNodeVarSpace
    AnyNodeSwitchResponds
    LoadLeverProcess
    hostResponds (see setting up ERRM Condition)

Warning Conditions:
    Processor State

Page 11: High End Computing at SDSC

CSM Event Monitoring GUI

Page 12: High End Computing at SDSC

CSM Event Monitoring: setting up ERRM Conditions

hostResponds ERRM condition (redbook SG24-6953 page 193)

    mkcondition -r IBM.ManagedNode \
        -e "Status!=1" -E "Status==1" \
        -d "Node hostResponds down" \
        -D "Node hostResponds up" \
        -m l hostResponds

    mkresponse -n LogStatustoFIFO \
        -s /usr/local/bin/LogStatusData \
        -E "STATUS_FILE=/var/adm/spmondata" LogStatusData

    mkcondresp "hostResponds" "LogStatusData"
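The event expression and rearm expression in the mkcondition call work as a pair: once the event expression (Status!=1) fires, the condition watches the rearm expression (Status==1) instead, and only after the rearm fires does it watch for the next event. The following is a simplified model of that alternation, written as an illustration; it is not RSCT code, and the function name `errm_events` is hypothetical.

```python
def errm_events(status_samples):
    """Yield ("event" | "rearm", index) for a sequence of Status values.

    Simplified model: while armed, only the event expression (Status != 1)
    is evaluated; after it fires, only the rearm expression (Status == 1)
    is evaluated, until it fires and the condition is armed again.
    """
    armed = True
    for i, status in enumerate(status_samples):
        if armed and status != 1:
            armed = False
            yield ("event", i)
        elif not armed and status == 1:
            armed = True
            yield ("rearm", i)


if __name__ == "__main__":
    # Node is up, goes down for two samples, comes back, goes down again.
    print(list(errm_events([1, 1, 0, 0, 1, 0])))
    # → [('event', 2), ('rearm', 4), ('event', 5)]
```

The second down sample (index 3) produces no notification, which is the point of the rearm expression: a node that stays down is reported once, not on every evaluation.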

Page 13: High End Computing at SDSC
Page 14: High End Computing at SDSC
Page 15: High End Computing at SDSC

Event notification

Warning Event email:

=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [0,1,{},{282654}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================

Rearm email:

=====================================
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [1,0,{270492},{270492}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
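On-call tooling may want these notifications as structured data, for example to pair a rearm mail with the event it clears. Here is a minimal sketch under stated assumptions: the field labels are the ones visible in the two sample mails, the `=====` separator lines have already been stripped, and `parse_notification` is a hypothetical helper, not part of CSM or RSCT.

```python
import re

# Labels as they appear in the sample mails above.
LABELS = ["Condition Name", "Severity", "Event Type", "Expression",
          "Resource Name", "Resource Class", "Data Type", "Data Value",
          "Node Name", "Node NameList", "Resource Type"]


def parse_notification(text):
    """Map each known label to its value.

    A value runs from the label's colon to the start of the next known
    label, so the parser works whether fields are on separate lines or
    run together on one line. Whitespace inside values is collapsed.
    """
    alternation = "|".join(
        re.escape(label) for label in sorted(LABELS, key=len, reverse=True))
    pattern = re.compile(r"(%s):" % alternation)
    matches = list(pattern.finditer(text))
    fields = {}
    for match, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt is not None else len(text)
        fields[match.group(1)] = " ".join(text[match.end():end].split())
    return fields


SAMPLE = (
    "Monday 07/26/04 19:13:32 Condition Name: LoadLProcess Severity: Warning "
    "Event Type: Rearm event Expression: Processes.CurPidCount > 0 "
    "Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root ' "
    "Resource Class: IBM.Program Data Type: CT_SD_PTR "
    "Data Value: [1,0,{270492},{270492}] Node Name: ds243 "
    "Node NameList: {ds243} Resource Type: 0"
)

if __name__ == "__main__":
    info = parse_notification(SAMPLE)
    print(info["Condition Name"], info["Event Type"], info["Node Name"])
```

Matching on the known labels rather than on every colon keeps the time stamp (19:13:32) from being misread as a field.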

Page 16: High End Computing at SDSC

CSM Information

CSM Guide for the PSSP System Administrator, SG24-6953
    Useful scripts for ERRM conditions
    Command cross reference

IBM CSM for AIX 5L Administration Guide, SA22-7918
    CSM error messages

Web Sites
    http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm