
Triggers
Improving Availability Through Event-driven Automation

Sean Moe
18 September 2009

Agenda

Problem: Productivity & availability losses inherent in large-scale computing

Solution: Generic metrics, native resource managers and trigger technologies

Advantages: Benefits of Moab triggers vs. generic event managers

Tutorial: How do I create, manage, and utilize triggers?

Please write your questions down


Trigger Demonstration

Event

“Fire” Clap

Action


Problem

Productivity Losses & Resource Availability

Middleware Failures
User Inefficiencies
Hardware Failures
Partitioning Failures
Intra-job Inefficiencies
Environmental Inefficiencies

Need for Automation

Convenient in small resource pools

Necessary in large systems
  More failures
  More user complaints
  Less time for administrators


Solution

Triggers: Automating Responses

1) Detect an event
2) Perform diagnostics
3) Execute action(s)

[Diagram labels: Temp > 60; Workload Can Be Moved; Shutdown Node; Email Admin]

Moab Hierarchy

[Diagram labels: Moab; Resource Manager; Resource Manager]

Generic Metrics

Arbitrary information associated with resources and workload

Widespread Usage

Decisions can be made and reports can be generated based on site-specific environmental factors

Power fluctuations
Machine room temperature
Machine room chiller health
Power failures
Network connectivity
Network card failures
Hardware failures
CPU temperatures
Network file server status
Hard drive failures

Enabling Generic Metrics

# Example "temp.txt"
# Temperature output from various nodes
node001 GMETRIC[temp]=113
node002 GMETRIC[temp]=83
node003 GMETRIC[temp]=107
node004 GMETRIC[temp]=85

# moab.cfg
RMCFG[native] TYPE=NATIVE
RMCFG[native] CLUSTERQUERYURL=file://$TOOLS/temp.txt
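A file like temp.txt has to be rewritten as readings change. The following is a minimal sketch of a collector that renders readings in the "node GMETRIC[name]=value" line format shown on this slide. The function names, the sample readings, and the choice of Python are assumptions made for illustration; they are not part of Moab.

```python
def format_gmetric_lines(metric, readings):
    """Render {node: value} as 'node GMETRIC[metric]=value' lines, sorted by node."""
    return [
        f"{node} GMETRIC[{metric}]={value}"
        for node, value in sorted(readings.items())
    ]


def write_gmetric_file(path, metric, readings):
    """Write the lines to the file that CLUSTERQUERYURL points at (path is hypothetical)."""
    with open(path, "w") as handle:
        handle.write("\n".join(format_gmetric_lines(metric, readings)) + "\n")


# Hypothetical readings; a real collector would query sensors instead.
sample = {"node002": 83, "node001": 113}
for line in format_gmetric_lines("temp", sample):
    print(line)
```

Keeping the formatting separate from the file write makes the line format easy to check without touching the filesystem.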

Where Triggers Come In

[Diagram labels: Moab; Resource Manager; Resource Manager; Native Resource Manager; Other GMetrics; Triggers]


Advantages

Benefits of Triggers

Actions are event-based and independent of normal job scheduling

Attached to various scheduler objects

Inherit variable namespace from parent object

Can export/import data to/from other objects

Basis for dynamic workflow control

Trigger Dependencies & Dynamic Workflow

Trigger variables allow for complex dependency graphs

Multiple execution paths

Can rely on external dependencies

Other policy restrictions are also enforced

[Diagram: dependency graph with nodes numbered 1 through 9]


Triggers vs. Other Event Managers (cron,nagios,...)

Access to global resource information

Integrated control over workload

Access to high level resource and workload management facilities

Intelligent workload-aware responses


Tutorial

Trigger Attributes

AType: Action Type, what type of action to perform
EType: Event Type, what event triggers this action
Action: usually a script or a Moab-related command

AType=exec EType=start Action="report.pl"
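The three attributes are joined with commas when a trigger is written into moab.cfg, as the tutorial examples in this deck show. The following is a minimal sketch of a helper that assembles such a string; build_trigger is a hypothetical name, and the helper does no validation of attribute names or values.

```python
def build_trigger(atype, etype, action, **extra):
    """Join trigger attributes into the comma-separated form used in moab.cfg."""
    parts = [f"AType={atype}", f"EType={etype}", f'Action="{action}"']
    # Optional attributes (for example MultiFire) keep the order they were passed in.
    parts += [f"{name}={value}" for name, value in extra.items()]
    return ",".join(parts)


# Attach the basic example from this slide to every job (illustrative only).
print("JOBCFG[DEFAULT] TRIGGER=" + build_trigger("exec", "start", "report.pl"))
```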

Advanced Trigger Attributes

Scheduling: BlockTime, ExpireTime, FailOffset, Interval, MaxRetry, MultiFire, Offset, Period, RearmTime, Timeout

Administration: Description, Flags, Name, Threshold

Dependencies: Requires, Sets, UnSets

Trigger Example #1

When a job is placed on hold, run a script whose first parameter is the job ID

AType=exec
EType=hold
Action="$TOOLS/held_job.pl $OID"

# moab.cfg
JOBCFG[DEFAULT] TRIGGER=AType=exec,EType=hold,Action="$TOOLS/held_job.pl $OID"

Trigger Example #2

Send an email when a node goes down

AType=exec
EType=fail
Action="$TOOLS/down.pl $OID"
MultiFire=TRUE
RearmTime=1:00

# moab.cfg
NODECFG[n01] TRIGGER=AType=exec,EType=fail,Action="$TOOLS/down.pl $OID",MultiFire=TRUE,RearmTime=1:00
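MultiFire and RearmTime together limit how often a repeatable trigger fires. The following is a rough sketch of that behaviour, assuming the rearm period is counted from the moment the trigger fires; that timing detail, the function name, and the event times are assumptions for illustration, not statements about Moab's internals.

```python
def fired_events(event_times, rearm_seconds, multifire=True):
    """Return the event times (seconds) at which a trigger would fire.

    Assumption: after firing, the trigger ignores events until
    rearm_seconds have passed; without MultiFire it fires only once.
    """
    fired = []
    armed_at = None  # earliest time at which the trigger is armed again
    for t in sorted(event_times):
        if fired and not multifire:
            break  # a non-repeatable trigger is done after its first firing
        if armed_at is None or t >= armed_at:
            fired.append(t)
            armed_at = t + rearm_seconds
    return fired


# Node failures reported at these times, with RearmTime=1:00 (60 seconds).
print(fired_events([0, 30, 60, 200], 60))
```

Under these assumptions the failure reported at 30 seconds is suppressed, so the administrator is not mailed twice within the same minute.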

Trigger Example #3

Create a 5-minute reservation after every job to account for the job epilogue

AType=reserve
EType=end
Action="5:00"
Description="Reservation for job epilogue"

# moab.cfg
JOBCFG[DEFAULT] TRIGGER=AType=reserve,EType=end,Action="5:00",Description="Reservation for job epilogue"
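The deck writes durations as colon-separated fields: "5:00" for five minutes here, and "-24:00:00" for 24 hours before an event in a later example. The following is a small sketch of a converter to seconds, assuming at most three fields (hours, minutes, seconds) with seconds rightmost; parse_duration is a hypothetical helper, not a Moab utility.

```python
def parse_duration(text):
    """Convert '[-][[HH:]MM:]SS' into signed seconds."""
    sign = -1 if text.startswith("-") else 1
    fields = [int(part) for part in text.lstrip("+-").split(":")]
    if not 1 <= len(fields) <= 3:
        raise ValueError(f"unsupported duration: {text!r}")
    seconds = 0
    for field in fields:
        # Each further field is 60 times finer than the one before it.
        seconds = seconds * 60 + field
    return sign * seconds


print(parse_duration("5:00"))       # epilogue reservation length
print(parse_duration("-24:00:00"))  # offset before a reservation ends
```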

Trigger Example #4

Execute an email script when user "bob" has too much backlog

AType=exec
EType=threshold
Action="$TOOLS/email.pl $OID"
Threshold=backlog>100
FailOffset=1:00

# moab.cfg
USERCFG[bob] TRIGGER=AType=exec,EType=threshold,Action="$TOOLS/email.pl $OID",Threshold=backlog>100,FailOffset=1:00
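The appendix describes FailOffset as the time a threshold must exist before the trigger fires. The following is a sketch of that idea over sampled values, assuming the condition must hold without interruption for the whole offset; the sampling model and the function name are illustrative assumptions.

```python
def first_fire_time(samples, limit, fail_offset):
    """Return the first sample time at which value > limit has held
    continuously for fail_offset seconds, or None if it never does.

    samples: list of (time_seconds, value) pairs in time order.
    """
    exceeded_since = None
    for t, value in samples:
        if value > limit:
            if exceeded_since is None:
                exceeded_since = t  # threshold just became true
            if t - exceeded_since >= fail_offset:
                return t
        else:
            exceeded_since = None  # condition cleared, start over
    return None


# Hypothetical backlog samples for user "bob", Threshold=backlog>100, FailOffset=1:00.
backlog = [(0, 120), (30, 90), (60, 110), (90, 130), (120, 150)]
print(first_fire_time(backlog, 100, 60))
```

The brief spike at time 0 does not fire the trigger because the backlog drops again before a minute has passed.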

Trigger Example #5

Toggle the MAXPROC parameter for user "alice" based on total system usage

# moab.cfg
SCHEDCFG[moab] TRIGGER=AType=changeparam,EType=threshold,Action="USERCFG[alice] MAXPROC=5",Threshold=usage>90%,MultiFire=TRUE

SCHEDCFG[moab] TRIGGER=AType=changeparam,EType=threshold,Action="USERCFG[alice] MAXPROC=100",Threshold=usage<90%,MultiFire=TRUE
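Read as plain logic, the two triggers act as a switch on alice's processor limit. The following sketch only restates the thresholds from this slide and ignores how and when Moab evaluates them; the function name is hypothetical.

```python
def maxproc_for_usage(usage_percent, current):
    """Return the MAXPROC value the pair of triggers would leave in place."""
    if usage_percent > 90:
        return 5      # first trigger: system is busy, restrict alice
    if usage_percent < 90:
        return 100    # second trigger: system has room, relax the limit
    return current    # exactly 90% matches neither threshold


print(maxproc_for_usage(95, current=100))
print(maxproc_for_usage(40, current=5))
```

Writing it out this way shows one property of the configuration: at exactly 90% usage neither trigger applies, so the previous limit stays in force.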

Trigger Example #6

If a node's temperature goes above 60, reserve the node, notify the admin and shut it down

# moab.cfg
NODECFG[DEFAULT] TRIGGER=AType=internal,EType=threshold,Action=reserve,Sets=RESERVED,Threshold=GMetric[temp]>60

NODECFG[DEFAULT] TRIGGER=AType=exec,EType=start,Action="$TOOLS/node_email.pl $OID",Requires=RESERVED

NODECFG[DEFAULT] TRIGGER=AType=exec,EType=start,Action="$TOOLS/shutdown.pl $OID",Requires=RESERVED

Trigger Example #7

If scratch space fills up, reserve the node, run a cleanup script and send a message to the administrator based on the result

# moab.cfg
NODECFG[DEFAULT] TRIGGER=AType=internal,EType=threshold,Action=reserve,Sets=DRAINED,Threshold=GMetric[scratch]>1001

NODECFG[DEFAULT] TRIGGER=AType=exec,EType=start,Action="$TOOLS/cleanup.pl $OID",Requires=DRAINED,Sets=CLEANED.!FAILURE

NODECFG[DEFAULT] TRIGGER=AType=mail,EType=start,Action="$OID is cleaned",Requires=CLEANED

NODECFG[DEFAULT] TRIGGER=AType=mail,EType=start,Action="$OID is not cleaned",Requires=FAILURE
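The Requires and Sets attributes chain these four triggers into a small workflow with a success path and a failure path. The following is a toy model of that chaining, assuming the scratch threshold has already been crossed. The trigger names, the dictionary layout, and the evaluation loop are all illustrative assumptions; this is not how Moab is implemented.

```python
def run_triggers(triggers, outcomes):
    """Fire each trigger once as soon as its required variables exist.

    triggers: list of dicts with 'name', 'requires', 'on_success', 'on_failure' (sets).
    outcomes: {name: bool}, whether the action succeeded (default True).
    Returns (names in firing order, final variable set).
    """
    variables, fired = set(), []
    progress = True
    while progress:
        progress = False
        for trig in triggers:
            if trig["name"] in fired or not trig["requires"] <= variables:
                continue
            succeeded = outcomes.get(trig["name"], True)
            variables |= trig["on_success"] if succeeded else trig["on_failure"]
            fired.append(trig["name"])
            progress = True
    return fired, variables


# The four triggers of this example; Sets=CLEANED.!FAILURE becomes two sets.
example7 = [
    {"name": "reserve", "requires": set(), "on_success": {"DRAINED"}, "on_failure": set()},
    {"name": "cleanup", "requires": {"DRAINED"}, "on_success": {"CLEANED"}, "on_failure": {"FAILURE"}},
    {"name": "mail_cleaned", "requires": {"CLEANED"}, "on_success": set(), "on_failure": set()},
    {"name": "mail_not_cleaned", "requires": {"FAILURE"}, "on_success": set(), "on_failure": set()},
]
print(run_triggers(example7, {"cleanup": False})[0])
```

Changing the outcome of the cleanup step switches which mail trigger runs, which is the "multiple execution paths" idea from the dependency slide.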

More Examples

# Run diagnostics when an RM fails (for 3 minutes)
RMCFG[native] TYPE=NATIVE FAILTIME=3:00
RMCFG[native] TRIGGER=AType=exec,EType=failure,Action="$TOOLS/diagnose_rm.pl $OID"

# Associate a job trigger with a class
CLASSCFG[batch] JOBTRIGGER=AType=exec,EType=preempt,Action="$TOOLS/preempt_notify.pl $OID $OWNER $HOSTNAME"

# Send an email 24 hours before a reservation ends to notify/remind the user
RSVCFG[apache_farm] TRIGGER=AType=exec,EType=end,Offset=-24:00:00,Action="$TOOLS/rsv_end_email.pl $OID $OWNER $TIME"

# Standing trigger
SCHEDCFG[moab] TRIGGER=AType=exec,EType=standing,Period=hour,Action="$TOOLS/createjobs_hour.pl"


What else can I do?

Email the owner of a particular reservation when the usage drops below a specific threshold to encourage efficient use of reserved resources

Launch an evaluation script 5 minutes before a job is scheduled to complete to gather more detailed statistics about how well the job ran

Guarantee a particular account a certain service level by emailing the admin, creating a reservation, and/or contacting a hosting utility for more resources if backlog exceeds an hour of waiting time

Other Ways to Create Triggers

# Attach a trigger to an object
mschedctl -c trigger <attr>=<val>[,<attr>=<val>...] -o <obj_type>:<obj_val>

# Dynamically add a trigger to a reservation
mrsvctl -c -T <attr>=<val>[,<attr>=<val>...]

# Submit a job with a trigger
msub <job_id> -l 'trig=<attr>=<val>[\&<attr>=<val>...]'

NOTE: For security reasons, only users having a QoS with the 'trigger' flag can submit jobs with attached triggers

Monitoring & Modifying Triggers

Monitoring:
mdiag -T [-v]
mdiag -T [-v] trigger.id
mdiag -T [-v] job.id
mdiag -T -V (shows a global view of all triggers associated with the current user)

Modifying:
mschedctl -m trigger:2 AType=exec,Offset=200

Where Should I Start?

Moab already handles many common cases:
  Job start failures
  Node failures

Start with simple tasks:
  Sending emails
  Executing small scripts & then sending emails

Start with the most common resource, workflow, and service failures


Summary

“I fell asleep. What did I miss?”

Losses in productivity and inefficiencies in resource availability are inherent in large computing environments

Built into Moab is an event-based trigger technology that allows for customized automated responses

As an extension of Moab, triggers have an integrated view of the resource pool – allowing for smarter resource-/workload-aware responses


Questions?

Additional Resources

Trigger documentation / how-to:
http://www.clusterresources.com/products/mwm/moabdocs/19.0triggers.shtml

Enabling Generic Metrics documentation:
http://www.clusterresources.com/products/mwm/docs/9.2accounting.shtml#gmetric


Appendices

Possible AType/EType Values

Action Types: cancel, changeparam, jobpreempt, mail, exec, query, internal

Event Types: cancel, checkpoint, create, end, epoch, fail, hold, migrate, modify, preempt, standing, start, threshold

Advanced Attributes

BlockTime = seconds Moab will suspend normal operation until trigger finishes executing
ExpireTime = time at which trigger should be terminated if it has not already been activated
FailOffset = seconds threshold must exist before the trigger fires
Interval = boolean that sets trigger to fire at regular intervals
MaxRetry = times to execute action before trigger 'gives up'
MultiFire = makes a trigger repeatable
Offset = how long after event to fire (or before for 'end' events)
RearmTime = how long to wait before rearming
Requires = variable dependency
Sets = variable on success, !variable on failure
UnSets = variable destroyed on trigger success
Timeout = how long trigger's process will run before being killed

'Special' Triggers

Mail Triggers
Requires MAILPROGRAM parameter in moab.cfg

Internal Triggers
Action="<OBJ_TYPE>:<OBJ_ID>:<ACTION>:<CONTEXT_DATA>"
Object types: job, node, reservation, standing reservation, scheduler, user
Actions: cancel, complete (system job), destroy (VPC), modify, reserve

# For example:
SRCFG[prov] TRIGGER=AType=internal,EType=start,Action="node:$HOSTLIST:modify:os=rhel4"
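The four-field internal action format can be taken apart mechanically, which is handy when auditing a long moab.cfg. The following is a minimal sketch that handles only the full four-field form shown on this slide, not a shortened form such as the bare "reserve" used in Example #6. The function name and the returned keys are hypothetical.

```python
def parse_internal_action(action):
    """Split '<OBJ_TYPE>:<OBJ_ID>:<ACTION>:<CONTEXT_DATA>' into named parts.

    Only the first three colons separate fields, so the context data
    may itself contain colons. Raises ValueError if fields are missing.
    """
    obj_type, obj_id, verb, *context = action.split(":", 3)
    return {
        "object_type": obj_type,
        "object_id": obj_id,
        "action": verb,
        "context": context[0] if context else "",
    }


print(parse_internal_action("node:$HOSTLIST:modify:os=rhel4"))
```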


Trigger Variables

$ETYPE $OWNER

$GROUPHOSTLIST $TIME

$HOSTLIST $USER

$MASTERHOST $VPCID

$OID $VPCHOSTLIST

$OS

$OTYPE (rsv,job,node,sched)
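Moab fills in these variables before it runs a trigger's action. When testing a trigger script by hand, the same substitution can be imitated with Python's string.Template, which uses the same $NAME syntax. This is only a stand-in for local testing, and the sample values are made up.

```python
from string import Template


def expand_action(action, variables):
    """Replace $NAME placeholders with values; unknown names are left untouched."""
    return Template(action).safe_substitute(variables)


# $OID is supplied here; $TOOLS is not, so it stays as written.
print(expand_action("$TOOLS/held_job.pl $OID", {"OID": "job.1234"}))
```

safe_substitute is used instead of substitute so that a placeholder with no value, such as $TOOLS above, is kept rather than raising an error.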