Monitoring and troubleshooting a glideinWMS-based HTCondor pool

glideinWMS for users: Monitoring and troubleshooting a glideinWMS-based HTCondor pool, by Igor Sfiligoi (UCSD). CERN, Dec 2012.

DESCRIPTION

A guide for users of glideinWMS-based HTCondor pools on how to monitor the system, and troubleshoot the most common problems.

TRANSCRIPT

Page 1: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

CERN, Dec 2012 glideinWMS monitoring 1

glideinWMS for users

Monitoring and troubleshooting a glideinWMS-based HTCondor pool

by Igor Sfiligoi (UCSD)

Page 2: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Scope of this talk

This talk describes what information is available when troubleshooting a glideinWMS-based HTCondor pool, and what tools you can use to mine it.

The reader is expected to already have a basic understanding of HTCondor and glideinWMS.

Page 3: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

HTCondor Architecture

● As a reminder

[Diagram. Labels: Central manager (Negotiator); Submit nodes (Schedd); Execute nodes; Condor; Grid; G.F.; VO FE]

Page 4: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Typical user questions addressed in this talk

● Where is/was my job running?
● Why are my jobs not starting?
● Why do my jobs take forever to finish?

Page 5: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Where is/was my job running?

Page 6: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Job progress monitoring

● HTCondor provides two basic means to monitor job progress
  ● Querying the system for current status
    – Using the cmdline condor_q/condor_history
  ● Parsing the job event log
    – Either plain text or XML formatted
    – Starting with 7.9.1, condor_history can be used to extract the last known state

Page 7: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Job status

● Each Job has a status associated with it
  ● An integer attribute called JobStatus
    – But has well known semantics associated with each value
● Jobs start in the Idle state
  ● Become Running if everything works fine
  ● Completed when they terminate
● If anything goes wrong, a Job will go into Hold
  ● If removed before completion, will be Removed
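The slide does not list the integer values behind these states. A minimal sketch of the mapping, assuming the conventional HTCondor numbering (1 to 5); the helper name `job_status_name` is made up for illustration:

```python
# Conventional JobStatus codes for the five states named on the slide.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
}


def job_status_name(code):
    """Return the symbolic name for a JobStatus value (hypothetical helper)."""
    return JOB_STATUS_NAMES.get(int(code), "Unknown (%s)" % code)


if __name__ == "__main__":
    for code in (1, 2, 5):
        print(code, job_status_name(code))
```

The `int()` call lets the helper accept the text printed by the command-line tools as well as real integers.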

Page 8: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Monitoring the Job Status

● Idle/Running/Held jobs can be polled with condor_q
  ● Will query the Schedd daemon
● Once they terminate, or are removed, they leave the Schedd queue
  ● Are put into a file on disk
  ● Can use condor_history to retrieve the last ClassAd
● The job event log has all the state transitions (of course)

One exception: if a job was running when it was removed, but the execute node does not confirm the job was killed remotely, the job will be kept in the Schedd.
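The "queue first, history second" lookup can be scripted. The sketch below is one possible way to do it, not an official tool: the function names are invented, and it assumes `condor_q` and `condor_history` are on the PATH and accept a `cluster.proc` job id followed by `-format`.

```python
import subprocess


def build_status_command(tool, job_id):
    """Build the argument list that asks `tool` for a job's JobStatus."""
    if tool not in ("condor_q", "condor_history"):
        raise ValueError("unexpected tool: %r" % tool)
    return [tool, job_id, "-format", "%d\n", "JobStatus"]


def parse_status_output(text):
    """Return the first integer printed by the command, or None if empty."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            return int(line)
    return None


def _run(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def last_known_status(job_id, run=_run):
    """Look in the Schedd queue first, then fall back to the history file."""
    for tool in ("condor_q", "condor_history"):
        status = parse_status_output(run(build_status_command(tool, job_id)))
        if status is not None:
            return tool, status
    return None, None
```

The `run` parameter exists only so the logic can be exercised without a live pool, e.g. `last_known_status("20327.2", run=my_fake_runner)`.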

Page 9: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

So, where is the job running?

● Easy to get the machine name and/or IP
  ● Standard HTCondor attributes RemoteHost & StartdIpAddr
● But they may not necessarily make sense
  ● Do you recognize all network domains?
  ● And they could be on a private network!
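A quick way to tell whether a StartdIpAddr value points at a private network is Python's standard `ipaddress` module. This is a sketch under the assumption that the address has the `<ip:port>` form seen in the event log later in this talk; the function names are hypothetical.

```python
import ipaddress


def parse_sinful(address):
    """Split an address such as '<193.48.85.94:38749>' into (ip, port)."""
    inner = address.strip().lstrip("<").rstrip(">")
    inner = inner.split("?", 1)[0]  # drop optional '?key=value' parameters
    host, _, port = inner.rpartition(":")
    return ipaddress.ip_address(host.strip("[]")), int(port)


def describe_address(address):
    """Say whether the address is on a private or a public network."""
    ip, port = parse_sinful(address)
    kind = "private" if ip.is_private else "public"
    return "%s port %d (%s address)" % (ip, port, kind)


if __name__ == "__main__":
    print(describe_address("<193.48.85.94:38749>"))
```

A private address only tells you that the IP alone will not identify the site, which is where the glidein attributes in the next slides come in.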

Page 10: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Getting glidein attributes

● Glideins have many more attributes that describe them
  ● e.g. symbolic site name GLIDEIN_CMSSite
● However, by default, you do not get this info in the Job ClassAd
● But easy to add
  ● <my attr> = $$(<glidein attr>:Unknown)
    – Will get the info in MATCH_EXP_<my attr>
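Once a job has matched, the expanded values can be picked out of the job ClassAd. A small sketch, assuming the ad was dumped as `Name = value` lines (the layout `condor_q -long` prints); the sample ad below is made up, reusing the site name that appears later in the talk.

```python
def parse_long_classad(text):
    """Parse 'Name = value' lines into a dict, dropping surrounding quotes."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:
            ad[name.strip()] = value.strip().strip('"')
    return ad


def matched_glidein_attrs(ad, prefix="MATCH_EXP_"):
    """Return the expanded $$() attributes, keyed without the prefix."""
    return {k[len(prefix):]: v for k, v in ad.items() if k.startswith(prefix)}


# Illustrative input only, not captured from a real pool.
SAMPLE_AD = """\
ClusterId = 20327
ProcId = 2
JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"
MATCH_EXP_JOB_CMSSite = "T2_FR_IPHC"
"""

if __name__ == "__main__":
    print(matched_glidein_attrs(parse_long_classad(SAMPLE_AD)))
```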

Page 11: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Standard attributes

● Standard glideinWMS attributes
  ● JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
  ● JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
  ● JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
  ● JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
  ● JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:...)"
  ● JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
  ● JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
● Standard CMS glideinWMS attribute
  ● JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"

Useful for in-depth debugging

Configured by the HTCondor admin, no need for the user to do anything
    SUBMIT_EXPRS = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, ...

Page 12: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Getting them in the event log

● You (or the admins) can also propagate the attributes into the event log
    job_ad_information_attrs = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, …
● As a result you get “Job Ad” events

    ...
    001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
    ...
    028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
    TriggerEventTypeNumber = 1
    Cluster = 20327
    EventTypeNumber = 28
    ExecuteHost = "<193.48.85.94:38749>"
    JOB_CMSSite = "T2_FR_IPHC"
    EventTime = "2012-12-03T00:46:33"
    TriggerEventTypeName = "ULOG_EXECUTE"
    Proc = 2
    Subproc = 0
    CurrentTime = time()
    MyType = "ExecuteEvent"
    ...
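These “Job Ad” events are easy to mine with a few lines of code. The sketch below assumes the plain-text log layout shown above, with each event closed by a line containing only `...`; the function name is hypothetical and the sample is an abridged copy of the slide's excerpt.

```python
def parse_job_ad_events(log_text):
    """Return one dict per 'Job ad information' (028) event in a user log."""
    events, current = [], None
    for line in log_text.splitlines():
        line = line.strip()
        if line == "...":  # event terminator
            if current is not None:
                events.append(current)
            current = None
        elif line.startswith("028 "):  # header of a Job ad information event
            current = {}
        elif current is not None and " = " in line:
            name, _, value = line.partition(" = ")
            current[name] = value.strip('"')
    if current is not None:  # tolerate a log cut off mid-event
        events.append(current)
    return events


SAMPLE_LOG = """\
...
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
...
"""

if __name__ == "__main__":
    for event in parse_job_ad_events(SAMPLE_LOG):
        print(event["Cluster"], event["Proc"], event.get("JOB_CMSSite"))
```

All values come back as strings; convert `Cluster` and `Proc` with `int()` if you need to sort or join on them.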

Page 13: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Why is my job not starting?

Page 14: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Troubleshooting process

● First question
  ● Do my jobs match any (logical) resource?
● Once you are sure of that
  ● Are there jobs from higher priority users?
  ● Are desired sites just too busy?
  ● Are there problems at desired site(s)?
● If nothing gives a satisfying answer
  ● It may be a glideinWMS misconfiguration, seek help from VO FE admins

Page 15: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

How do I know if my jobs match?

● Good question!
  ● Unfortunately, the answer is not trivial
● The FE matching policy is not “public”
  ● And, of course, there are no tools to probe for it
● You will have to rely on the FE admins to “explain” the policy
  ● Hopefully in a human readable format
  ● Hopefully without conversion errors!

Page 16: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

An example FE policy

● See the CMS FE talk for an actual high level view
● The actual FE policy is a python expression

    (glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")) and
    ((glidein["attrs"].get("GLIDEIN_Is_HTPC")=="True") == (job.get("DESIRES_HTPC")==1))

● And then there is the matching HTCondor one

    (stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",")=?=True) &&
    ((GLIDEIN_Is_HTPC=?=True)=?=(DESIRES_HTPC=?=True))

A simple example – could be much more complex
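If the FE admins do share the Python expression, you can try it on paper against your own job attributes. The sketch below wraps the example policy from this slide in a function; the job and glidein dictionaries are invented test data, and a real frontend evaluates the expression in its own context with many more attributes.

```python
def fe_match(job, glidein):
    """The example frontend policy from the slide, wrapped in a function."""
    site_ok = glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")
    htpc_ok = (glidein["attrs"].get("GLIDEIN_Is_HTPC") == "True") == (
        job.get("DESIRES_HTPC") == 1
    )
    return site_ok and htpc_ok


# Made-up job and entries, for illustration only.
job = {"DESIRED_Sites": "T2_FR_IPHC,T2_US_UCSD"}
entries = [
    {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC"}},
    {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC", "GLIDEIN_Is_HTPC": "True"}},
    {"attrs": {"GLIDEIN_CMSSite": "T1_US_FNAL"}},
]

if __name__ == "__main__":
    for entry in entries:
        print(entry["attrs"], "->", fe_match(job, entry))
```

Note how a job that does not set DESIRES_HTPC matches only the non-HTPC entry: both sides of the second comparison must agree.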

Page 17: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

A word about HTCondor matching

● Once glideins start, you can probe their policy
    condor_status -format '%s' START
● But no tools to help you understand the matchmaking (M.M.)
  ● The closest is condor_q -analyze
    – But only looks at Job requirements
    – So, not really helping when all/most of the policy is in the glideins

    $ condor_status -format '%s\n' START
    ( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) ) )
    ( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) ) )
    ...
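Since no tool explains the START expression for you, a first step is simply to list the attribute names it refers to, so you can check which ones your job is expected to define. The sketch below is a rough regular-expression pass, not a ClassAd parser; the function name is hypothetical and it will be fooled by unusual expressions.

```python
import re

CLASSAD_KEYWORDS = {"true", "false", "undefined", "error"}


def referenced_attributes(expression):
    """Return the attribute names used in a ClassAd expression (rough sketch)."""
    no_strings = re.sub(r'"[^"]*"', '""', expression)  # ignore string literals
    names = set()
    for match in re.finditer(r"[A-Za-z_][A-Za-z0-9_]*", no_strings):
        name = match.group(0)
        rest = no_strings[match.end():].lstrip()
        if name.lower() in CLASSAD_KEYWORDS or rest.startswith("("):
            continue  # literal keyword or function call
        names.add(name)
    return names


# The START expression printed on the slide.
START = (
    '( ( true ) && ( true ) && ( true ) && '
    '( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && '
    '( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && '
    '( ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) ) )'
)

if __name__ == "__main__":
    print(sorted(referenced_attributes(START)))
```

In this example DESIRED_Sites and DESIRES_HTPC are the names that have to come from the job; the GLIDEIN_ ones are supplied by the glidein.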

Page 18: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

User priorities

● So, jobs should be matching, but are not starting
  ● And there are plenty of matching glideins in the system
● Likely there are other higher-priority jobs in the system
  ● Possibly from a different user
      condor_userprio
  ● Possibly on a different schedd
      condor_status -submitters
● No tools to give you the easy answer
  ● If you need the answer, you will have to investigate

Warning: Slow!
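Part of that investigation can be scripted by summarizing the submitter ClassAds. The sketch below assumes the submitter ads carry the Name, RunningJobs and IdleJobs attributes and that names contain no spaces; the helper names and the sample output are made up.

```python
# Command whose output the parser below expects (one submitter per line).
SUBMITTERS_CMD = [
    "condor_status", "-submitters",
    "-format", "%s ", "Name",
    "-format", "%d ", "RunningJobs",
    "-format", "%d\n", "IdleJobs",
]


def parse_submitters(text):
    """Turn 'name running idle' lines into (name, running, idle) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            rows.append((parts[0], int(parts[1]), int(parts[2])))
    return rows


def busiest_other_submitters(rows, me):
    """List everyone except `me`, most running jobs first."""
    others = [row for row in rows if row[0] != me]
    return sorted(others, key=lambda row: row[1], reverse=True)


# Illustrative output only.
SAMPLE_OUTPUT = "alice@example.org 10 500\nbob@example.org 300 20\n"

if __name__ == "__main__":
    rows = parse_submitters(SAMPLE_OUTPUT)
    print(busiest_other_submitters(rows, "alice@example.org"))
```

A submitter active on several schedds shows up once per schedd, which is exactly the "different schedd" case mentioned above.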

Page 19: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Unclaimed glideins

● If you see plenty of Unclaimed glideins, but no matching jobs from other users
  ● You have either reached the schedd limit MAX_JOBS_RUNNING
  ● Or something bad is going on!
● You can only ask your FE admin for help
  ● But first double check that your jobs should indeed be matching, at least on paper

Page 20: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Supported Sites

● What should you do if there are no (new) glideins coming from an expected site?
● First off, see if the site is even supported by the glideinWMS instance!
● Each Entry has a ClassAd
    condor_status -any -const 'MyType=="glideresource"'
● Look for the attributes your FE is matching on
  e.g. GLIDEIN_CMSSite

Site not there? Notify your FE admin!

Page 21: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Is the FE even asking for them?

● You are sure that your jobs should be matching?
  ● But what if you are wrong?
● Check it out
    … -format '%i\n' GlideFactoryMonitorRequestedIdle

But remember it is not just your jobs.

Page 22: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Maybe the site is just busy?

● Glideins have to compete with other Grid jobs at most sites
  ● Maybe the site is just busy?
● Check if glideinWMS has put any glideins in the Grid queues
    … -format '%i\n' GlideFactoryMonitorStatusPending

If you find zeros, notify your FE admin!

Page 23: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Site problems?

● The glideins will validate the worker node before talking to the C.M.
  ● If the test fails, the glidein will “waste” 20 mins on the node to prevent other jobs from failing on it again
● You can check if there are “Running” glideins in glideinWMS, even though you see none (or few) in the C.M.
    … -format '%i\n' GlideFactoryMonitorStatusRunning

If you find a discrepancy, notify your FE admin!
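The checks from the last three slides can be strung together into a single decision. The sketch below takes the three GlideFactoryMonitor counters for one entry, however you obtained them, plus the number of glideins from that entry you actually see in the pool. The function name, the 50% `tolerance` and the wording of the verdicts are assumptions made for illustration.

```python
def diagnose_entry(requested_idle, pending, running, seen_in_pool, tolerance=0.5):
    """Apply the RequestedIdle / StatusPending / StatusRunning checks in order."""
    if requested_idle == 0 and pending == 0 and running == 0:
        return "FE is not asking for glideins here: re-check that your jobs match"
    if running > 0 and seen_in_pool < running * tolerance:
        return "discrepancy between factory and pool: notify your FE admin"
    if pending == 0 and running == 0:
        return "glideins requested but none pending: notify your FE admin"
    if pending > 0 and running == 0:
        return "glideins are waiting in the Grid queue: the site may just be busy"
    return "no obvious problem at this entry"


if __name__ == "__main__":
    print(diagnose_entry(requested_idle=10, pending=5, running=100, seen_in_pool=3))
```

Keep in mind that the counters cover all users served by the FE, so a healthy-looking entry does not prove that your own jobs are the ones being served.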

Page 24: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Still no clue?

● If all your detective work fails
  ● Notify your VO FE admin
● They have access to information you don't

Page 25: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Why do my jobs take forever to finish?

Page 26: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

My jobs are running, but...

● Great, your jobs are happily running
  ● But you are getting no results back!
  ● i.e. the jobs are not finishing in the expected time
● Two main likely reasons
  ● They are being restarted
  ● You miscalculated the needed time

Page 27: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Jobs re-starting

● HTCondor tries to be user friendly
  ● If a job gets preempted, for almost any reason, it will try to re-start it with the hope it will finish on the next try
  ● And will not ever give up! (by default)
● You can easily check how many times it started
    condor_q -format '%i\n' NumJobStarts
● You may want to cap the number with periodic_hold/remove

http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html#condor-submit-periodic-remove
http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#param:SystemPeriodicRemove
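To see which jobs are restarting, it helps to print the job id next to the counter. The sketch below is one way to do that; the helper names and the threshold of 3 are arbitrary choices. It puts the newline at the start of each record because a job that has never started has no NumJobStarts attribute, so nothing is printed for it.

```python
# Command whose output the parser below expects (one job per line).
RESTARTS_CMD = [
    "condor_q",
    "-format", "\n%d.", "ClusterId",
    "-format", "%d ", "ProcId",
    "-format", "%d", "NumJobStarts",
]


def parse_restarts(text):
    """Map 'cluster.proc' to its start count (0 when the counter is absent)."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        counts[parts[0]] = int(parts[1]) if len(parts) > 1 else 0
    return counts


def restarted_jobs(counts, max_starts=3):
    """Return the ids of jobs that started more than max_starts times."""
    return sorted(job for job, starts in counts.items() if starts > max_starts)


# Illustrative output only.
SAMPLE_OUTPUT = "\n20327.0 1\n20327.1 \n20327.2 7"

if __name__ == "__main__":
    print(restarted_jobs(parse_restarts(SAMPLE_OUTPUT)))
```

The same threshold could be enforced by the system itself with a submit-file line along the lines of `periodic_hold = NumJobStarts > 3`; see the condor_submit manual page linked above for the exact semantics.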

Page 28: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Why is it restarting?

● OK, I now know it is restarting... but why?
  ● Most likely, the glidein was killed
● Was it due to your job “misbehaving”?
● Most Grid sites have limits on resource use
  ● Including CPU, memory and disk
  ● If you exceed them, the glidein (and you) will be killed
● Glideins should be configured to detect and hold/remove your job if you “misbehave”
  ● Thus you would not be re-started
  ● If you see many restarts, notify your FE admin

Likely there is a policy rule missing

Page 29: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

What is my job doing?

● What if it is not restarting... just running forever (or until hitting the time limit)
● HTCondor allows for peeking at a running job
  ● A cmdline tool called condor_ssh_to_job
  ● Unfortunately, needs implicit permission from site
    – And about half of the sites don't allow it

Page 30: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

The End

Page 31: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Pointers

● glideinWMS Home Page
    http://tinyurl.com/glideinWMS

● HTCondor Home Page
    http://research.cs.wisc.edu/htcondor/

● HTCondor [email protected]@cs.wisc.edu

● glideinWMS [email protected]

Page 32: Monitoring and troubleshooting a glideinWMS-based HTCondor pool

Acknowledgments

● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system