Matchmaking in glideinWMS in CMS
DESCRIPTION
This document provides a high-level overview of how glideinWMS-based instances do matchmaking in CMS (a High Energy Physics experiment). The information is accurate as of early Dec 2012.
TRANSCRIPT
CERN, Dec 2012 glideinWMS matchmaking 1
glideinWMS for users
Matchmaking in glideinWMS in CMS
by Igor Sfiligoi (UCSD)
Scope of this talk
This talk provides a high-level description of how glideinWMS matchmaking works in CMS.
The reader is expected to be familiar with the CMS experiment environment
http://cms.web.cern.ch/
glideinWMS architecture
● A reminder
[Figure: glideinWMS architecture diagram, showing the Central manager (Negotiator), Submit nodes (Schedd), Execute nodes (Condor) on the Grid, the G.F. and the VO FE]
Two levels of matchmaking
● First in the VO Frontend
  ● To decide where to provision resources
  ● i.e. where to send glideins
● Then in the HTCondor Negotiator
  ● To decide which Job gets the glidein Slot
[Figure: the glideinWMS architecture diagram again, showing the Central manager (Negotiator), Submit nodes (Schedd), Execute nodes (Condor) on the Grid, the G.F. and the VO FE]
The two levels must have compatible policies
Defining the policy
● The VO FE configures the glideins
  ● So it can define the Slot Requirements
● The preferred strategy is to leave all policy decisions in the hands of the VO FE, i.e. both
  ● VO FE matchmaking policy
  ● HTCondor matchmaking policy
  ● It is easier to keep them in sync this way
● This implies
  ● Users should not define Job Requirements
  ● Instead, publish attributes describing requirements
http://www.slideshare.net/igor_sfiligoi/condor-week-12-attribute-matchmaking-move-req-out-of-user-hands
CMS Production @ CERN Policies
Description
● The VO FE @ CERN serves the production needs
  ● i.e. Reconstruction and MC production
● Job submission is regulated by a service managed by a dedicated team, so jobs are
  ● Targeted
  ● Well behaved (at least by and large)
Matchmaking policy
● Two dimensions
  ● Grid Site
  ● Single CPU vs HTPC
● The actual policy is the AND of both
  ● Both VO FE policy and HTCondor policy defined in the VO FE instance configuration
Matching on Grid site name
● User Jobs are expected to publish the attribute DESIRED_Sites (a string list)
  ● e.g. +DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD"
● The G.F. and the glideins advertise GLIDEIN_CMSSite
● The matchmaking policy is GLIDEIN_CMSSite ∈ DESIRED_Sites
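The membership test above can be sketched in Python. This is only an illustration: the real policy is an expression in the VO FE configuration, the helper names parse_string_list and site_matches are hypothetical, and the non-matching site name is made up.

```python
def parse_string_list(value):
    """Split a comma-separated string list into a set of trimmed, non-empty names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def site_matches(job_ad, glidein_ad):
    """Return True when GLIDEIN_CMSSite is a member of the job's DESIRED_Sites."""
    desired = parse_string_list(job_ad.get("DESIRED_Sites", ""))
    return glidein_ad.get("GLIDEIN_CMSSite") in desired


job = {"DESIRED_Sites": "T2_DE_DESY,T2_US_UCSD"}
print(site_matches(job, {"GLIDEIN_CMSSite": "T2_US_UCSD"}))   # True
print(site_matches(job, {"GLIDEIN_CMSSite": "T2_XX_Other"}))  # False (made-up site name)
```

A job that does not publish DESIRED_Sites matches nothing in this sketch, since the empty list has no members.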
Matching on Job Type
● User Jobs can publish the attribute DESIRES_HTPC (integer representation of a Boolean value)
  ● e.g. +DESIRES_HTPC = 1
  ● If not defined, defaults to 0
● The G.F. and the glideins may advertise GLIDEIN_Is_HTPC (Boolean value)
  ● If not defined, defaults to False
● The matchmaking policy is (GLIDEIN_Is_HTPC==True)==(DESIRES_HTPC==1)
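A minimal Python sketch of this rule, including the two defaults. The function name htpc_matches and the dictionary representation of the Job and glidein attributes are assumptions made for illustration; the actual policy lives in the VO FE configuration.

```python
def htpc_matches(job_ad, glidein_ad):
    """Mirror (GLIDEIN_Is_HTPC == True) == (DESIRES_HTPC == 1) with the stated defaults."""
    desires_htpc = job_ad.get("DESIRES_HTPC", 0)        # integer, defaults to 0
    is_htpc = glidein_ad.get("GLIDEIN_Is_HTPC", False)  # Boolean, defaults to False
    return (is_htpc is True) == (desires_htpc == 1)


print(htpc_matches({}, {}))                                            # True
print(htpc_matches({"DESIRES_HTPC": 1}, {}))                           # False
print(htpc_matches({"DESIRES_HTPC": 1}, {"GLIDEIN_Is_HTPC": True}))    # True
print(htpc_matches({"DESIRES_HTPC": 0}, {"GLIDEIN_Is_HTPC": True}))    # False
```

The equality form means the match is symmetric: single CPU Jobs stay off HTPC glideins just as HTPC Jobs stay off single CPU ones.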
Example submit file
Universe = vanilla
Executable = mcgen
Arguments = -k 1543.3
Output = mcgen.out
Error = mcgen.err
Log = mcgen.log
+DESIRED_Sites = "T2_DE_DESY,T2_US_UCSD"
+DESIRES_HTPC = 0
Requirements = True
Queue 1
CMS AnaOps @ UCSD Policies
Description
● VO FE @ UCSD serves CMS analysis users
  ● User Jobs much more chaotic
● Most users don't really understand their needs
  ● Must protect from accidental errors
  ● Yet keep the system flexible
● Net result
  ● More complex policy
Two different policies
● The AnaOps FE actually has two policies
  ● The Regular policy
  ● The Overflow policy
● The Regular policy tries to match resources
  ● Based on User desires
● The Overflow policy “outsmarts” the Users
  ● Will violate User desires without breaking the Jobs
  ● The aim is to finish user jobs sooner
  ● Users can opt out if they wish
The Regular M.M. policy
● Four+one dimensions
  ● Grid Site
  ● Single CPU vs HTPC
  ● Memory usage
  ● Job duration
  ● Number of Job Starts (due to preemption)
● The actual policy is the AND of all of them
  ● Both VO FE policy and HTCondor policy defined in the VO FE instance configuration
Grid site selection
● This is both similar to and different from the Production FE @ CERN
  ● Serves the same purpose, but supports three different ways to select a site
    – Due to historical evolution
● The three options are
  ● GLIDEIN_CMSSite ∈ DESIRED_Sites
  ● GLIDEIN_SEs ∈ DESIRED_SEs
    – Planning to extend to (GLIDEIN_SEs ∩ DESIRED_SEs) ≠ ∅
  ● GLIDEIN_Gatekeeper ∈ DESIRED_Gatekeepers
● The actual policy is the OR of the three
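The OR of the three membership tests can be sketched in Python as follows. The helper names and the dictionary representation are hypothetical, and each glidein attribute is treated as a single value, as in the current (not the planned) form of the policy.

```python
def parse_string_list(value):
    """Split a comma-separated string list into a set of trimmed, non-empty names."""
    return {item.strip() for item in value.split(",") if item.strip()}


# (glidein attribute, job attribute) pairs for the three ways to select a site
SITE_ATTRIBUTE_PAIRS = [
    ("GLIDEIN_CMSSite", "DESIRED_Sites"),
    ("GLIDEIN_SEs", "DESIRED_SEs"),
    ("GLIDEIN_Gatekeeper", "DESIRED_Gatekeepers"),
]


def anaops_site_matches(job_ad, glidein_ad):
    """Return True if any one of the three membership tests succeeds."""
    for glidein_attr, job_attr in SITE_ATTRIBUTE_PAIRS:
        advertised = glidein_ad.get(glidein_attr)
        desired = parse_string_list(job_ad.get(job_attr, ""))
        if advertised is not None and advertised in desired:
            return True
    return False


job = {"DESIRED_SEs": "dc2-grid-64.brunel.ac.uk,stormfe1.pi.infn.it"}
print(anaops_site_matches(job, {"GLIDEIN_SEs": "stormfe1.pi.infn.it"}))  # True
print(anaops_site_matches(job, {"GLIDEIN_CMSSite": "T2_US_UCSD"}))       # False
```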
Job type selection
● Just like @ CERN
Memory Usage
● Most Grid sites put strict limits on the amount of memory that can be used
  ● Will kill glideins if they exceed the limit
● G.F. and glideins advertise the Entry-specific limit GLIDEIN_MaxMemMBs
● Jobs can explicitly declare the needed memory with request_memory (native Condor attribute, no + needed)
  ● Condor will also measure it at run time
    – ImageSize – Virtual memory used
    – ResidentSetSize – True memory usage
  ● Use a combination of these to calculate the actual JobMemory
● Policy: JobMemory <= GLIDEIN_MaxMemMBs
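The slides do not spell out how the declared and measured values are combined. The Python sketch below assumes one plausible rule, the larger of the declared request and the measured resident set size; that rule, the helper names and the limits used in the example are assumptions, not the actual AnaOps formula. It also assumes that request_memory appears in the Job as RequestMemory (in MB) and that ResidentSetSize is reported in KiB.

```python
KIB_PER_MIB = 1024


def job_memory_mb(job_ad):
    """Estimate JobMemory in MB (assumed rule: max of declared and measured usage)."""
    requested_mb = job_ad.get("RequestMemory", 0)                  # from request_memory
    measured_mb = job_ad.get("ResidentSetSize", 0) / KIB_PER_MIB   # measured at run time
    return max(requested_mb, measured_mb)


def memory_matches(job_ad, glidein_ad):
    """Policy: JobMemory <= GLIDEIN_MaxMemMBs."""
    return job_memory_mb(job_ad) <= glidein_ad["GLIDEIN_MaxMemMBs"]


# A job that asked for 1500 MB but was measured at 2000 MB on an earlier run
job = {"RequestMemory": 1500, "ResidentSetSize": 2048000}
print(job_memory_mb(job))                                 # 2000.0
print(memory_matches(job, {"GLIDEIN_MaxMemMBs": 2500}))   # True  (hypothetical limit)
print(memory_matches(job, {"GLIDEIN_MaxMemMBs": 1800}))   # False (hypothetical limit)
```

Taking the measured value into account protects the glidein from Jobs whose declared request turned out to be too low.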
Job Duration 1/2
● Glideins have a limited lifetime
  ● Must fit within the limits of the Grid site's queue
  ● Glideins publish the deadline GLIDEIN_ToDie
    – Jobs must finish before reaching the deadline
● Final user job lifetime unpredictable
  ● Depends on the type of computing done
  ● User should indicate the expected job lifetime
    – Else we have to assume reasonable defaults
    – Not many users set this value right now
Job Duration 2/2
● The same type of computation may take different amounts of time
  ● e.g. based on the type of input
● Jobs can declare two attributes
  ● NormMaxWallTimeMins – Expected limit
  ● MaxWallTimeMins – Absolute max limit
● The matchmaking logic is
  ● Use NormMaxWallTimeMins for the first job startup
  ● Use MaxWallTimeMins for all others
    – Based on the simple assumption that the job was killed for hitting the deadline
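The duration logic can be sketched in Python. Several things here are assumptions: that GLIDEIN_ToDie is an absolute time in seconds, that the number of previous starts is read from HTCondor's NumJobStarts Job attribute, and the two default values, which are purely hypothetical placeholders for the "reasonable defaults".

```python
DEFAULT_NORM_MAX_WALLTIME_MINS = 12 * 60  # hypothetical default
DEFAULT_MAX_WALLTIME_MINS = 24 * 60       # hypothetical default


def expected_walltime_mins(job_ad):
    """Use the expected limit on the first start, the absolute limit on any restart."""
    if job_ad.get("NumJobStarts", 0) == 0:
        return job_ad.get("NormMaxWallTimeMins", DEFAULT_NORM_MAX_WALLTIME_MINS)
    return job_ad.get("MaxWallTimeMins", DEFAULT_MAX_WALLTIME_MINS)


def duration_matches(job_ad, glidein_ad, now):
    """Match only if the job is expected to finish before the glidein deadline."""
    finish_by = now + expected_walltime_mins(job_ad) * 60
    return finish_by <= glidein_ad["GLIDEIN_ToDie"]


glidein = {"GLIDEIN_ToDie": 4 * 3600}  # deadline 4 hours after time 0
job = {"NormMaxWallTimeMins": 120, "MaxWallTimeMins": 480}
print(duration_matches(job, glidein, now=0))                         # True
print(duration_matches({**job, "NumJobStarts": 1}, glidein, now=0))  # False
```

After a restart the Job is only matched to glideins with enough lifetime left for the absolute limit, so the second attempt should not hit the deadline again.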
Cut on number of re-starts
● Not really a user configurable property
  ● More an emergency brake
● In a properly configured system, should never be triggered
  ● But unexpected problems happen
  ● So better limit the damage
The Overflow Use case
● User Jobs specify a list of sites, because the data they need is there
● With recent versions of CMSSW, jobs can access the data remotely
  ● With a small performance penalty
● We can thus schedule jobs “anywhere”
  ● As long as the needed data is at a Site that has joined the xrootd federation
  ● But only if no CPU available “close to the data”
    – And not too far, either
http://indico.cern.ch/contributionDisplay.py?contribId=381&sessionId=5&confId=149557
http://indico.cern.ch/contributionDisplay.py?contribId=232&sessionId=8&confId=149557
The Overflow M.M. policy
● Violate only the “Site selection” rule
  ● Keep all the others
● Plus, add one+one more:
  ● An opt-out mechanism
  ● Delayed matching
New Site M.M. policy
● The user specified attribute is used to flag the job as “Overflowable”
  ● i.e. the job will match if and only if (DESIRED_<site>s ∩ SUPPORTED_<site>s) ≠ ∅
  ● Still support all 3 types of site identification
● Matching jobs can then run on any glidein
  ● Additional limits can be put in place by the FE, but mostly invisible to the user
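A Python sketch of the “Overflowable” test. It assumes the supported names are available as one set per site identification type; the function names, the shape of the supported argument and its contents are made up for illustration and say nothing about which sites actually joined the federation.

```python
def parse_string_list(value):
    """Split a comma-separated string list into a set of trimmed, non-empty names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def is_overflowable(job_ad, supported):
    """True if any DESIRED_* list shares at least one name with its supported set.

    `supported` maps a job attribute name (e.g. "DESIRED_SEs") to the set of
    supported names of that type (hypothetical representation).
    """
    for job_attr, supported_names in supported.items():
        desired = parse_string_list(job_ad.get(job_attr, ""))
        if desired & supported_names:  # non-empty intersection
            return True
    return False


# Made-up supported sets, one per site identification type
supported = {
    "DESIRED_Sites": {"T2_US_UCSD"},
    "DESIRED_SEs": {"stormfe1.pi.infn.it"},
    "DESIRED_Gatekeepers": set(),
}
job = {"DESIRED_SEs": "dc2-grid-64.brunel.ac.uk,stormfe1.pi.infn.it"}
print(is_overflowable(job, supported))                              # True
print(is_overflowable({"DESIRED_Sites": "T2_DE_DESY"}, supported))  # False
```

Note that the result does not depend on the glidein at all: once a Job is flagged, the site dimension no longer restricts where it runs.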
The opt-out mechanism
● The Overflow policy considers all jobs by default
  ● But Users may want to opt some of the Jobs out
    – Sometimes it is just a need (to get deterministic results, e.g. for testing a site)
● To opt out, the user defines +CMS_ALLOW_OVERFLOW = False
● The FE will not consider such jobs for Overflowing
Delayed matching
● As said initially, Jobs should preferentially run close to the data
  ● Overflow should only consider jobs “that cannot find resources close to the data”
● We implemented it based on time
  ● Jobs are matched only if waiting in the queue for more than 6 hours
  ● Users cannot influence it
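The opt-out and the 6 hour delay can be combined in a short Python sketch. It assumes the waiting time is measured from HTCondor's QDate Job attribute (the submission time in seconds since the epoch); the slides do not say which timestamp is actually used, and the function name is hypothetical.

```python
OVERFLOW_DELAY_SECS = 6 * 3600  # the 6 hours quoted above


def overflow_considers(job_ad, now):
    """True if the Overflow policy may consider this job at time `now`."""
    # Opt-out: jobs are considered by default, unless explicitly set to False
    if job_ad.get("CMS_ALLOW_OVERFLOW", True) is False:
        return False
    # Delayed matching: only jobs waiting in the queue for more than 6 hours
    return now - job_ad["QDate"] > OVERFLOW_DELAY_SECS


submitted = 1_000_000  # arbitrary submission timestamp
print(overflow_considers({"QDate": submitted}, now=submitted + 7 * 3600))  # True
print(overflow_considers({"QDate": submitted}, now=submitted + 1 * 3600))  # False
print(overflow_considers({"QDate": submitted, "CMS_ALLOW_OVERFLOW": False},
                         now=submitted + 7 * 3600))                        # False
```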
Example submit file
Universe = vanilla
Executable = myana
Arguments = -k 1543.3
Output = myana.out
Error = myana.err
Log = myana.log
request_memory = 1500
+DESIRED_SEs = "dc2-grid-64.brunel.ac.uk,stormfe1.pi.infn.it"
+NormMaxWallTimeMins = 7200
+MaxWallTimeMins = 14400
+DESIRES_HTPC = 0
+CMS_ALLOW_OVERFLOW = True
Requirements = True
Queue 1
The End
Pointers
● glideinWMS Home Page: http://tinyurl.com/glideinWMS
● HTCondor Home Page: http://research.cs.wisc.edu/htcondor/
● HTCondor [email protected]@cs.wisc.edu
● glideinWMS [email protected]
Acknowledgments
● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system