Jaime Frey, Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
{jfrey|tannenba}@cs.wisc.edu
http://www.cs.wisc.edu/condor
OGF 19
Condor Software Forum
Condor-G
www.cs.wisc.edu/condor
What Is It?
› Condor-G is a specialization of Condor. It is also known as the “grid universe”.
› Condor-G speaks many different job management protocols.
› Condor-G benefits from all the wonderful Condor features, like a real job queue.
Grid Fault-Tolerance
› Condor-G does whatever it takes to run your jobs, even if …
  › Your local machine crashes
  › The grid service is temporarily unavailable
  › The network goes down
Remote Resource Access: Globus
[Diagram: “globusrun myjob …” in Organization A talks over the Globus GRAM Protocol to the Globus JobManager in Organization B, which starts the job with fork().]
Globus + Condor
[Diagram: “globusrun myjob …” in Organization A talks over the Globus GRAM Protocol to the Globus JobManager in Organization B, which submits the job to a Condor pool.]
Condor-G + Globus + Condor
[Diagram: Condor-G in Organization A holds a queue of jobs (myjob1, myjob2, myjob3, myjob4, myjob5, …) and sends them over the Globus GRAM Protocol to the Globus JobManager in Organization B, which submits them to a Condor pool.]
Condor-G Fault-Tolerance: Lost Contact with Remote Jobmanager
› Can we contact gatekeeper?
  › Yes – jobmanager crashed
  › No – retry until we can talk to gatekeeper again…
› Can we reconnect to jobmanager?
  › Yes – network was down
  › No – machine crashed or job completed
› Has job completed?
  › Yes – update queue
  › No – is job still running?
› Restart jobmanager
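One possible reading of the recovery flowchart, as a minimal Python sketch. The slide gives only the questions and their outcomes, so the order of the checks, the function name `recovery_step`, and the returned strings are all assumptions made for illustration. This is not Condor-G's actual gridmanager code.

```python
def recovery_step(can_reconnect_jobmanager: bool,
                  can_contact_gatekeeper: bool,
                  job_completed: bool) -> str:
    """Pick the next action after contact with the remote jobmanager is lost.

    Hypothetical helper: the three booleans stand in for the probes that
    the slide phrases as questions.
    """
    if can_reconnect_jobmanager:
        # The jobmanager answers again, so only the network had been down.
        return "network was down: resume monitoring"
    if not can_contact_gatekeeper:
        # The remote machine is unreachable: keep retrying the gatekeeper.
        return "retry until we can talk to gatekeeper again"
    # The gatekeeper answers but the jobmanager does not: restart the
    # jobmanager and ask it about the job.
    if job_completed:
        return "restart jobmanager: job completed, update queue"
    return "restart jobmanager: check whether job is still running"
```

The reconnect check comes first in this sketch because a successful reconnect makes the other two questions unnecessary.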
Just to be fair…
› The gatekeeper doesn’t have to submit to a Condor pool.
  › It could be PBS, LSF, Sun Grid Engine…
› Condor-G will work fine whatever the remote batch system is.
Other Condor-G Features
› Other Grid Protocols
  › Works with WS-GRAM, NorduGrid, Unicore
› Credential Management
  › Pull refreshed credentials from MyProxy
  › Push refreshed credentials to remote systems
› Job Scheduling
  › Use Matchmaking to select resources for jobs
› GlideIn
  › Allows late binding of resources and job checkpoint/migration
Condor-G
[Diagram: Condor-G takes a Job Description (Job ClassAd) and talks to several back ends and protocols, labeled GT2 [.1|2|4], HTTPS, Condor, PBS/LSF, NorduGrid, GT4, WSRF, and Unicore.]
Pre-WS GRAM
› Submit file
    grid_resource = gt2 \
        foo.edu/jobmanager-pbs
    globus_rsl = (queue=long)\
        (condor_submit=(universe java))
OGSA GRAM
› Submit file
    grid_resource = gt3 http://foo.edu/ogsa/services/base/gram/PBSManagedJobFactoryService
    globus_rsl = (queue=long)\
        (condor_submit=(universe java))
› Museum mode
WS GRAM
› Submit file
    grid_resource = gt4 foo.edu PBS
    globus_xml = <queue>long</queue>
NorduGrid
› Submit file
    grid_resource = nordugrid foo.edu
    nordugrid_rsl = (queue=long)
Unicore
› Submit file
    grid_resource = unicore usite.org vsite
    keystore_file = keystore
    keystore_passphrase_file = keystore.pw
    keystore_alias = my cert
Condor
› Submit file
    grid_resource = condor schedd.foo.edu \
        cm.foo.edu
    remote_universe = java
PBS
› Submit file
    grid_resource = pbs
LSF
› Submit file
    grid_resource = lsf
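The grid_resource values on the preceding slides all follow one pattern: a grid type followed by zero or more arguments. A rough Python sketch of that pattern follows. The helper names `grid_resource` and `submit_description` are hypothetical. The argument counts are read off the slide examples only and are an assumption, not a specification of what Condor accepts.

```python
# Argument counts as they appear in the slide examples (an assumption).
ARG_COUNTS = {
    "gt2": 1,        # gatekeeper/jobmanager contact string
    "gt3": 1,        # factory service URL
    "gt4": 2,        # host, then scheduler type
    "nordugrid": 1,  # host
    "unicore": 2,    # usite, then vsite
    "condor": 2,     # remote schedd, then remote central manager
    "pbs": 0,
    "lsf": 0,
}

def grid_resource(grid_type: str, *parts: str) -> str:
    """Build a grid_resource value: the grid type followed by its arguments."""
    if grid_type not in ARG_COUNTS:
        raise ValueError(f"unknown grid type: {grid_type}")
    if len(parts) != ARG_COUNTS[grid_type]:
        raise ValueError(f"{grid_type} takes {ARG_COUNTS[grid_type]} argument(s)")
    return " ".join((grid_type, *parts))

def submit_description(executable: str, resource: str, extra=None) -> str:
    """Render a small grid universe submit description as text."""
    lines = [f"executable = {executable}",
             "universe = grid",
             f"grid_resource = {resource}"]
    for key, value in (extra or {}).items():
        lines.append(f"{key} = {value}")
    lines.append("queue")
    return "\n".join(lines) + "\n"
```

For example, `submit_description("foo", grid_resource("nordugrid", "foo.edu"), {"nordugrid_rsl": "(queue=long)"})` yields text matching the NorduGrid slide, plus the executable, universe, and queue lines.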
Grid Universe Fault-Tolerance: Credential Management
› Authentication in many grid protocols is done with limited-lifetime X509 proxies
› Proxy may expire before jobs finish executing
› Condor can put jobs on hold and email user to refresh proxy
› Condor can automatically retrieve new proxies from MyProxy
› When the proxy is refreshed, Condor forwards it to the jobs
MyProxy
› Submit file
    MyProxyHost = foo.edu:12345
    MyProxyServerDN = /DC=org/DC=doegrids…
    MyProxyCredentialName = proxy_file
    MyProxyRefreshThreshold = 240 #mins
    MyProxyNewProxyLifetime = 12 #hrs
    MyProxyPassword = password
› Or give password on command line
    condor_submit -p password submit.desc
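A minimal Python sketch of how a refresh threshold and a new-proxy lifetime could interact. The function name `refreshed_expiry` is hypothetical. The default units follow the slide's annotations (minutes and hours), which is an assumption about the real settings. No MyProxy server is contacted; the sketch only models the timing decision.

```python
from datetime import datetime, timedelta
from typing import Optional

def refreshed_expiry(expires_at: datetime,
                     now: datetime,
                     refresh_threshold: timedelta = timedelta(minutes=240),
                     new_lifetime: timedelta = timedelta(hours=12)) -> Optional[datetime]:
    """Return the expiry of a freshly fetched proxy, or None if no refresh is due.

    A refresh is due once the remaining lifetime drops to the threshold or below.
    """
    remaining = expires_at - now
    if remaining > refresh_threshold:
        return None  # Proxy still has enough lifetime left.
    return now + new_lifetime
```

With the slide's values, a proxy with five hours left is left alone, while one with three hours left would be replaced by a proxy valid for twelve more hours.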
Condor-G Matchmaking
› Use Condor-G matchmaking with grid universe jobs
› Allows Condor-G to dynamically assign computing jobs to grid sites
› An example of lazy planning
Condor-G Matchmaking, cont.
› Normally a grid universe job must specify the site in the submit description file via the “grid_resource” attribute like so:
    Executable = foo
    Universe = grid
    Grid_Resource = gt2 \
        beak.cs.wisc.edu/jobmanager-pbs
    queue
Condor-G Matchmaking, cont.
› With matchmaking, grid universe jobs can use requirements and rank:
    Executable = foo
    Universe = grid
    Grid_Resource = $$(ResourceName)
    Requirements = arch == LINUX
    Rank = NumberOfNodes * random()
    Queue
› The $$(x) syntax inserts information from the target ClassAd when a match is made.
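A toy Python model of the idea: filter resource ads by the job's requirements, pick the highest-ranked one, and fill in $$(x) references from the chosen ad. This illustrates the concept and is not Condor's matchmaker. The names `expand_target_refs` and `match_job` are hypothetical, and ads are plain dictionaries. Unlike the slide, the rank here has no random() factor, and only the job's side of the match is evaluated.

```python
import re
from typing import Callable, Dict, List, Optional

Ad = Dict[str, object]

def expand_target_refs(value: str, target_ad: Ad) -> str:
    """Replace each $$(Attr) in value with Attr's value from the matched ad."""
    return re.sub(r"\$\$\((\w+)\)",
                  lambda m: str(target_ad[m.group(1)]),
                  value)

def match_job(grid_resource_template: str,
              requirements: Callable[[Ad], bool],
              rank: Callable[[Ad], float],
              resource_ads: List[Ad]) -> Optional[str]:
    """Return the expanded grid_resource for the best match, or None."""
    candidates = [ad for ad in resource_ads if requirements(ad)]
    if not candidates:
        return None  # The job stays idle until a suitable ad appears.
    best = max(candidates, key=rank)
    return expand_target_refs(grid_resource_template, best)
```

Because the site is chosen only when a match is made, the job's grid_resource stays unresolved until then.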
Condor-G Matchmaking, cont.
› Where do these target ClassAds representing Globus gatekeepers come from? Several options:
  › Simple script on gatekeeper publishes an ad via condor_advertise command-line utility (method used by D0 JIM, USCMS)
  › Program to query Globus MDS and convert information into ClassAd (method used by EDG)
  › Run HawkEye with appropriate plugins on the gatekeeper
› For explanation of Condor-G matchmaking setup for USCMS, see http://www.cs.wisc.edu/condor/USCMS_matchmaking.html
Condor-G Matchmaking: Creating the Resource Ad
› Machine Ad
    MyType = "Machine"
    TargetType = "Job"
    Name = "foo.edu"
    Machine = "foo.edu"
    ResourceName = "gt4 foo.edu PBS"
    UpdateSequenceNumber = 4
    Requirements = TARGET.JobUniverse == 9 && \
        CurMatches < 10
    CurMatches = 0
    NumberOfNodes = 300
    Rank = 0.0
    CurrentRank = 0.0
    WantAdRevaluate = True
Condor-G Matchmaking: Creating the Resource Ad
› Advertising a resource
    condor_advertise UPDATE_STARTD_AD \
        ad-file
› Call periodically
› Use unix time for UpdateSequenceNumber
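A hedged Python sketch of a periodic advertiser along these lines: it stamps the ad with the current unix time as UpdateSequenceNumber, writes it to a temporary file, and runs the condor_advertise command from the slide. The function names, the injectable `runner` parameter, and the 300-second interval are assumptions made for illustration. The ad lines are expected to be already in ClassAd syntax.

```python
import os
import subprocess
import tempfile
import time

def render_ad(static_lines, now):
    """Return ad text with UpdateSequenceNumber set to the unix time `now`."""
    lines = [l for l in static_lines if not l.startswith("UpdateSequenceNumber")]
    lines.append(f"UpdateSequenceNumber = {int(now)}")
    return "\n".join(lines) + "\n"

def advertise(ad_text, runner=subprocess.run):
    """Write the ad to a temporary file and hand it to condor_advertise."""
    with tempfile.NamedTemporaryFile("w", suffix=".ad", delete=False) as f:
        f.write(ad_text)
        path = f.name
    try:
        runner(["condor_advertise", "UPDATE_STARTD_AD", path], check=True)
    finally:
        os.unlink(path)  # Clean up the temporary ad file.

def advertise_periodically(static_lines, interval_seconds=300, runner=subprocess.run):
    """Re-advertise forever; the interval is an arbitrary illustrative choice."""
    while True:
        advertise(render_ad(static_lines, time.time()), runner)
        time.sleep(interval_seconds)
```

Using the clock for the sequence number means each update carries a larger value than the last without the script having to keep any state between runs.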
But Wait, There’s More…
› What if you want to run standard universe jobs on grid resources?
  › For matchmaking and dynamic scheduling of jobs
  › For job checkpointing and migration
  › For remote system calls
› What if you don’t want to send a job to a site until the moment the job will start running (late binding)?
One Solution: Condor-G GlideIn
› You can use the Grid Universe to run Condor daemons on grid resources
› When the resources run these GlideIn jobs, they will temporarily join your Condor Pool
› You can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the grid resources
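[Diagram: your workstation runs a personal Condor holding 600 Condor jobs. It is connected to a Condor pool, a friendly Condor pool, and a Globus Grid fronting PBS, LSF, and Condor sites, to which glide-in jobs are sent.]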
GlideIn Concerns
› What if a grid resource kills my GlideIn job?
  › That resource will disappear from your pool and your jobs will be rescheduled on other machines
  › Standard universe jobs will resume from their last checkpoint like usual
› What if all my jobs are completed before a GlideIn job runs?
  › If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
Condor
[Diagram: condor_submit hands jobs to the schedd (job caretaker); the matchmaker pairs the schedd with a startd (runs job).]
Condor-G
[Diagram: condor_submit → schedd (job caretaker) → gridmanager → gahp → Globus gatekeeper → PBS or LSF.]
Condor-C
[Diagram: condor_submit → schedd (job caretaker) → gridmanager → condor-gahp → remote schedd, which works with a matchmaker and a startd.]
Condor-C to non-Condor
[Diagram: condor_submit → schedd (job caretaker) → gridmanager → condor-gahp → remote schedd → gridmanager → pbs/lsf-gahp → PBS or LSF.]
Gliding in Condor-C
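[Diagram: condor_submit feeds the local schedd (job caretaker) and its gridmanager. Step 1, Glide-in: a gahp goes through the Globus gatekeeper to start a remote schedd. Step 2, Submit jobs: the condor-gahp submits jobs to that schedd, whose gridmanager and pbs/lsf-gahp pass them to PBS or LSF.]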
Matchmaking with Condor-C
› In all of these examples, Condor-C went to a specific remote schedd
› This is not required: you can do matchmaking
Matchmaking with Condor-C
[Diagram: condor_submit → schedd (job caretaker) → gridmanager → condor-gahp. A matchmaker selects one of several remote schedds, and the job is submitted to it.]