Condor Project
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor
Eager, Lazy, and Just-in-Time Planning
Edinburgh Workshop
Oct 2003
Planning vs. Scheduling
› Can you control the resources? Yes? Scheduling. No? Planning.
› Planning is a ‘client’ operation.
The question of When
› Lots of open questions in planning.
› An important consideration: when the planning occurs.
[Timeline along a Time axis: Eager, Lazy, Just-in-Time]
Eager Example
› First Pass of EDG Resource Broker
[Diagram components: RB, DAGMan, Condor-G, Globus, Fabric, Site Scheduler]
Eager Condor-G Submit File
universe = globus
globussite = beak.cs.wisc.edu/jobmanager-lsf
executable = find_particle
arguments = …
output = …
log = …
EDG Resource Broker Gets Lazy…
› Addition of DAGMan callouts
› DAGMan is given a command (script) to run immediately before submission of a job to Condor-G (different from a PRE script on a node)
› The helper command is passed a copy of the job submit file when DAGMan is about to submit that node in the graph
› This allows changes to be made to the submit file (e.g. changing the globussite attribute) at the last minute
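A helper of this kind could be sketched in Python as follows. The slides do not say how DAGMan hands the submit file to the helper, so the in-memory string interface, the function name `rebind_site`, and the replacement site `example.site.edu/jobmanager-pbs` are all assumptions made for illustration.

```python
import re

def rebind_site(submit_text: str, new_site: str) -> str:
    """Return submit_text with its globussite attribute pointing at new_site.

    Hypothetical helper: only the globussite line is touched; every other
    line of the submit file is passed through unchanged.
    """
    pattern = re.compile(r"^([ \t]*globussite[ \t]*=[ \t]*).*$",
                         re.IGNORECASE | re.MULTILINE)
    # A function replacement avoids backslash surprises in new_site.
    rewritten, count = pattern.subn(lambda m: m.group(1) + new_site, submit_text)
    if count == 0:
        # No binding present yet: append one rather than fail.
        rewritten = submit_text.rstrip("\n") + f"\nglobussite = {new_site}\n"
    return rewritten

eager_submit = (
    "universe = globus\n"
    "globussite = beak.cs.wisc.edu/jobmanager-lsf\n"
    "executable = find_particle\n"
)
lazy_submit = rebind_site(eager_submit, "example.site.edu/jobmanager-pbs")
print(lazy_submit)
```

The site decision is made only when the node is about to be submitted; the rest of the submit file stays exactly as the eager plan wrote it.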
Eager Example
› First Pass of EDG Resource Broker
[Diagram components: RB, DAGMan, Condor-G, Globus, Fabric, Site Scheduler, callout]
Moving Condor-G to Just-In-Time
› Delay the binding of the task (job) to the resource until the resource is ready.
› Need to know when the resource is ready.
› One way: the unimplemented Globus 1.1 “queue wait time” estimate
  Not really just-in-time, because of lies, lies, lies…
› Another way… Condor-G Glidein Mechanism.
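The difference between binding early and binding when the resource is ready can be sketched in a few lines of Python. This is a toy model, not Condor-G code: the job and site names, the idea of a list of "ready" events, and both function names are assumptions made for illustration.

```python
from collections import deque

def eager_plan(tasks, sites):
    """Bind every task to a site up front, round-robin, before anything runs."""
    return {task: sites[i % len(sites)] for i, task in enumerate(tasks)}

def just_in_time_plan(tasks, ready_events):
    """Bind a task only when a site reports that it is ready to run one.

    ready_events is the sequence of site names in the order they became free.
    """
    pending = deque(tasks)
    bindings = {}
    for site in ready_events:
        if not pending:
            break
        bindings[pending.popleft()] = site
    return bindings

tasks = ["job0", "job1", "job2"]
# Eager: job1 is bound to siteB whether or not siteB ever becomes available.
print(eager_plan(tasks, ["siteA", "siteB"]))
# Just-in-time: only siteA ever reports ready, so it receives all the work.
print(just_in_time_plan(tasks, ["siteA", "siteA", "siteA"]))
```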
How It Works
[Diagram of Condor-G and a Globus Resource, built up over seven slides. Components appear in this order:]
› Schedd, LSF, Collector, and 600 Condor jobs
› GlideIn jobs
› GridManager
› JobManager
› Startd
› User Job
A Just-in-time Submit
executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
# job describes the "power"
rank = MFlops * 10000 + Memory
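How `requirements` and `rank` steer the match can be sketched in Python. This is a simplified model, not the Condor matchmaker: the machine ads below are invented, and real ClassAd evaluation (undefined values, the machine's own requirements and rank) is ignored.

```python
def requirements(target):
    # Mirrors the submit file: accept either architecture.
    return target["Arch"] in ("Intel/Linux", "Sparc/Solaris")

def rank(target):
    # Mirrors the submit file: MFlops dominates, Memory mostly breaks ties.
    return target["MFlops"] * 10000 + target["Memory"]

def best_match(machines):
    """Return the acceptable machine with the highest rank, or None."""
    candidates = [m for m in machines if requirements(m)]
    return max(candidates, key=rank, default=None)

# Hypothetical machine ads, as glided-in startds might advertise them.
machines = [
    {"Name": "a", "Arch": "Intel/Linux", "MFlops": 100, "Memory": 512},
    {"Name": "b", "Arch": "Sparc/Solaris", "MFlops": 120, "Memory": 256},
    {"Name": "c", "Arch": "Alpha/OSF1", "MFlops": 500, "Memory": 1024},
]
print(best_match(machines)["Name"])
```

Machine `c` is the fastest but fails `requirements`, so it is never ranked; the job says what it needs and prefers, and the choice falls out of whichever machines happen to be available.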
Another Just-in-time Submit
executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
rank = sam_data_overlap(MY.dataset, TARGET.sam_site_name) + (TARGET.Mflops / 100000)
+dataset = "search_space_id_0133313"
Lots of Tradeoffs…
› Just-in-Time
  Pro: Dynamic. Resources can come and go. Can take advantage of changing circumstances.
  Con: Coordination of multiple resources
› Eager
  Pro: Easier to coordinate multiple resources
  Con: Hard to scale… how to know about all the resources in advance?
  Con: Plan falls apart if assumptions change.
Some observations
› A complete separation of task from resource is difficult.
  Lots and lots of structured data required.
  But this separation is required in order to achieve Just-In-Time planning.
› Grid protocols that do not separate task from resource cannot realistically live on the grid.
  Virtualization can help.
Plan for failure
› Much effort on how to create a plan.
› How about a plan for when things fail?
Job Failure Policy Expressions
› Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file.
› Can be used to describe a successful run, or what to do in the face of failure.
on_exit_remove = <expression>
on_exit_hold = <expression>
periodic_remove = <expression>
periodic_hold = <expression>
Job Failure Policy Examples
› Do not remove from queue (i.e. reschedule) if exits with a signal:
  on_exit_remove = ExitBySignal == False
› Place on hold if exits with nonzero status or ran for less than an hour:
  on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
› Place on hold if job has spent more than 50% of its time suspended:
  periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
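The three policies can be checked by hand against a sample job in Python. This mirrors the expressions directly rather than using a ClassAd evaluator: the attribute values are invented, `ExitCode` is assumed to carry the exit status, and the other attribute names are taken from the slide as-is.

```python
def on_exit_remove(job):
    # Leave the queue only if the job was not killed by a signal.
    return job["ExitBySignal"] is False

def on_exit_hold(job):
    # Hold on a nonzero exit status, or on a run shorter than an hour.
    nonzero = job["ExitBySignal"] is False and job["ExitCode"] != 0
    too_short = (job["ServerStartTime"] - job["JobStartDate"]) < 3600
    return nonzero or too_short

def periodic_hold(job):
    # Hold if more than half of the wall-clock time was spent suspended.
    return job["CumulativeSuspensionTime"] > job["RemoteWallClockTime"] / 2.0

# Hypothetical job: exited on its own with status 1, two hours after starting,
# having been suspended for ten minutes along the way.
job = {
    "ExitBySignal": False,
    "ExitCode": 1,
    "JobStartDate": 0,
    "ServerStartTime": 7200,
    "CumulativeSuspensionTime": 600,
    "RemoteWallClockTime": 7200,
}
print(on_exit_remove(job), on_exit_hold(job), periodic_hold(job))
```

For this job the exit-time hold policy fires because of the nonzero status, while the periodic policy does not, since ten minutes of suspension is well under half of the two-hour run.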