Post on 01-Apr-2015
Workload management
Owen Maroney, Imperial College London
(with a little help from David Colling)
Contents
• Brief review of the WMS architecture used in LCG2.
• Future UK plans in WMS area.
WMS used in LCG2:
• EDG release 2(.1) architecture
• Slightly hardened and made more robust
This version nevertheless appears to be reliable and scalable to current levels of LCG-2
• Uses a (modified) BDII instead of R-GMA (gin/gout)
Strictly speaking this is a monitoring issue rather than a WMS issue.
Job submission now takes less time
WMS used in LCG2:
The description that follows was shown at GridPP7 and is mainly taken from an even earlier presentation by Massimo Sgaravatto. It is just a reminder; however, the basic architecture has not changed between then and LCG2.
[Diagram: WMS architecture. Components shown: UI, RB node, Network Server, Workload Manager, Job Controller / CondorG, Replica Catalog, Information Service, Computing Element (CE characteristics & status), Storage Element (SE characteristics & status)]
edg-job-submit myjob.jdl

myjob.jdl:
JobType = "Normal";
Executable = "$(CMS)/exe/sum.exe";
InputData = "LF:testbed0-00019";
InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.GlueHostOperatingSystemName == "linux" && other.GlueHostOperatingSystemRelease == "Red Hat 7.3" && other.GlueCEPolicyMaxWallClockTime > 10000;
Rank = other.GlueCEStateFreeCPUs;
Job status: submitted
UI: allows users to access the functionalities of the WMS
Job Description Language (JDL) to specify job characteristics and requirements
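As a rough illustration of how a JDL file like myjob.jdl could be produced programmatically, here is a minimal Python sketch. The `Expr` marker class and the `render_jdl` helper are hypothetical names introduced only for this example; they are not part of the EDG tools, and the sketch covers only quoted strings, string lists, and raw expressions.

```python
class Expr(str):
    """Hypothetical marker: a raw JDL expression that must not be quoted."""


def render_jdl(attrs):
    """Render a mapping of attribute names to values as JDL text."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, Expr):
            rendered = str(value)  # expressions are written verbatim
        elif isinstance(value, (list, tuple)):
            rendered = "{" + ", ".join(f'"{item}"' for item in value) + "}"
        else:
            rendered = f'"{value}"'  # plain strings are quoted
        lines.append(f"{name} = {rendered};")
    return "\n".join(lines)


job = {
    "JobType": "Normal",
    "Executable": "$(CMS)/exe/sum.exe",
    "OutputSandbox": ["sim.err", "test.out", "sim.log"],
    "Rank": Expr("other.GlueCEStateFreeCPUs"),
}
jdl_text = render_jdl(job)
print(jdl_text)
```

The resulting text could be written to a file and passed to edg-job-submit.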
[Diagram: WMS architecture as before; labels added: RB storage, Input Sandbox files, Job]
Job status: submitted, waiting
NS: network daemon responsible for accepting incoming requests
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Job]
Job status: submitted, waiting
WM: responsible for taking the appropriate actions to satisfy the request
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Match-maker]
Job status: submitted, waiting
Where must this job be executed?
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Match-Maker / Broker]
Job status: submitted, waiting
Matchmaker: responsible for finding the "best" CE to which to submit a job
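The matchmaking step can be sketched in a few lines of Python, assuming the Requirements and Rank expressions from the myjob.jdl example. The CE names and attribute values below are invented for illustration, and the two expressions are hand-translated into Python functions rather than evaluated by the real matchmaker.

```python
def requirements(ce):
    """Hand-translated Requirements expression from the example JDL."""
    return (
        ce["GlueHostOperatingSystemName"] == "linux"
        and ce["GlueHostOperatingSystemRelease"] == "Red Hat 7.3"
        and ce["GlueCEPolicyMaxWallClockTime"] > 10000
    )


def rank(ce):
    """Hand-translated Rank expression: prefer CEs with more free CPUs."""
    return ce["GlueCEStateFreeCPUs"]


def match(ces):
    """Return the best-ranked CE that satisfies the requirements, or None."""
    candidates = [ce for ce in ces if requirements(ce)]
    return max(candidates, key=rank) if candidates else None


# Hypothetical CE descriptions, as the information service might publish them.
ces = [
    {"id": "ce-a.example.org", "GlueHostOperatingSystemName": "linux",
     "GlueHostOperatingSystemRelease": "Red Hat 7.3",
     "GlueCEPolicyMaxWallClockTime": 20000, "GlueCEStateFreeCPUs": 4},
    {"id": "ce-b.example.org", "GlueHostOperatingSystemName": "linux",
     "GlueHostOperatingSystemRelease": "Red Hat 7.3",
     "GlueCEPolicyMaxWallClockTime": 50000, "GlueCEStateFreeCPUs": 12},
    {"id": "ce-c.example.org", "GlueHostOperatingSystemName": "linux",
     "GlueHostOperatingSystemRelease": "Red Hat 7.3",
     "GlueCEPolicyMaxWallClockTime": 5000, "GlueCEStateFreeCPUs": 40},
]
best = match(ces)
print(best["id"] if best else "no matching CE")
```

Here ce-c has the most free CPUs but fails the wall-clock requirement, so it is filtered out before ranking.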
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Match-Maker / Broker]
Job status: submitted, waiting
Where are the needed data (on which SEs)?
What is the status of the Grid?
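The first question above, where the data are, can be illustrated with a small Python sketch that narrows the candidate CEs to those near a replica of the input data. The catalog contents, host names, and the notion of a per-CE set of "close" SEs are assumptions made for this example, not a description of the actual Replica Catalog or Information Service interfaces.

```python
def ces_near_data(lfn, replica_catalog, close_ses):
    """Return the CEs that have at least one close SE holding a replica of lfn."""
    holders = replica_catalog.get(lfn, set())
    return sorted(ce for ce, ses in close_ses.items() if ses & holders)


# Hypothetical replica catalog: logical file name -> SEs holding a replica.
replica_catalog = {
    "LF:testbed0-00019": {"se1.example.org", "se3.example.org"},
}

# Hypothetical CE -> close SEs mapping, as an information service might publish.
close_ses = {
    "ce-a.example.org": {"se1.example.org"},
    "ce-b.example.org": {"se2.example.org"},
    "ce-c.example.org": {"se2.example.org", "se3.example.org"},
}

candidates = ces_near_data("LF:testbed0-00019", replica_catalog, close_ses)
print(candidates)
```

The surviving candidates would then go through the usual requirements check and ranking against the current Grid status.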
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Match-maker]
Job status: submitted, waiting
CE choice
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, JobAdapter]
Job status: submitted, waiting
JA: responsible for the final "touches" to the job before performing submission (e.g. creation of wrapper script, etc.)
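To give a feel for what "creation of wrapper script" might involve, the following Python sketch generates a small shell wrapper around the user executable. The `make_wrapper` function and the layout of the generated script are hypothetical; the real JobAdapter output is not shown in the slides and is certainly more involved.

```python
import shlex


def make_wrapper(executable, arguments=(), output_sandbox=()):
    """Build a hypothetical shell wrapper that runs the job and reports outputs."""
    command = " ".join(shlex.quote(part) for part in (executable, *arguments))
    lines = [
        "#!/bin/sh",
        "# Hypothetical wrapper: run the user job, then list the files to return.",
        command,
        "status=$?",
    ]
    for name in output_sandbox:
        lines.append(f"echo output file: {shlex.quote(name)}")
    lines.append("exit $status")
    return "\n".join(lines) + "\n"


script = make_wrapper("./sum.exe", ["input.dat"], ["sim.err", "test.out"])
print(script)
```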
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Job]
Job status: submitted, waiting, ready
JC: responsible for the actual job management operations (done via CondorG)
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Job, Input Sandbox files]
Job status: submitted, waiting, ready, scheduled
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Input Sandbox, Job, "Grid enabled" data transfers/accesses]
Job status: submitted, waiting, ready, scheduled, running
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Output Sandbox files]
Job status: submitted, waiting, ready, scheduled, running, done
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Output Sandbox]
Job status: submitted, waiting, ready, scheduled, running, done
edg-job-get-output <dg-job-id>
Job submission
[Diagram: WMS architecture as before; labels added: RB storage, Output Sandbox files]
Job status: submitted, waiting, ready, scheduled, running, done, cleared
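The sequence of job states built up over the preceding slides can be written down as a simple linear state machine. This Python sketch covers only the states shown in the slides; failure paths are not shown there and are left out here, and the helper names are introduced for illustration.

```python
from enum import Enum


class JobState(Enum):
    SUBMITTED = "submitted"
    WAITING = "waiting"
    READY = "ready"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    DONE = "done"
    CLEARED = "cleared"


# Enum members iterate in definition order, which is the order in the slides.
_ORDER = list(JobState)


def next_state(state):
    """Return the state that follows `state`, or None after the last one."""
    index = _ORDER.index(state)
    return _ORDER[index + 1] if index + 1 < len(_ORDER) else None


def is_valid_history(states):
    """Check that each state in the list is the direct successor of the previous one."""
    return all(next_state(a) is b for a, b in zip(states, states[1:]))


print(next_state(JobState.DONE).value)
```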
Job submission
[Diagram: UI, RB node, Network Server, Workload Manager, Job Controller / CondorG, Log Monitor, Logging & Bookkeeping, Computing Element; labels added: Log of job events, Job status]
LM: parses the CondorG log file (where CondorG logs information about jobs) and notifies LB
LB: receives and stores job events; processes the corresponding job status
edg-job-status <dg-job-id>
Logging and bookkeeping.
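The division of labour between LM and LB described above can be sketched as follows. The one-line-per-event log format ("job_id state") is a deliberately simplified, hypothetical stand-in for the real CondorG log file, and the `Bookkeeping` class and `monitor` function are illustrative names, not the actual service interfaces.

```python
from collections import defaultdict


class Bookkeeping:
    """LB role: stores job events and derives the current status from them."""

    def __init__(self):
        self._events = defaultdict(list)

    def log_event(self, job_id, state):
        self._events[job_id].append(state)

    def status(self, job_id):
        """Current status is taken to be the most recent event for the job."""
        events = self._events.get(job_id)
        return events[-1] if events else "unknown"


def monitor(log_lines, bookkeeping):
    """LM role: parse hypothetical 'job_id state' lines and notify the LB."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, state = line.split(maxsplit=1)
        bookkeeping.log_event(job_id, state)


lb = Bookkeeping()
monitor(["job-1 scheduled", "job-2 scheduled", "", "job-1 running"], lb)
print(lb.status("job-1"), lb.status("job-2"), lb.status("job-3"))
```

A status query such as edg-job-status would then correspond to a call like `lb.status(job_id)` in this toy model.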
Future UK plans
The WMS will change with ARDA (e.g. it will move to a pull rather than a push model for job distribution)
UK emphasis is going to be on testing scalability
Plan is:
• Instrument WMS code
• Build testbed (between Imperial HEP and LeSC) capable of simulating the load of the entire LCG
• Understand the characteristics of different sorts of (HEP) job and feed this into simulation.
Also plan:
• To examine and understand the performance of the WMS in operation.
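The difference between the push and pull models mentioned at the top of this slide can be illustrated with a toy Python sketch. Both functions and all names in it are assumptions made for the example: in the push version a broker assigns each job using its view of free CPUs, while in the pull version jobs wait in a central queue and are handed to whichever CE asks next.

```python
from collections import deque


def push_assign(jobs, free_cpus):
    """Push model: the broker chooses a CE for each job from published state."""
    free = dict(free_cpus)  # broker's copy of the published free-CPU counts
    assignment = {}
    for job in jobs:
        ce = max(free, key=free.get)  # CE that currently looks least loaded
        assignment[job] = ce
        free[ce] -= 1
    return assignment


def pull_assign(jobs, requests):
    """Pull model: jobs wait in a central queue; CEs ask for work when free."""
    queue = deque(jobs)
    assignment = {}
    for ce in requests:  # the order in which CEs ask for work
        if not queue:
            break
        assignment[queue.popleft()] = ce
    return assignment


jobs = ["j1", "j2", "j3"]
pushed = push_assign(jobs, {"ce-a": 2, "ce-b": 1})
pulled = pull_assign(jobs, ["ce-b", "ce-a", "ce-b"])
print(pushed)
print(pulled)
```

In the push sketch the quality of the assignment depends on how fresh the published state is; in the pull sketch it depends only on which CE actually asks for work.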
Future UK plans
Details of the testbed construction are still to be worked out; however, this effort will be integrated into the EGEE/LCG test plan.
This effort also neatly dovetails into the GridCC project (see talk at GridPP11?)