
Page 1: Workload management Owen Maroney, Imperial College London (with a little help from David Colling)

Workload management

Owen Maroney, Imperial College London

(with a little help from David Colling)

Page 2:

Contents

• Brief review of the WMS architecture used in LCG2.

• Future UK plans in the WMS area.

Page 3:

WMS used in LCG2:

• EDG release 2(.1) architecture

• Slightly hardened and made more robust

But appears to be reliable and scalable at current levels of LCG-2

• Uses (modified) BDII instead of R-GMA (gin/gout)

Strictly speaking this is a monitoring issue rather than a WMS issue.

Submitting jobs now takes less time.

Page 4:

WMS used in LCG2:

The description that follows was shown at GridPP7 and is mainly taken from an even earlier presentation by Massimo Sgaravatto. So this is just a reminder; however, there have been no changes to the basic architecture between then and LCG2.

Page 5:

[Diagram: WMS architecture. The UI talks to the RB node, which hosts the Network Server, Workload Manager, and Job Controller / CondorG; the Information Service publishes CE and SE characteristics & status; a Replica Catalog, Computing Element, and Storage Element complete the picture. Job status: submitted]

edg-job-submit myjob.jdl

myjob.jdl:

JobType = "Normal";
Executable = "$(CMS)/exe/sum.exe";
InputData = "LF:testbed0-00019";
InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.GlueHostOperatingSystemName == "linux" && other.GlueHostOperatingSystemRelease == "Red Hat 7.3" && other.GlueCEPolicyMaxWallClockTime > 10000;
Rank = other.GlueCEStateFreeCPUs;

UI: allows users to access the functionalities of the WMS.

Job Description Language (JDL) to specify job characteristics and requirements.
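The JDL above is a ClassAd-style attribute list. As an illustrative sketch only (not part of the original slides, and not how the WMS actually parses JDL), a job description could be modelled as a plain dictionary and checked for the attributes the broker needs; the validation rules here are assumptions:

```python
# Illustrative sketch: a minimal stand-in for a JDL job description.
# Attribute names follow the JDL example above; the checks are assumptions.

REQUIRED_ATTRS = {"Executable", "Requirements", "Rank"}

def validate_jdl(job: dict) -> list:
    """Return a list of problems found in a JDL-like job description."""
    problems = [f"missing attribute: {a}" for a in sorted(REQUIRED_ATTRS - job.keys())]
    # Sandboxes, when present, should be lists of file paths.
    for attr in ("InputSandbox", "OutputSandbox"):
        if attr in job and not isinstance(job[attr], list):
            problems.append(f"{attr} should be a list of files")
    return problems

myjob = {
    "JobType": "Normal",
    "Executable": "$(CMS)/exe/sum.exe",
    "InputData": "LF:testbed0-00019",
    "InputSandbox": ["/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"],
    "OutputSandbox": ["sim.err", "test.out", "sim.log"],
    "Requirements": 'other.GlueHostOperatingSystemName == "linux"',
    "Rank": "other.GlueCEStateFreeCPUs",
}
print(validate_jdl(myjob))  # → []
```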

Page 6:

[Diagram: as on Page 5, with the Input Sandbox files moving from the UI into RB storage. Job status: submitted → waiting]

NS: network daemon responsible for accepting incoming requests.

Job submission

Page 7:

[Diagram: as on Page 5. Job status: submitted → waiting]

WM: responsible for taking the appropriate actions to satisfy the request.

Job submission

Page 8:

[Diagram: as on Page 5. Job status: submitted → waiting]

Match-maker

Where must this job be executed?

Job submission

Page 9:

[Diagram: as on Page 5. Job status: submitted → waiting]

Match-Maker / Broker

Matchmaker: responsible for finding the "best" CE to which to submit a job.

Job submission
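The matchmaking step can be sketched as follows. This is an illustrative Python sketch, not the actual WMS broker code: the CE attribute names and sample records are hypothetical, though the requirements and rank mirror the JDL example on Page 5 (OS must be linux, max wall-clock time above 10000, rank by free CPUs):

```python
# Illustrative sketch of matchmaking: filter CEs by the job's Requirements,
# then pick the one with the highest Rank (here: most free CPUs).

def match_ce(ces, requirements, rank):
    """Return the best-ranked CE satisfying the requirements, or None."""
    candidates = [ce for ce in ces if requirements(ce)]
    return max(candidates, key=rank, default=None)

# Hypothetical CE status records, as the Information Service might publish them.
ces = [
    {"name": "ce01.example.ac.uk", "os": "linux", "free_cpus": 4, "max_wall_clock": 20000},
    {"name": "ce02.example.ac.uk", "os": "linux", "free_cpus": 12, "max_wall_clock": 20000},
    {"name": "ce03.example.ac.uk", "os": "linux", "free_cpus": 30, "max_wall_clock": 5000},
]

# Requirements and Rank mirror the JDL example on Page 5.
requirements = lambda ce: ce["os"] == "linux" and ce["max_wall_clock"] > 10000
rank = lambda ce: ce["free_cpus"]

best = match_ce(ces, requirements, rank)
print(best["name"])  # → ce02.example.ac.uk (ce03 fails the wall-clock requirement)
```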

Page 10:

[Diagram: as on Page 5. Job status: submitted → waiting]

Match-Maker / Broker

Where are the needed data (on which SEs)? What is the status of the Grid?

Job submission

Page 11:

[Diagram: as on Page 5. Job status: submitted → waiting]

Match-maker

CE choice

Job submission

Page 12:

[Diagram: as on Page 5. Job status: submitted → waiting]

JobAdapter

JA: responsible for the final "touches" to the job before performing submission (e.g. creation of wrapper script, etc.)

Job submission
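The Job Adapter's wrapper-script "touch" might look something like the following. This is a hypothetical sketch only, not the real WMS wrapper, which does considerably more (sandbox staging, environment setup, event logging):

```python
# Illustrative sketch: generating a minimal shell wrapper around a job executable.
# The staging comments mark where the real wrapper would do its work.

def make_wrapper(executable: str, args: str = "") -> str:
    """Return a shell wrapper that runs the job and preserves its exit code."""
    return "\n".join([
        "#!/bin/sh",
        "# (the real wrapper would stage in the Input Sandbox here)",
        f"{executable} {args}".rstrip(),
        "rc=$?",
        "# (the real wrapper would stage out the Output Sandbox here)",
        "exit $rc",
    ])

script = make_wrapper("./sum.exe")
print(script.splitlines()[0])  # → #!/bin/sh
```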

Page 13:

[Diagram: as on Page 5; the Job Controller passes the job to CondorG for submission to the CE. Job status: submitted → waiting → ready]

JC: responsible for the actual job management operations (done via CondorG)

Job submission

Page 14:

[Diagram: as on Page 5; the job and its Input Sandbox files are transferred to the Computing Element. Job status: submitted → waiting → ready → scheduled]

Job submission

Page 15:

[Diagram: as on Page 5; the job receives its Input Sandbox and performs "Grid enabled" data transfers/accesses against the Storage Element. Job status: submitted → waiting → ready → scheduled → running]

Job submission

Page 16:

[Diagram: as on Page 5; the Output Sandbox files are transferred back to RB storage. Job status: submitted → waiting → ready → scheduled → running → done]

Job submission

Page 17:

[Diagram: as on Page 5; the Output Sandbox is retrieved from RB storage to the UI. Job status: submitted → waiting → ready → scheduled → running → done]

edg-job-get-output <dg-job-id>

Job submission

Page 18:

[Diagram: as on Page 5; the Output Sandbox files are removed from RB storage. Job status: submitted → waiting → ready → scheduled → running → done → cleared]

Job submission

Page 19:

[Diagram: the Log Monitor and Logging & Bookkeeping services sit alongside the Network Server, Workload Manager, and Job Controller / CondorG on the RB node; a log of job events flows from CondorG to the LB, and the UI queries job status.]

LM: parses the CondorG log file (where CondorG logs info about jobs) and notifies the LB.

LB: receives and stores job events; processes the corresponding job status.

edg-job-status <dg-job-id>

Logging and bookkeeping.
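The job status progression shown across the preceding pages (submitted → waiting → ready → scheduled → running → done → cleared) is what the LB derives from the logged events. A toy sketch of that derivation, not the real LB logic (the event names and the "furthest state" rule are assumptions):

```python
# Illustrative sketch: deriving a job's current status from an event log,
# following the state sequence shown on Pages 5-18.

STATES = ["submitted", "waiting", "ready", "scheduled", "running", "done", "cleared"]

def current_status(events):
    """Return the furthest state reached, ignoring out-of-order or unknown events."""
    furthest = -1
    for ev in events:
        if ev in STATES:
            furthest = max(furthest, STATES.index(ev))
    return STATES[furthest] if furthest >= 0 else None

print(current_status(["submitted", "waiting", "ready", "scheduled", "running"]))
# → running
```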

Page 20:

Future UK plans

The WMS will change with ARDA (e.g. it will move to a pull rather than a push model for job distribution).

UK emphasis is going to be on testing scalability.

The plan is to:

Instrument the WMS code

Build a testbed (between Imperial HEP and LeSC) capable of simulating the load of the entire LCG

Understand the characteristics of different sorts of (HEP) job and feed this into the simulation.

Also planned: to examine and understand the performance of the WMS in operation.
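As a purely hypothetical illustration of what "instrument the WMS code" could mean in practice (nothing in the slides specifies the mechanism), timing hooks around a processing stage might look like:

```python
# Illustrative sketch: a timing decorator for instrumenting code paths,
# e.g. to measure how long each WMS stage takes under simulated load.
import time
from functools import wraps

timings = {}  # stage name -> list of elapsed times in seconds

def instrumented(stage):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings.setdefault(stage, []).append(time.perf_counter() - start)
        return wrapper
    return decorator

@instrumented("matchmaking")
def do_matchmaking():
    time.sleep(0.01)  # stand-in for real work

do_matchmaking()
print(len(timings["matchmaking"]))  # → 1
```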

Page 21:

Future UK plans

Details of the testbed construction are still to be worked out; however, this effort will be integrated into the EGEE/LCG test plan.

This effort also neatly dovetails with the GridCC project (see talk at GridPP11?)