Workload Management Status of current activity GridPP 13, Durham, 6th July 2005

Upload: ian-hall

Post on 28-Mar-2015


TRANSCRIPT

Page 1

Workload Management

Status of current activity

GridPP 13, Durham, 6th July 2005

Page 2

5th July 2005 Workload Management

David Colling, Imperial College London

Activity…

• Scalability testing

• Analysis of current middleware performance

• SGE integration

• GridCC

Page 3

Scalability Testing

People involved:

Janusz Martyniak, Luke Dickens, Barry MacEvoy, Steve McGough, David Colling

Page 4

Scalability Testing

Why…

• From EDG we knew that it was easy to build a system capable of running 5 jobs concurrently.

• Not so easy to build one capable of running 500 or 5000 jobs concurrently.

• The plan was to perform testing to find software bottlenecks and hot spots.

• Feed the results back to the developers in a “virtuous circle”.

Page 5

Scalability Testing

The methodology …

• Original plan was to build a testbed across 2 sites (Imperial HEP and LeSC). This was deliverable X.Y

• Take an “engineering” approach, i.e. submit tests to the testbed and monitor how the different components respond.

• Metrics to be tested were to evolve in complexity as the stability grew.

Page 6

Scalability Testing

What happened…

• Decided to join the JRA1 testbed instead of forming our own. This gave us better access to the developers and much support on other parts of the system that we were not directly testing but which are needed to run the tests, e.g. VOMS, R-GMA. It also made a contribution to the wider community. This decision has been praised by Bob Jones and Frederic Hemmer.

• Still decided to run two sites (as per the deliverable), as this gave a better testing environment for scalability tests.

Page 7

Scalability Testing

What happened…

• We were delayed by the late release of the WMS in EGEE.

• However, we have had two sites in JRA1 testing since immediately after the Athens meeting. The two sites are maintained by JM and LD and consist of:

Site 1
– Machines: 1 WMS, 1 CE, 2 WNs, 1 RGMA Server, 1 IO Server, 1 UI
– Install: Manual
– Config: Site (mostly)
– Version: R1.1

Site 2
– Machines: 1 WMS, 2 CEs (+1), 2 WNs (+1)
– Install: apt
– Config: Site
– Version: R1.1 (+ QF7&8)

Page 8

Scalability Testing

To add to these sites…

• SEs

• VOMS

• Second RGMA server (to complete split)

Page 9

Scalability Testing

Actual testing…

• Only really started writing scalability tests a couple of weeks ago

• Have defined some basic metrics

– Time to submit as a function of number of jobs for serial submission

– Time to submit for parallel submission

– Failure rates as a function of active jobs

– etc.

• Use LB database and system monitoring on WMS node to reconstruct what is going on
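The two submission-time metrics could be collected with a small timing harness along the following lines. This is only a sketch under stated assumptions: `submit_job` is a hypothetical stand-in for whatever command actually submits a job to the WMS, and driving parallel submission from a thread pool is an illustrative choice, not the method used in these tests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def submit_job(job_id):
    """Hypothetical placeholder: replace with the real submission call."""
    time.sleep(0.001)  # pretend the submission takes a moment
    return job_id


def time_serial(n_jobs, submit=submit_job):
    """Submit n_jobs one after another.

    Returns the elapsed time (seconds since the start) at which each
    submission returned, so entry k is the time to submit k + 1 jobs.
    """
    start = time.perf_counter()
    elapsed = []
    for job_id in range(n_jobs):
        submit(job_id)
        elapsed.append(time.perf_counter() - start)
    return elapsed


def time_parallel(n_jobs, workers, submit=submit_job):
    """Submit n_jobs through a pool of worker threads; return total seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(submit, range(n_jobs)))
    return time.perf_counter() - start


if __name__ == "__main__":
    serial = time_serial(20)
    print(f"serial, 20 jobs: {serial[-1]:.3f} s")
    print(f"parallel, 20 jobs, 4 workers: {time_parallel(20, 4):.3f} s")
```

Plotting the list returned by `time_serial` against job index gives time to submit as a function of the number of jobs.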

Page 10

Scalability Testing

So, 100 simple jobs submitted sequentially…

• Result preliminary

• Example of what we are trying to do

• Bypassed known problems, especially cross matching

• Summary…

Page 11

Scalability Testing

Summary…

• 28 Success

• 53 Proxy expired (12 hours after the jobs were submitted!)

• 3 Aborted due to reaching retry count

• 16 Ready state

In this sample the greatest source of failure is CondorC.
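As a quick check on these figures, the four counts can be tallied and turned into fractions of the sample. The sketch below uses only the numbers quoted on this slide; the dictionary keys are shortened labels chosen for illustration.

```python
from collections import Counter

# Final states of the 100 sequentially submitted jobs, as quoted on the slide.
outcomes = Counter({
    "Success": 28,
    "Proxy expired": 53,
    "Aborted (retry count reached)": 3,
    "Ready": 16,
})


def fractions(counts):
    """Return each outcome as a fraction of all jobs in the sample."""
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


if __name__ == "__main__":
    for state, frac in fractions(outcomes).items():
        print(f"{state:32s} {frac:6.1%}")
```

The counts sum to the 100 jobs submitted, so each count is also its percentage.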

Page 12

Scalability testing

Time to RegJob

[Figure: histogram of number of entries against seconds from start, for 100 jobs submitted sequentially. Annotation: all registered in 3 minutes.]

Page 13

Scalability Testing

Time to EnQueued

[Figure: histogram of number of entries (log scale, 1 to 1000) against time in seconds, binned from 0 to 70000 plus "More". Annotations: greatest number <5000 s (Excel binning); long tail of retries.]

Page 14

Scalability Testing

Activity on WMS during submission

[Figure: user CPU usage (0% to 120%) against time, 21:07 to 22:33, for 100 jobs submitted sequentially. Annotations: 5 minutes; still activity 1 hour later.]

Can plot for individual processes or groups of processes.

Page 15

Scalability Testing

Future Plans…

• Automate testing scripts

• Output directed to web-pages

• Expand metrics as appropriate

Page 16

Performance of middleware

We have access to the job data through the LB databases, so why not have a look?

• People involved: Gidon Moont and David Colling

Page 17

Performance of middleware

Long tail

Page 18

Performance of middleware

[Figures: histograms of Efficiency and RunTime (s); vertical axis: number of entries.]

Page 19

Performance of middleware

Future plans…

• Keep monitoring this across different releases

• Low level activity

• Feedback into JRA2

Page 20

SGE Porting

People involved: David McBride, Mona Aggarwal and Owen Maroney

Page 21

SGE Porting

LCG Integration with Sun Grid Engine (SGE)

• Wish to add LCG as an additional entry point for our existing SGE cluster

• Problem: LCG installation assumes the use of PBS as the cluster management system.

• Solution: replace PBS-specific components with SGE-specific components.

Page 22

SGE Porting

PBS-specific components in LCG (that need replacing)

• Globus JobManager

– Already have an existing alternative Globus JobManager for Sun Grid Engine to replace the lcgpbs version.

– Implemented in Perl, well understood.

– Supports 5.x, 6.x revisions of SGE.

– Currently installed, about to enter the first run of testing as part of an LCG CE installation.

Page 23

SGE Porting

PBS-specific components in LCG (that need replacing)

• Information Reporter

– Have developed first-pass attempt at an SGE information reporter.

– Again, developed in Perl, small, relatively straightforward. (Existing PBS code wasn't very clear, but GLUE Schema is public.)

– Installed on site CE, about to enter first run of validation and iterative improvement.
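To give a flavour of what an information reporter has to do, the sketch below summarises a list of batch-system job states into three GLUE CE state attributes. It is not the Perl reporter described above: the input is assumed to be a plain list of SGE state codes ("r" for running, "qw" for queued and waiting) that some other step has already extracted, and only the three attributes shown are produced.

```python
def glue_ce_state(job_states):
    """Summarise job states as GLUE CE state attribute lines.

    job_states: iterable of SGE state codes, e.g. "r" or "qw".
    Any other state code is counted only in the total.
    """
    states = list(job_states)
    running = sum(1 for s in states if s == "r")
    waiting = sum(1 for s in states if s == "qw")
    return [
        f"GlueCEStateRunningJobs: {running}",
        f"GlueCEStateWaitingJobs: {waiting}",
        f"GlueCEStateTotalJobs: {len(states)}",
    ]


if __name__ == "__main__":
    # Illustrative input only: three running jobs and two queued ones.
    for line in glue_ce_state(["r", "r", "qw", "r", "qw"]):
        print(line)
```

A real reporter would publish many more attributes and would need to decide how held or suspended jobs are counted.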

Page 24

SGE Porting

PBS-specific components in LCG (that need replacing)

• Accounting (APEL)

– APEL: Accounting using PBS Event Logs.

– SGE does have advanced accounting records, but they are not stored in the same format as PBS!

– Existing Java-based tooling seems large and complex for what should be a fairly straightforward task; not obvious where changes could/should be made.

– Refactored version exists in gLite, but would still require a new implementation of an SGE-specific backend.

– Using an updated gLite revision on site may well work, but would introduce manageability issues at upgrade time.

– Currently wondering whether APEL can simply be replaced with a small Perl script(!) Currently looking for documentation on the APEL/R-GMA reporting interface.
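The SGE side of such a replacement is mostly a matter of reading the accounting file, which holds one colon-separated record per finished job. The sketch below parses the leading fields of a record. The field order is an assumption taken from the SGE accounting(5) manual page and should be checked against the installed version; the sample record is invented for illustration, and nothing here covers the APEL/R-GMA reporting side.

```python
# Leading fields of an SGE accounting record, in file order
# (assumed from accounting(5); verify against the installed SGE version).
FIELDS = [
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
]
INT_FIELDS = ("job_number", "submission_time", "start_time", "end_time",
              "failed", "exit_status")


def parse_record(line):
    """Parse one accounting line into a dict, or return None for comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    values = line.split(":")
    if len(values) < len(FIELDS):
        raise ValueError("accounting record has too few fields")
    record = dict(zip(FIELDS, values))  # trailing fields are ignored
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["ru_wallclock"] = float(record["ru_wallclock"])
    return record


if __name__ == "__main__":
    # Invented sample record; times are seconds since the epoch.
    sample = ("all.q:node01:hep:alice:test.sh:42:sge:0:"
              "1120600000:1120600010:1120600070:0:0:60:0.5:0.1")
    rec = parse_record(sample)
    print(rec["owner"], rec["job_number"], rec["ru_wallclock"])
```

This simple split assumes no field contains a colon, which a production script would have to guard against.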

Page 25

SGE Porting

Community of Interest formed

• Code available from: http://www.lesc.ic.ac.uk/projects/SGE-LCG.html

• Mailing list

[email protected]

Page 26

GridCC

People involved: Marko Krznaric, Janusz Martyniak, Luke Dickens, John Darlington, Steve McGough, David McBride and David Colling

+ Tiziana & Costas

Page 27

GridCC

There was a lot about GridCC at GridPP12, so only a brief update here.

• Discussions between GridCC and EGEE (Bob Jones and Frederic Hemmer)

• Agreed to collaborate (e.g. use EGEE CVS); GridCC relies on EGEE

• First release September this year

• Review October this year

Page 28

GridCC

[Figure: bits in red come from UK WMS activity.]

Page 29

Summary

• Activity in 4 areas

– testing

– analysis

– SGE port

– GridCC