
Page 1: Accounting in LCG

Dave Kant
CCLRC, e-Science Centre

Page 2: APEL in LCG/EGEE

1. Quick Overview

2. The current state of play

3. Integration with OSG

4. Accounting in gLite

Page 3: Overview

• Data Collection via Sensors

• Transportation via RGMA

• High level Aggregation and Reporting via Graphical Front-end

[Diagram: data flows from site sensors through aggregation to high-level reporting: tables, pies, Gantt charts, metrics, tree views]

Page 4: Component View of APEL

Sensors (deployed at each site):
• Process log files; map grid DN to local batch usage
• Build accounting records: DN, CPU, WCT, SpecInt2000, etc.
• Account for grid usage (jobs) only
• Support PBS, Sun Grid Engine, Condor, and LSF
• Not real-time accounting
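As a rough illustration of the record-building step, the sketch below joins a DN-to-job-id mapping with a PBS accounting line to produce one record. It is a minimal sketch, not APEL's actual code: the field names, the SITE_SI2000 constant, and the simplified PBS line handling are all assumptions.

```python
# Hypothetical sensor sketch: join a grid DN -> local batch job id
# mapping (recovered from gatekeeper logs) with one PBS accounting
# line to build a single record. Names are illustrative, not APEL's
# real schema.

SITE_SI2000 = 1000  # SpecInt2000 rating published by the site


def to_seconds(hms):
    """Convert a PBS 'HH:MM:SS' time to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s


def build_record(dn_map, pbs_line):
    """dn_map: {local job id: grid DN}; pbs_line: one PBS 'E' record."""
    job_id = pbs_line.split(";")[2]  # PBS layout: date;type;jobid;key=value ...
    fields = dict(kv.split("=", 1) for kv in pbs_line.split() if "=" in kv)
    return {
        "DN": dn_map.get(job_id),  # None => a local, non-grid job
        "CPU": to_seconds(fields["resources_used.cput"]),
        "WCT": to_seconds(fields["resources_used.walltime"]),
        "SpecInt2000": SITE_SI2000,
    }
```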

Data transport:
• Uses RGMA to send data to a central repository
• 196 sites publishing; 7.7 million job records collected
• Could use other transport protocols
• Allows sites to control the export of DN information

Presentation (GOC and regional portals):
• LHC View, EGEE View, GridPP View, Site View
• Reporting based on data aggregation
• Metrics (e.g. time-integrated CPU usage)
• Tables, pies, Gantt charts

Page 5: Demos of Accounting Aggregation

Global views of CPU resource consumption.

• LHC View: http://goc.grid-support.ac.uk/gridsite/accounting/tree/treeview.php
Shows aggregation for each LHC VO. Requirements driven by the RRB. Tier-1 sites and countries are the entry points; LHC VOs only. All data normalised in units of 1000·SI2000·hour.

• GridPP View: http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php
Shows aggregation for an organisation at Tier-1/Tier-2 level.

• EGEE View (new!): http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
Regional views and detailed site-level reporting. Active development by CESGA/RAL (Pablo Rey Mayo, Javier Lopez, Dave Kant).

Page 6: VOs/LCG/EGEE Requirements

• One-line summary: “How much was done, and who did it?”

• High-level anonymous reporting: how much resource has been provided to each VO. Aggregation across VOs, countries, regions, grids, and organisations. Time-frame granularity: weekly, quarterly, annually.

• Finer granularity at the user level: if 10,000 CPU hours were consumed by the ATLAS VO, who are the users that submitted the work?

• Data privacy laws: a grid “DN” is personal information which could be used to target an individual. Who has access to this data, and how do you get it?

Page 7: APEL Developments

• Extending batch-system support (testing phase): support for Condor and SGE. Both are being tested: SGE by CESGA and Condor by GridPP. Unofficial releases are available on the APEL home page:
http://goc.grid-support.ac.uk/gridsite/accounting/sge.html
http://goc.grid-support.ac.uk/gridsite/accounting/condor-prelim.html

• Gap Publisher (testing phase): provide sites with better tools to identify missing data and publish it into the archiver. The reporting system uses Gantt charts to identify gaps, and enhancements to the publisher module are being tested; a sketch of the gap-finding idea follows below.
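A minimal sketch of the gap-finding idea behind the Gantt reports: given the set of days for which a site has published records, list the missing date ranges. This is illustrative only, not the reporting system's actual algorithm.

```python
from datetime import date, timedelta


def find_gaps(published_days, start, end):
    """Return (gap_start, gap_end) date ranges with no published records.

    published_days: set of datetime.date objects the site has data for.
    """
    gaps, gap_start = [], None
    day = start
    while day <= end:
        if day not in published_days:
            if gap_start is None:
                gap_start = day
        elif gap_start is not None:
            gaps.append((gap_start, day - timedelta(days=1)))
            gap_start = None
        day += timedelta(days=1)
    if gap_start is not None:
        gaps.append((gap_start, end))
    return gaps


# Example: data missing for Jan 2-3 shows up as one gap to re-publish.
days = {date(2006, 1, 1), date(2006, 1, 4)}
assert find_gaps(days, date(2006, 1, 1), date(2006, 1, 4)) == [
    (date(2006, 1, 2), date(2006, 1, 3))
]
```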

Page 8: APEL Issues (1)

• Normalisation (under investigation, CESGA/RAL): recall that, in order to account for usage across heterogeneous compute farms, data are scaled to a common reference; in LCG the reference scale is 1K.SI2000. A job record's scale factor is SI2000_published_by_site / reference. Some sites have a large number of job records where the site SI2000 is zero; identify those sites via the reporting tools and provide a recipe to fix them. A worked example is shown below.
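To make the scaling concrete, here is a minimal worked example of the normalisation above; the function name is invented for illustration:

```python
REFERENCE_SI2000 = 1000  # LCG reference scale: 1K.SI2000


def normalised_cpu(cpu_hours, site_si2000):
    """Scale raw CPU hours to units of 1000.SI2000.hours."""
    return cpu_hours * site_si2000 / REFERENCE_SI2000


# 500 CPU hours on a farm published at 1500 SI2000:
assert normalised_cpu(500, 1500) == 750.0
# A site that publishes SI2000 = 0 silently drops out of the totals,
# which is exactly the problem described above:
assert normalised_cpu(500, 0) == 0.0
```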

• APEL memory usage (important, will become urgent): site databases are growing ever larger, and APEL requires more and more memory to join records (a full build at the RAL Tier-1 requires 2 GB of RAM). Implement a scheme to reduce the number of redundant records used in the join: flag rows used in a successful build and delete them, as they are no longer needed (sketched below).
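One way the flag-and-delete scheme could be realised is sketched here. The table and column names are invented for illustration and sqlite3 stands in for the site's database; this is not APEL's real schema.

```python
import sqlite3  # stand-in for the site database


def prune_joined_rows(conn):
    """Flag source rows that fed a successful build, then delete them."""
    cur = conn.cursor()
    # 1. Flag every event row whose job already appears in a built
    #    accounting record (hypothetical table/column names).
    cur.execute("""
        UPDATE EventRecords SET Joined = 1
        WHERE LocalJobId IN (SELECT LocalJobId FROM LcgRecords)
    """)
    # 2. Delete the flagged rows: they can never contribute to a new
    #    record, so the next full build joins far fewer rows and
    #    needs correspondingly less memory.
    cur.execute("DELETE FROM EventRecords WHERE Joined = 1")
    conn.commit()
```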

• DN accounting? Should APEL account for local usage as well as grid usage? BNL recently sent us data that included both grid and local usage.

Page 9: APEL Issues (2)

• Handling large log files (under investigation): Condor history and SGE batch logs are very large (> 1 GB). Large logs are problematic: reading and storing records inline takes a large amount of memory, and application run time grows. We also don't want to re-read data that was parsed on a previous run (efficiency). Options: develop an efficient way to parse these logs, ask the batch-system providers to support log rotation, or provide a recipe to site admins. Recipes to site admins only half-work, as events are lost: event data is split over multiple lines. A sketch of the offset-tracking idea follows below.
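A common pattern for the re-reading problem is to remember the byte offset reached on the previous run and seek straight past it, so a multi-gigabyte history file is only ever read once. The sketch below is an assumption about how this could be done, not APEL's implementation, and it does not handle the multi-line event records mentioned above.

```python
import os


def read_new_lines(log_path, state_path):
    """Return only the lines appended to log_path since the last run.

    The byte offset reached is persisted in state_path. If the log
    shrank (rotated or truncated), parsing restarts from the start.
    """
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = int(f.read() or 0)
    if offset > os.path.getsize(log_path):
        offset = 0  # log was rotated/truncated since the last run
    with open(log_path) as log:
        log.seek(offset)
        lines = log.readlines()
        offset = log.tell()
    with open(state_path, "w") as f:
        f.write(str(offset))
    return lines
```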

• RGMA queries to the central repository: query response time is very slow, which prevents some sites from checking that the continuous consumers are actually listening for data. We would need to archive data from the central repository to another database in order to speed up such queries. Not an issue for the reporting front-end, and it does not appear to be something that sites urgently need (requested by IN2P3-CC).

Page 10: Integration with Open Science Grid

• A few OSG sites have deployed a minimal LCG front-end to publish accounting data into the APEL database (GOCDB registration + APEL sensors + RGMA MON node). Successful deployment at Indiana University (PBS and Condor data published).

• Due to (subtle) differences in the grid middleware, APEL's core library must be modified to build accounting records in the OSG environment. LCG: DN to local batch jobId mappings are encoded within three log files (LCG job manager). OSG: DN to local batch jobId mappings are in a single log file (Globus job manager?).

• Main issues under consideration: there are currently THREE versions of the APEL core library, each sharing common batch-system plugins: the LCG production release, gLite 3 development, and OSG development.
• Refactor the core library to create a new plugin (LCG/gLite/OSG)? A more sensible approach would be to use a *common* accounting file in BOTH gLite and OSG to provide the grid DN to local batch jobId mapping.
• A common agreement on log rotation is also needed: prefer logname-YYYYMMDD.gz (a static file) to logname-1.gz (not static).

• Very much in the early stages; we need some common agreements and a better understanding of the OSG middleware before proceeding.

Page 11: Accounting in gLite 3

• In gLite, the BLAH daemon (provided by Condor) is used to mediate jobs between the WMS and the compute element.

• Consequently, the accounting information needed by APEL is no longer in the gatekeeper logs but is found elsewhere, e.g. in the local user's home directory.

• An accounting mapping file has been proposed by DGAS and implemented by the gLite middleware developers to simplify the process of building accounting records:
• Maps grid-related information to the local job ID
• Independent of the submission procedure (WMS or not)
• No services or clients required on the WN
• Format (one line per job, daily log rotation):
timestamp=<submission time to LRMS> userDN=<user's DN> userFQAN=<user's FQAN> ceID=<CE ID> jobID=<grid job ID> lrmsID=<LRMS job ID> localUser=<uid>
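Assuming the file is a whitespace-separated list of key=value pairs, with values quoted where they contain spaces (the slide does not specify the quoting convention, so that part is an assumption), a record parser might look like this:

```python
import shlex


def parse_mapping_line(line):
    """Parse one job record from the accounting mapping file.

    partition('=') splits at the FIRST '=', so values such as DNs
    that themselves contain '=' survive intact.
    """
    record = {}
    for token in shlex.split(line):
        key, _, value = token.partition("=")
        record[key] = value
    return record


# Hypothetical example line (all values invented):
line = ('timestamp="2006-04-03 12:00:00" '
        '"userDN=/C=UK/O=eScience/CN=Some User" '
        'ceID=ce.example.org:2119/jobmanager-lcgpbs-atlas '
        'jobID=https://rb.example.org:9000/abc123 '
        'lrmsID=12345.pbs.example.org localUser=501')
rec = parse_mapping_line(line)
assert rec["localUser"] == "501"
assert rec["userDN"].endswith("Some User")
```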

• Already implemented for BLAH (and CREAM); work in progress for LCG
• Did not make it into gLite 3.0: no accounting for the gLite CE
• APEL development to begin in April (D. Kant)
• Development and testing expected to take most of April

Page 12: DGAS

• DGAS meets some requirements for privacy of user identity: user job info is only readable by the user, the site manager, and the VO manager

• DGAS cannot aggregate information across the whole grid

• Solution 1: DGAS sensors also publish anonymous data to the central APEL repository; user details remain available in the DGAS HLR for the VO

• Solution 2: a higher-level repository that all HLRs can publish into, e.g. the GGF Resource Usage Service (RHUL is working on an implementation)

• BUT DGAS is not in gLite 3.0

Page 13: Summary

• We have a working accounting system

• but work is still required to keep it working and to meet (conflicting?) outstanding requirements for:
• Privacy
• User information