J Grid Computing (2010) 8:323–339, DOI 10.1007/s10723-010-9148-x

Experiment Dashboard for Monitoring Computing Activities of the LHC Virtual Organizations

Julia Andreeva · Max Boehm · Benjamin Gaidioz · Edward Karavakis · Lukasz Kokoszkiewicz · Elisa Lanciotti · Gerhild Maier · William Ollivier · Ricardo Rocha · Pablo Saiz · Irina Sidorova

Received: 24 August 2009 / Accepted: 19 February 2010 / Published online: 28 April 2010
© Springer Science+Business Media B.V. 2010

Abstract The Large Hadron Collider (LHC) is preparing for data taking at the end of 2009. The Worldwide LHC Computing Grid (WLCG) provides data storage and computational resources for the high energy physics community. Operating the heterogeneous WLCG infrastructure, which integrates 140 computing centers in 33 countries all over the world, is a complicated task. Reliable monitoring is one of the crucial components of the WLCG for providing the functionality and performance that is required by the LHC experiments. The Experiment Dashboard system provides monitoring of the WLCG infrastructure from the perspective of the LHC experiments and covers the complete range of their computing activities. This work describes the architecture of the Experiment Dashboard system and its main monitoring applications and summarizes current experiences by the LHC experiments, in particular during service challenges performed on the WLCG over the last years.

J. Andreeva (B) · B. Gaidioz · E. Karavakis · L. Kokoszkiewicz · E. Lanciotti · G. Maier · W. Ollivier · R. Rocha · P. Saiz · I. Sidorova
CERN, European Organization for Nuclear Research, 1211 Geneva 23, Switzerland
e-mail: [email protected]

M. Boehm
Hewlett-Packard, 65428 Rüsselsheim, Germany

E. Karavakis
School of Engineering and Design, Brunel University, Uxbridge, UB8 3PH, UK

Keywords WLCG · Distributed infrastructure · LHC experiments · Monitoring

1 Introduction

Preparation is under way to restart the LHC [1]. The experiments at the LHC are all run by international collaborations, bringing together scientists from institutes all over the world. Each experiment is distinct, characterized by its unique particle detector. The two large experiments, ATLAS [2] and CMS [3], are based on general-purpose detectors designed to investigate the largest range of physics possible. Two medium-size experiments, ALICE [4] and LHCb [5], have specialized detectors for analyzing the LHC collisions in relation to specific phenomena. User communities of the LHC experiments are organized in virtual organizations (VOs).

The LHC is estimated to produce about 15 petabytes of data per year. This data has to be distributed to computing centers all over the world, with a primary copy being stored on tape at CERN [6]. Seamless access to the LHC data has to be provided to about 5,000 physicists from 500 scientific institutions. The scale and complexity of the task briefly described above require complex computing solutions. A distributed, tiered computing model was chosen by the LHC experiments for the implementation of the LHC data processing task. The LHC experiments use the WLCG [7] distributed infrastructure for their computing activities. In order to monitor the computing activities of the LHC experiments, several specific monitoring systems were developed. Most of them are coupled with the data-management and workload-management systems of the LHC VOs, for example PhEDEx [8], Dirac [9], PanDA [10] and AliEn [11]. In addition, a generic monitoring framework was developed for the LHC experiments—the Experiment Dashboard. If the source of the monitoring data is not VO-specific, the Experiment Dashboard monitoring applications can be shared by several VOs. Otherwise, the Experiment Dashboard offers experiment-specific monitoring solutions for the scope of a single experiment.

To demonstrate readiness for the LHC data taking, several computing challenges were run on the WLCG infrastructure over the last years. The latest one, Scale Testing for the Experiment Programme '09 (STEP09) [12], took place in June 2009. The goal of STEP09 was the demonstration of the full LHC workflow from data taking to user analysis. The analysis of the results of STEP09 and of the earlier WLCG computing challenges proved the key role of the experiment-specific monitoring systems, including the Experiment Dashboard, in operating the WLCG infrastructure and in monitoring the computing activities of the LHC experiments.

The Experiment Dashboard makes it possible to estimate the quality of the infrastructure and to detect any problems or inefficiencies. Furthermore, it provides the information necessary to conclude whether the LHC computing tasks are accomplished.

The WLCG infrastructure is heterogeneous and combines several middleware flavors: gLite [13], OSG [14] and ARC [15]. The Experiment Dashboard project works transparently across all these different Grid flavors.

The main computing activities of the LHC VOs are data distribution, job processing, and site commissioning. The Experiment Dashboard provides monitoring covering the various computing activities mentioned above. In particular, site commissioning aims to improve the quality of every individual site, thereby improving the overall quality of the WLCG infrastructure.

The Experiment Dashboard is intensively used by the LHC community. Taking the CMS Dashboard server as an example, according to the Web statistics tool [16], the monthly number of unique visitors increased threefold during 2009, from 1,300 at the beginning of the year to 3,900 in autumn. On average, 50,000 pages are viewed daily. The users of the system fall into various roles: managers and coordinators of the experiment computing projects, site administrators, and LHC physicists running their analysis tasks on the Grid.

Section 2 briefly describes related work. The architecture of the Experiment Dashboard system and implementation details of the Experiment Dashboard framework are described in Section 3. Sections 4–6 give an overview of the main monitoring applications of the Experiment Dashboard system. The applications are discussed in the following order: data-management monitoring applications, applications for job monitoring, and applications for the site monitoring and commissioning activity.

Section 7 covers the recent developments, which focus on high-level cross-VO views of the computing activities of the LHC experiments, both at the level of a single site and at a global level.

Close collaboration between the developers of the Experiment Dashboard and the user communities is described in Section 8.

In the conclusions we summarize the role of the Experiment Dashboard monitoring system and the current experience of the LHC community in using it.

2 Related Work

One approach for monitoring Grid infrastructures, used by systems such as MonALISA [24] and GridICE [34], is to aggregate and display data from the local fabric monitoring of the distributed clusters. These systems provide a system-level view of the Grid resources. Monitoring systems such as GridView [17] and Real Time Monitor [18] obtain information from the Grid services, for example from the Logging and Bookkeeping service [19]. Monitoring data provided by the systems mentioned above is useful for service providers and people operating the distributed infrastructure, but does not necessarily cover the needs of the user communities, since it is missing application-level information.

One possible way to evaluate a Grid infrastructure from the perspective of the user communities is to execute functional tests, which emulate user activity at the Grid sites, and to collect and display the results of these tests. This approach is applied on Grid infrastructures including the WLCG and TeraGrid [35]. For the WLCG this task is performed by the Service Availability Monitoring (SAM) [20], for TeraGrid by the Inca [35] monitoring system. Though monitoring of functional tests is rather effective for detecting problems with the distributed sites and services, it cannot substitute for monitoring of real user activities.

Most of the existing tools which provide monitoring of real user activities were developed as part of the data management and workload management systems of the LHC experiments (PhEDEx, Dirac, PanDA). They provide data transfer and job processing monitoring including application-level information, namely the application exit code, the name of the data collection, the detailed failure reason and the version of the experiment-specific software. The limitation of these systems is that they work in the scope of a single experiment and do not provide common monitoring solutions which can be shared by several virtual organizations.

The Experiment Dashboard aims to combine Grid-related monitoring information with application-related data and to offer, where possible, common monitoring solutions which can work across several virtual organizations.

3 Experiment Dashboard Framework

The common structure of the Experiment Dashboard service consists of information collectors, data repositories, normally implemented in Oracle databases, and user interfaces. The Experiment Dashboard uses multiple sources of information, such as:

• Other monitoring systems, like the Imperial College Real Time Monitor (ICRTM) or the Service Availability Monitoring (SAM)

• gLite Grid services, such as the Logging and Bookkeeping service (LB) or CEMon [21]

• Experiment-specific distributed services, for example the ATLAS Data Management services or distributed Production Agents for CMS

• Experiment central databases (like the PanDA database for ATLAS)

• Experiment client tools for job submission, like Ganga [22] and CRAB [23]

• Jobs instrumented to report directly to the Experiment Dashboard

This list is not exhaustive. Information can be transported from the data sources via various protocols. In most cases, the Experiment Dashboard uses asynchronous communication between the source and the data repository. For several years, in the absence of a messaging system as a standard component of the gLite middleware stack, the MonALISA monitoring system was successfully used as a messaging system for the Experiment Dashboard job monitoring applications. Currently, the Experiment Dashboard is being instrumented to use the Messaging System for the Grid (MSG) [25] for the communication with the information sources.

A common framework providing components for the most common tasks was established to fulfill the needs of the Dashboard applications being developed for all experiments. The schema of the Experiment Dashboard framework is presented in Fig. 1.

The Experiment Dashboard framework is implemented in Python. The tasks performed on a regular basis are implemented by the Dashboard agents. The framework provides all the necessary tools to manage and monitor these "agents", each focusing on a specific subset of the required tasks, such as collection of the input data or computation of the daily statistics summaries.

Fig. 1 The Experiment Dashboard framework schema

To ensure a clear design and maintainability of the system, the definition of the actual monitoring application queries is decoupled from the internal implementation of the data repository. Every monitoring application implemented within the Experiment Dashboard framework comes with the implementation of one or more data access objects (DAO), which represent the "data access interface": a public set of methods for the update and retrieval of information. Access to the database is done using a connection pool to reduce the overhead of creating new connections; this reduces the load on the server and increases performance.
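The DAO layer can be pictured as a thin Python interface backed by a pooled database connection. The following is a minimal sketch under that assumption; the `JobDAO` class, its table and column names, and the use of `cx_Oracle` are illustrative, not the actual Dashboard API.

```python
# Minimal sketch of a DAO backed by a connection pool (illustrative names,
# not the actual Dashboard interface).
import cx_Oracle  # assumes the Oracle client library is available


class JobDAO:
    """Public data-access interface for job monitoring records."""

    def __init__(self, dsn, user, password):
        # A session pool avoids the cost of opening a new connection per request.
        self._pool = cx_Oracle.SessionPool(user, password, dsn,
                                           min=2, max=10, increment=1)

    def update_status(self, grid_job_id, status, timestamp):
        """Store a job status change coming from a collector."""
        conn = self._pool.acquire()
        try:
            cur = conn.cursor()
            cur.execute(
                "UPDATE jobs SET status = :st, last_update = :ts "
                "WHERE grid_job_id = :jid",
                st=status, ts=timestamp, jid=grid_job_id)
            conn.commit()
        finally:
            self._pool.release(conn)

    def jobs_by_site(self, site, since):
        """Retrieve recent jobs processed at a given site."""
        conn = self._pool.acquire()
        try:
            cur = conn.cursor()
            cur.execute(
                "SELECT grid_job_id, status FROM jobs "
                "WHERE site = :site AND last_update > :since",
                site=site, since=since)
            return cur.fetchall()
        finally:
            self._pool.release(conn)
```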

The Experiment Dashboard requests are handled by a system following the Model-View-Controller (MVC) pattern. Requests are handled by the "controller" component, launched by the Apache mod_python extension, which keeps the association between the requested URLs and the corresponding "actions", executing them and returning the data in the format requested by the client. All actions process the request parameters and execute a set of operations, which may involve accessing the database via the DAO layer. When a response is expected, the action stores it in a Python object, which is then transformed into the required format (HTML page, plain XML, CSV, image) by the "view" components. Applying the view to the data is performed automatically by the controller.
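As an illustration only (not the Dashboard's actual classes), an "action" in such an MVC arrangement could look like the sketch below: the controller maps a URL to the action, the action queries the DAO layer (here the hypothetical `JobDAO` from the previous sketch) and returns a plain Python object, and a view later serializes it into the requested format.

```python
# Illustrative sketch of the controller/action/view split (hypothetical names).
class JobSummaryAction:
    """Action bound by the controller to a URL such as /dashboard/jobsummary."""

    def __init__(self, dao):
        self.dao = dao  # injected data access object

    def execute(self, params):
        # Process request parameters and query the database via the DAO layer.
        site = params.get("site", "all")
        since = params.get("since")
        rows = self.dao.jobs_by_site(site, since)
        # Return a plain Python object; a view renders it as HTML/XML/CSV.
        return {"site": site,
                "jobs": [{"id": r[0], "status": r[1]} for r in rows]}


def render_csv(result):
    """A trivial 'view' turning the action result into CSV."""
    lines = ["id,status"]
    lines += ["%s,%s" % (j["id"], j["status"]) for j in result["jobs"]]
    return "\n".join(lines)
```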

All the Experiment Dashboard information can be displayed using HTML, so that it can be viewed in any browser. Moreover, the Experiment Dashboard framework also provides the functionality to retrieve information in XML (Extensible Markup Language), CSV (comma-separated values), JSON (JavaScript Object Notation) or image formats. This flexibility allows the system to be used not only by users but also by other applications. A set of command line tools is also available.

The current Web page frontends are based on XSL style sheet transformations over the XML output of the HTTP requests. In some cases more dynamic Web interfaces are available by issuing asynchronous JavaScript requests from the client browser (AJAX), gradually building the Web page and offering improvements in navigation and performance.

Recently, support for the Google Web Toolkit (GWT) was added to the framework. Some of the applications have started to be migrated to this new client interface model, which brings benefits for both users and developers: compiled code, easier support for all browsers, and out-of-the-box widgets.

All components are included in an automated build system based on the Python distutils, with additional or customized commands enforcing strict development and release procedures. In total, there are more than fifty modules in the framework, fifteen of which are common modules offering functionality shared by all applications.

The modular structure of the Dashboard framework enables a flexible approach to implementing the needs of the customers. For example, for the CMS production system, the Dashboard provides only the implementation of the data repository. Data retrieved from the Dashboard database in XML format is presented to the users via a user interface developed by the CMS production team in the CMS Web-tools framework [26].

Parts of the Dashboard framework are used by other projects, some unrelated to monitoring, as is the case with the ATLAS Distributed Data Management system described below.

4 Experiment Dashboard Monitoring of the Distributed Data Management System

The Experiment Dashboard provides monitoring applications for several data-management tasks, namely monitoring of data transfer for the ALICE experiment, monitoring of the ATLAS Distributed Data Management (DDM) system and monitoring of the File Transfer Service (FTS) [27]. The last one is currently under development.

This section describes in detail the ATLAS DDM Dashboard as an example of the data-management monitoring applications.

4.1 Introduction to ATLAS DDM

The ATLAS DDM Dashboard plays a central role in ATLAS computing operations. It is widely used by the people taking computing shifts. ATLAS DDM monitoring serves about 1,000 unique visitors per month and up to 250,000 pages are viewed daily.

Matchmaking between computing tasks and available resources in ATLAS is done based on the location of the data and requires data proximity for resources to be considered eligible. The ATLAS DDM system is responsible for data placement on resources, and rapid export of datasets is important to ensure full utilization. This makes DDM a complex component with high availability requirements, and it therefore benefits from having a comprehensive monitoring system.

The ATLAS DDM system is composed of different components which interact with each other to perform the different bookkeeping and data movement tasks. The catalogs take care of the bookkeeping, while the site services try to answer the different user requests to spread existing data among the different sites.

With the expected amount of data being in the order of several petabytes per year, and file sizes varying from the controlled production data (bigger and thus fewer files) to the more chaotic user analysis results (typically a higher number of files), having collections of files greatly reduces the dimension of the bookkeeping task and improves performance, as the dataset is also used as the unit of data movement. Being simply an abstraction over the physical data stored in individual files, datasets are also dynamic, as users can add or remove files from the collection, creating new versions of the same dataset.

4.2 Architecture of Dashboard DDM

The Dashboard DDM is composed of two main sets of components: the Web application and all the actions available through its HTTP-based interface, and the agents, each taking care of a specific task. All of them use a single backend database, where all data concerning the monitoring of DDM and Grid fabric components is stored.

The application's most visible use is of course the access to the data using a Web browser. XHTML is the default output format for any query resulting in response data, though other formats are available as mentioned above. The functionality available via the Web interface is depicted in Fig. 2.

The second part of the application is the information collection from the DDM system. It relies on a messaging system pattern, where different messages are delivered to pre-defined queues or topics. Currently HTTP is used in production for the delivery of the messages, but any other messaging backend supported by the Dashboard framework can be used, like MonALISA or the Java Message Service (JMS).


Fig. 2 Web application flow. Users navigate from Activity or Clouds (groups of sites) overview pages down to individual file event pages

The main reason for using HTTP to deliver these messages is the reduced number of external dependencies on the client. The only disadvantage of this option is the lack of reliability in the message delivery. This is a feature that does not exist natively in the HTTP protocol, so a client-side layer with persistent storage had to be developed for cases where the endpoint is down or unreachable. Several modern messaging systems provide this functionality at both the client and broker levels.
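The client-side reliability layer can be approximated by a small store-and-forward queue: messages are written to local persistent storage first and removed only after the HTTP endpoint accepts them. The sketch below is ours, under those assumptions; the endpoint URL and spool layout are hypothetical and the real DDM client differs in detail.

```python
# Sketch of a store-and-forward layer for HTTP message delivery (hypothetical
# endpoint and spool layout; the real DDM client-side layer differs in detail).
import json
import os
import uuid
import urllib.request


class PersistentSender:
    def __init__(self, endpoint, spool_dir="/var/spool/ddm-messages"):
        self.endpoint = endpoint
        self.spool_dir = spool_dir
        os.makedirs(spool_dir, exist_ok=True)

    def send(self, message):
        """Write the message to local storage first, then try to flush the spool."""
        path = os.path.join(self.spool_dir, "%s.json" % uuid.uuid4())
        with open(path, "w") as f:
            json.dump(message, f)
        self.flush()

    def flush(self):
        """Attempt delivery; messages stay on disk while the endpoint is unreachable."""
        for name in sorted(os.listdir(self.spool_dir)):
            path = os.path.join(self.spool_dir, name)
            with open(path, "rb") as f:
                body = f.read()
            req = urllib.request.Request(self.endpoint, data=body,
                                         headers={"Content-Type": "application/json"})
            try:
                urllib.request.urlopen(req, timeout=10)
            except OSError:
                return  # endpoint down or unreachable: keep the remaining messages
            os.remove(path)  # delivered, safe to drop the local copy
```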

Besides the collection of dataset and file transfer information from the site services, there are individual agents performing different tasks in the DDM Dashboard. They cover tasks such as collecting service status and availability from the Site Availability Monitor (SAM) service; monitoring the available space in the storage elements; generating statistics on site and site data link performance (throughput, number of files and datasets moved, average dataset and file completion time, dataset and file time in queue, etc.) and on errors during both transfer and registration; and generating and sending alarms to system operators. This information is gathered by doing periodic queries to the database.

4.3 Performance

Due to the amount of data being collected, a detailed design of the backend database was required. Moreover, the Web application endpoint was also taken into account, not so much because of heavy data querying but because it is also used to collect the different messages coming from the DDM system.

4.3.1 Backend Database

Rough figures used when designing the system indicated half a million datasets being created per year, each having 10 to 10,000 files. With around 80 storage areas at the time (both disk and tape), it was easy to calculate a worst-case scenario, and from these numbers it was decided to rely on the Oracle database service supported by the CERN IT (Information Technology) department, eliminating the need for the setup and support of a non-trivial database infrastructure.

The most impressive number comes not from the datasets or files, but from the individual file transfer state reports, which evolve over time. This is of course temporary data, useful for real-time debugging but less useful once the information becomes old and statistics have been gathered. Even so, the period of time for which this information is kept should be made as long as possible. Oracle partitioning [37] was especially important for the file state history table, but is also used in both the datasets and files tables.
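As a rough illustration of how such a history table could be range-partitioned by time, the sketch below creates monthly partitions so that old monitoring data can later be dropped cheaply partition by partition. The table and column names are ours, not the actual DDM Dashboard schema.

```python
# Illustrative only: range-partitioning a file-state history table by month.
# Table and column names are hypothetical, not the real DDM Dashboard schema.
import cx_Oracle

DDL = """
CREATE TABLE file_state_history (
    file_id      NUMBER,
    dataset_id   NUMBER,
    state        VARCHAR2(32),
    modified     DATE
)
PARTITION BY RANGE (modified) (
    PARTITION p2010_01 VALUES LESS THAN (TO_DATE('2010-02-01', 'YYYY-MM-DD')),
    PARTITION p2010_02 VALUES LESS THAN (TO_DATE('2010-03-01', 'YYYY-MM-DD')),
    PARTITION p_future VALUES LESS THAN (MAXVALUE)
)
"""


def create_history_table(dsn, user, password):
    """Create the partitioned history table (old partitions can be dropped later)."""
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        conn.cursor().execute(DDL)
    finally:
        conn.close()
```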

4.3.2 Web Frontend

Choosing HTTP as the messaging protocol means that a server receiving messages from the different site service instances is hit very hard and should be highly optimized. Monitoring the system has shown peaks of up to 90 requests per second, with five requests per second in normal operation. Considering that most of these (90%) are bulk requests, the load on the service is very high. The choice was to use the Apache Web Server, which provides the required performance.

5 LHC Job Processing and the Experiment Dashboard Applications for Job Monitoring

The LHC job processing activity can be split into two categories: processing raw data and large-scale Monte-Carlo production, and user analysis. The main difference between these categories is that the first is a large-scale, well-organized activity, performed in a coordinated way by a group of experts, while the second is chaotic data processing by members of the huge distributed physics community. Users running physics analysis do not necessarily have deep knowledge of the Grid or profound expertise in computing in general. With the restart of the LHC, a considerable increase in the number of analysis users is expected. Clearly, for both categories of job processing, complete and reliable monitoring is a necessary condition for success.

The organization of the workload management systems of the LHC experiments differs from one experiment to another. While in the case of ALICE and LHCb the job processing is organized via a central queue, in the case of ATLAS and CMS the job submission instances are distributed and there is no central point of control as in ALICE or LHCb. Therefore, job monitoring for ATLAS and CMS is a more complicated task and is not necessarily coupled to a specific workload management system. The Experiment Dashboard provides several job monitoring solutions for various use cases, namely the generic job monitoring applications, monitoring for the ATLAS and CMS production systems, and applications focused on the needs of the analysis users. The generic job monitoring, which is provided for all LHC experiments, is described in more detail in Section 5.1. This application is generic and was used outside the LHC community by the VLEMED virtual organization [36]. Since distributed analysis is currently one of the main challenges for LHC computing, several new applications were built recently on top of the generic job monitoring, mainly for monitoring of analysis jobs. Section 5.2 gives a closer look at the CMS Task Monitoring as an example of the job monitoring applications for user analysis.

5.1 Experiment Dashboard Generic Job Monitoring Application

The overall success of the job processing depends on the performance and stability of the Grid services involved in the job processing and on the services and software which are experiment-specific. Currently, the LHC experiments are using several different Grid middleware platforms and therefore a variety of Grid services. Regardless of the middleware platform, access from the running jobs to the input data as well as saving output files to the remote storage are currently the main reasons for job failures. Stability and performance of the Grid services, like the storage element (SE), storage resource management (SRM) and various transport protocols, are the most critical issues for the quality of the data processing. Furthermore, the success of the user application depends as well on the experiment-specific software distribution at the site, the data management system of the experiment and the access to the alignment and calibration data of the experiment known as "conditions data". These components can have a different implementation for each experiment and they have a very strong impact on the overall success rate of the user jobs. The Dashboard Generic Job Monitoring Application tracks the Grid status of the jobs and the status of the jobs from the application point of view. For the Grid status of the jobs, the Experiment Dashboard uses Grid-related systems as an information source. In the past, the Relational Grid Monitoring Architecture (RGMA) [28] and the Imperial College Real Time Monitor were used as information sources for Grid job status changes. Unfortunately, neither of these systems provided complete and reliable data. Current development aimed at improving the publishing of job status changes by the Grid services involved in the job processing is described in Section 5.1.2. To compensate for the lack of information from the Grid-related sources, the job submission tools of the ATLAS and CMS experiments were instrumented to report job status changes to the Experiment Dashboard system. Every time the job submission tools query the status of the jobs from the Grid services, the status is reported to the Experiment Dashboard. The jobs themselves are instrumented for run-time reporting of their progress at the worker nodes. The information flow of the generic job monitoring application is described in the next section.

5.1.1 Information Flow of the Generic Job Monitoring Application

Similar to the common Dashboard structure described in Section 3, the job monitoring system consists of the central repository for the monitoring data (an Oracle database), collectors, and a Web server that renders the information in HTML, XML, CSV, or as images.

The main principles of the Dashboard job monitoring design are:

• to enable non-intrusive monitoring, which must not have any negative impact on the job processing itself,

• to avoid direct queries to the information sources and to establish asynchronous communication between the information sources and the data repository, whenever possible.

When the development of the job monitoring application started, the gLite middleware did not provide any messaging system, so the Experiment Dashboard uses the MonALISA monitoring system as a messaging system. The job submission tools of the experiments and the jobs themselves are instrumented to report the needed information to the MonALISA server via the apmon library, which uses the UDP protocol. Reporting is implemented in a lightweight way, creating neither monitoring overhead nor any dependency between the reporting of job status information and the job processing itself. A UDP packet that cannot be sent does not prevent the job from being executed. Statistics collected over 5 years show that the proportion of lost UDP packets has never exceeded 5%, and is normally at the level of 2–3%. This level of reliability is acceptable for the monitoring needs.
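In spirit, the reporting call amounts to a fire-and-forget UDP datagram carrying a handful of key-value pairs. The sketch below illustrates this with a plain UDP socket and invented host, port and attribute names; the real jobs use the apmon library and MonALISA's own wire format rather than the ad-hoc encoding shown here.

```python
# Fire-and-forget UDP reporting in the spirit of apmon (illustrative only:
# real jobs use the apmon library and MonALISA's wire format).
import socket


def report_job_status(collector_host, collector_port, grid_job_id, attributes):
    """Send one status update; a delivery failure must never affect the job."""
    payload = ";".join("%s=%s" % (k, v) for k, v in attributes.items())
    message = ("jobid=%s;%s" % (grid_job_id, payload)).encode("utf-8")
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(message, (collector_host, collector_port))
        sock.close()
    except OSError:
        pass  # lost packets are tolerated (observed loss stays below 5%)


# Example: a job wrapper reporting progress from the worker node.
report_job_status("monalisa.example.org", 8884,
                  "https://lb.example.org:9000/abc123",
                  {"status": "Running", "events_processed": 1500})
```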

Every few minutes the Dashboard collectors query the MonALISA server and store job monitoring data in the Dashboard Oracle database. All information received by the MonALISA server is logged. If for any reason the database is not available, monitoring data is not lost. As soon as the database comes online, the Dashboard collector consumes information from the log file and stores it in the database.

The data related to the same job and coming from several sources is correlated via a unique Grid identifier of the job.

Following the outcome of the work of the WLCG monitoring working groups, the existing open source messaging system solutions were evaluated. As a result of this evaluation, Apache ActiveMQ [25] was proposed for the Messaging System for the Grids (MSG). Currently, the Dashboard job monitoring application is instrumented to use MSG in addition to the MonALISA messaging.

The job status shown by the Experiment Dashboard is close to the real-time status. The maximum latency is 5 min, which corresponds to the interval between sequential runs of the Dashboard collectors.

5.1.2 Instrumentation of the Grid Services for Publishing Job Status Information

As mentioned above, information about job status changes provided by Grid-related sources is currently not complete and covers only a subset of jobs. This undermines the trustworthiness of the Dashboard data. Though some job submission tools are instrumented to report job status changes at the point when they query Grid-related sources, this query is done on user request. For example, when a user never requests the status of his jobs and the jobs were aborted, there is no way for the Dashboard to be informed about the abortion of the jobs. As a result, they can stay in "running" or "pending" status, unless turned into "terminated" status with an "unknown" exit code by a so-called timeout Dashboard procedure. To overcome this limitation, the ongoing development aims to instrument the Grid services involved in the job processing to publish job status changes to the MSG. Dashboard collectors consume information from the MSG and store it in the central repository of the job monitoring data. The services which need to be instrumented and the concrete implementation depend on the way the jobs are submitted to the Grid.

In case the jobs are submitted via the gLite Workload Management System (WMS), the LB service keeps full track of the job processing. The LB provides a notification mechanism which allows clients to subscribe to job status change events and to be notified as soon as events matching the conditions specified by the user happen. A new component, the "LB Harvester", was developed in order to register at several LB servers and to maintain the active notification registration for each one. The output module of the harvester formats the job status message according to the MSG schema and publishes it into MSG.
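Since MSG is built on Apache ActiveMQ, the publishing step can be sketched with a STOMP client. The snippet below assumes the third-party `stomp.py` library and uses an invented broker address, topic name and message fields, purely to illustrate the "format according to the MSG schema and publish" step; it is not the actual MSG schema.

```python
# Sketch of publishing a job-status message to an ActiveMQ-based MSG broker
# over STOMP.  Broker address, topic and message fields are assumptions.
import json
import time
import stomp  # third-party stomp.py client


def publish_status_change(grid_job_id, status, site):
    message = {
        "jobId": grid_job_id,      # unique Grid identifier used for correlation
        "status": status,          # e.g. "Done" or "Aborted"
        "site": site,
        "timestamp": int(time.time()),
    }
    conn = stomp.Connection([("msg-broker.example.org", 61613)])
    conn.connect("harvester", "secret", wait=True)
    conn.send(destination="/topic/grid.jobStatus", body=json.dumps(message))
    conn.disconnect()
```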

Jobs submitted to Condor-G [29] do not use the WMS service and correspondingly do not leave a trace in the LB. A job status changes publisher component was developed in collaboration between the Condor and Dashboard teams. The Condor developers have added job log parsing functionality to the Condor standard libraries. The publisher of the job status changes reads new events from the standard Condor event logs, filters the events in question, extracts the essential attributes and publishes them to MSG. The publisher runs in the Condor scheduler as a Condor job. In this case, Condor itself takes care of publishing status changes.

5.1.3 Scalability and Performance

Taking the CMS Dashboard Job Monitoring as an example, the system keeps track of all jobs submitted by the CMS user community. Between 100,000 and 250,000 jobs are submitted daily and up to 30,000 jobs are running in parallel. A single MonALISA server which receives job status update messages can handle up to 3,000 updates per second. The Dashboard collector groups together messages related to the status of a particular job and performs updates of the database. Scaling up the Dashboard job monitoring system can be achieved by increasing the number of MonALISA servers and Dashboard collector instances. Currently, CMS is using 3 MonALISA servers. The actual update rate of the job status information in the database is 100–200 job status updates per second.

As mentioned in Section 3, current development aims to adopt the MSG system as a transport mechanism for the Dashboard job monitoring application. The MSG provides a reliable and scalable message bus for communication between information source and information consumer. The core component of the system is a message broker which routes information from sources to consumers. A network of interconnected message brokers ensures redundancy and scalability. Recent testing of the new job monitoring information flow using MSG confirmed that adopting this technology will increase the reliability and performance of data delivery from the information source to the Dashboard job monitoring repository.

The interactive user interface described in Section 5.1.4 shows the real-time situation regarding job processing in the scope of a single virtual organization. This interface uses raw information from the job monitoring database instance. Tuning of the database configuration and of the SQL queries, together with partitioning of the job monitoring schema, provides sufficient performance for the interactive user interface (2–4 s per request), under the condition that information is requested about recent jobs submitted over the last 48 h.

The job monitoring database instance for the CMS experiment contains information for more than 40 million jobs per year. In order to provide sufficient performance for accessing job monitoring information for time ranges longer than 48 h, raw monitoring data is aggregated in summary tables with hourly, daily and monthly granularity. This task is performed by Oracle scheduled jobs. Depending on the requested time range, the historical interface described in Section 5.1.4 queries information in the corresponding summary table.

5.1.4 Job Monitoring User Interfaces

The standard job monitoring application provides two types of user interfaces. The first is the so-called "interactive user interface", which enables very flexible access to recent monitoring data and shows the job processing for a given VO at run-time. The interactive UI contains the distribution of active jobs and of jobs terminated during a selected time window by their status. Jobs can be sorted by various attributes, for example the type of activity (production, analysis, test, etc.), the site or computing element where they are being processed, the job submission tool, the input dataset, the software version and many others. The information is presented in a bar plot and in a table. A user can navigate to a page with very detailed information about a particular job, for example the exit code and exit reason, important time stamps of the job processing, the number of processed events, etc.

The second is the "historical interface", which shows job statistics distributed over time. The historical view allows following the evolution of numeric metrics such as the number of jobs running in parallel, CPU and wall clock consumption or success rate. The historical view is useful for understanding how the job efficiency behaves over time, how resources are shared between different activities, and how various job failures fluctuate as a function of time.

5.2 Job Monitoring for User Analysis

Distributed analysis on the WLCG infrastructure is currently one of the main challenges of LHC computing. As mentioned earlier, transparent access to the LHC data has to be provided for more than 5,000 scientists all over the world. Analysis users do not necessarily have expertise in Grid computing. The number of active LHC users is steadily growing. According to the Dashboard statistics, since the beginning of 2008 more than 1,000 distinct users of the CMS VO have run analysis jobs on the WLCG infrastructure, and 100–200 distinct CMS users submit their analysis jobs to the WLCG daily. A significant streamlining of operations and a simplification of end-users' interaction with the Grid are required for an effective organization of the LHC user analysis. Simple, user-friendly, reliable monitoring of the analysis jobs is one of the key components of the operation of distributed analysis. Such monitoring is required not only for the physicists running analysis jobs but also for the analysis support teams. In the case of CMS, most of the analysis users interact with the Grid via the CMS Remote Analysis Builder (CRAB). User analysis jobs can be submitted either directly to the WLCG infrastructure or via the CRAB analysis server, which operates on behalf of the user. In the first case, the support team does not have access to the log files of the user's jobs or to the CRAB working directory which keeps track of the task generation. To understand the cause of a problem with a particular user's task, the support team needs a monitoring system capable of providing complete information about the task processing.

To serve the needs of the analysis community and of the analysis support team, the Dashboard Task Monitoring [30] application has been developed on top of the CMS job monitoring repository.

The application provides a comprehensive view of the task processing. It shows the progress of the task, assists in detecting eventual problems and provides debugging information.

The Task Monitoring information includes the status of the individual jobs in the task, their distribution by site and over time, the failure reason, the number of processed events and the resubmission history. According to the feedback of the user community, the tool provides intuitive navigation. Monitoring information is updated every 5 min.

The application offers a wide variety of graphical plots that visually assist the user in understanding the status of the task.

The development was user-driven, with physicists invited to test the prototype in order to assemble further requirements and identify any weaknesses of the application. The monitoring tool has become popular among the CMS users. According to our Web statistics, more than one hundred distinct analysis users are using it for their everyday work.

One of the important improvements foreseen for the Task Monitoring application is failure diagnostics for both Grid and application failures. The ideal situation would be to reach a point where the user does not have to open the log file to search for what went wrong, but rather finds all necessary information within the monitoring tool. A development effort is ongoing to improve the failure diagnostics reported to the Dashboard from the CRAB job wrapper.


Another attempt to improve the understanding of the underlying failure reasons of the user jobs is described in the next section.

5.3 Quick Analysis of Error Sources

Some of the monitoring tools described in this section deliver information about failing Grid jobs, including exit codes, and provide a possible failure reason. However, the exit codes do not always indicate the Grid component which hosts the actual fault responsible for the erroneous behavior of Grid jobs. Instead, a more sophisticated methodology is required to locate problematic Grid components. In this context, we define a Grid component as an entity involved in the execution of a job, such as a Grid service, the software used, a software version, a dataset name, or the user submitting the job. Within the Experiment Dashboard a tool called QAOES (Quick Analysis Of Error Sources) was developed. Logically the system consists of two steps: the first aims to detect potential problems on a Grid infrastructure automatically and the second provides previously collected human expertise about possible solutions to the detected problems. The information is merged and then presented in a Web interface, depicted in Fig. 3.

5.3.1 Expert System and Association Rule Mining

Based on the list of possible problems, we collect human expert knowledge about the underlying problem diagnosis and a possible solution. The expert system consists of a knowledge base keeping all the information in rule-based format and an inference engine, which consults the knowledge base to search for and retrieve rules matching the generated potential problems. This work is in progress. The rules implemented so far cover relatively simple use cases.

Fig. 3 The Quick Analysis of Error Sources Web interface shows one problem and one suggestion to solve the problem, which is retrieved from the expert system

We applied a data mining technique called association rule mining to previously collected job monitoring data. This approach reveals hidden information about the behavior of Grid elements by taking dependencies between the characteristics of failing Grid jobs into account. To generate the association rules we use the Apriori algorithm [31]. The output of this step is a list of associated Grid components which cause a certain exit code, together with quality measures (support, confidence). For example:

{SITE = A, SOFTWARE_VERSION = 5} → ERROR = 7 [s = 1.1%, c = 80%]

This process is executed every hour with job monitoring data of the past 6 h.
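To make the idea concrete, the following is a minimal, brute-force sketch of mining such rules from failed-job records. The attribute names, sample data and thresholds are invented; the production system applies the Apriori algorithm to the job monitoring database rather than this naive enumeration.

```python
# Naive sketch of association-rule mining over job records: for each
# combination of job attributes, compute support and confidence per exit code.
# Attribute names, sample data and thresholds are illustrative only.
from collections import Counter
from itertools import combinations

jobs = [
    {"SITE": "A", "SOFTWARE_VERSION": "5", "ERROR": "7"},
    {"SITE": "A", "SOFTWARE_VERSION": "5", "ERROR": "7"},
    {"SITE": "B", "SOFTWARE_VERSION": "5", "ERROR": "0"},
    {"SITE": "A", "SOFTWARE_VERSION": "4", "ERROR": "0"},
]

MIN_SUPPORT = 0.01     # fraction of all jobs matching antecedent and error
MIN_CONFIDENCE = 0.8   # fraction of antecedent-matching jobs with that error


def mine_rules(jobs, attributes=("SITE", "SOFTWARE_VERSION")):
    total = len(jobs)
    rules = []
    for r in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, r):
            antecedents = Counter(tuple(j[a] for a in attrs) for j in jobs)
            joint = Counter((tuple(j[a] for a in attrs), j["ERROR"]) for j in jobs)
            for (values, error), n in joint.items():
                support = n / total
                confidence = n / antecedents[values]
                if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
                    rules.append((dict(zip(attrs, values)), error, support, confidence))
    return rules


for antecedent, error, s, c in mine_rules(jobs):
    print("%s -> ERROR=%s [s=%.1f%%, c=%.0f%%]" % (antecedent, error, 100 * s, 100 * c))
```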

6 Site Monitoring

Currently the WLCG infrastructure includes more than 140 sites and the number of sites is still increasing. Operating such a large-scale heterogeneous infrastructure requires stable and reliable behavior of the distributed services and of the sites hosting these services. The LHC VOs need to extensively test all relevant aspects of a Grid site, such as the ability to efficiently use the network to transfer data, the functionality of all site services relevant to a given VO and the capability to sustain the VO workflows.

The applications described in this section aim to provide a straightforward way of monitoring the Grid sites from the perspective of a VO.

6.1 Dashboard Site Availability Application Based on the Results of SAM Tests

The Service Availability Monitoring (SAM) is a framework that allows the periodic execution of tests. These tests are set up at all the sites participating in the WLCG to verify the status of the Grid services. In addition to the tests defined by the WLCG operators, each VO can define a set of VO-specific tests that are performed at the sites used by the given VO.

The Dashboard Site Availability application based on the results of SAM tests was originally developed following a request from the CMS experiment. Later on, other experiments found it useful and the application was redesigned to be generic and usable by any experiment using the SAM framework for site evaluation.

The application makes it possible for a VO to define its own site availability definitions, taking into account a given set of tests and services. In this context, VO-specific availability measures the capability of sites to provide certain functionalities. These can be related to a particular service, a set of services or a particular workflow. A single VO can introduce several availability definitions.

To introduce a new availability, the VO needs to define the set of service types to be taken into account for the availability calculation (for example, the computing element) and the set of tests which are critical for each given service type.

The results of the tests are stored in the SAM database. Stored Oracle procedures calculate the availability based on the test results. The Dashboard SAM Web portal displays the results of these tests and the calculated availability.
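The calculation can be pictured with a simple rule: a service instance counts as available when all tests marked critical for its service type pass, and the site is available for a given definition when all required service types are covered. The Python sketch below mirrors that logic under our own simplifying assumptions; in production this is done by stored Oracle procedures, and real VO definitions can be more elaborate.

```python
# Simplified sketch of a VO-specific availability calculation: a service
# instance is "up" if all critical tests for its type pass, and a site is
# available if every required service type has at least one instance up.
# Critical-test sets and test results below are illustrative.

CRITICAL_TESTS = {          # per service type, as defined by the VO
    "CE": {"job-submit", "software-area"},
    "SRM": {"put", "get"},
}


def instance_up(results, service_type):
    """results: dict test_name -> 'ok'/'error' for one service instance."""
    critical = CRITICAL_TESTS.get(service_type, set())
    return all(results.get(test) == "ok" for test in critical)


def site_available(site_services):
    """site_services: list of (service_type, results) tuples for one site."""
    required = set(CRITICAL_TESTS)
    up_types = {stype for stype, res in site_services if instance_up(res, stype)}
    return required <= up_types


# Example: one CE passing its critical tests, one SRM failing the 'get' test.
services = [
    ("CE", {"job-submit": "ok", "software-area": "ok"}),
    ("SRM", {"put": "ok", "get": "error"}),
]
print(site_available(services))  # False: the SRM instance fails a critical test
```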

Two different views are provided: one to check the latest test for a given service; the second to evaluate how a site or a service evolves over time. Figure 4 presents the monthly site availability for all sites supporting CMS during three different time periods: January 2008, January 2009 and July 2009. There has been an enormous improvement both in the number of sites (doubled from 28 to 56) and in the quality provided by the sites. As demonstrated in Fig. 4, at the beginning of 2008 the 18 best sites out of 28 showed a monthly availability between 60% and 80%, while in the summer of 2009 the availability of 47 sites out of 56 was higher than 80%.

Fig. 4 Monthly availability of the sites used by CMS. Every bar corresponds to a particular site; availability is in the range 0–100%. Green corresponds to high availability, red to low availability

The Dashboard SAM Web portal is extensively used by the CMS VO for the site commissioning activity [32], which results in a steady improvement of the state of the sites used by the CMS VO. Similar to other Dashboard applications, information from the Dashboard SAM site availability application can also be retrieved in XML format. This format is used for importing data into the site local fabric monitoring systems.

6.2 Site Status Board

The site status board (SSB) provides a monitoring framework where the monitored metrics can be dynamically (re)defined by the user community. The SSB was designed to give users flexibility in defining the monitored metrics and their criticality and in configuring their views. Adding a new metric or an alternative view is straightforward and is performed via a provided Web interface. The SSB is a generic application and can be used by any VO.

Initially, the SSB was developed for the CMS VO, for the people who take computing shifts. The goal of the computing shifts is to follow up the CMS computing activities on the distributed infrastructure and to take action in case of eventual problems. The display for the computing shifts provides the status of all the sites via a single page with a clear indication of any problem of any nature. Sometimes it is not evident which information has to be taken into account to conclude whether a site is performing well from the perspective of the VO. Multiple iterations are required to define the set of important metrics. Therefore, the SSB was designed to be flexible and to allow the integration of new metrics.

At the moment, the application is used by all four LHC VOs.

Currently, CMS uses the SSB for several purposes:

• Computing shifts: for each site, the SSB displays the status of the services, the efficiency of the job processing for production and analysis, the current number of jobs, the status of the data transfers, information about downtimes, and links to open issues of the site

• Site commissioning: this activity observes the efficiency of test jobs submitted to the sites and the status of the transfer links. CMS has a well-defined procedure that sites have to follow in order to be commissioned and be able to participate in the computing activities of the experiment

• Space monitoring: displays the amount of occupied and free storage space at the different sites

Like many other Dashboard applications, the SSB has three components: collectors that gather information, a data repository, and a Web portal that displays the results. The collectors can fetch each metric from a different source. Moreover, each metric can define its own refresh rate.

VO administrators manage the metrics via an X509-authenticated Web interface. They can define which metrics are critical for each of the different activities. For visualization purposes the metrics can be grouped in a flexible way, and various alternative views are enabled based on this grouping.
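A metric definition in this spirit can be reduced to a source URL, a refresh rate and a criticality flag, with a collector loop polling each source on its own schedule. The sketch below is an illustration under those assumptions; the metric names, URLs, CSV layout and `store()` call are invented, not the SSB's actual configuration format.

```python
# Illustrative sketch of per-metric collectors: each metric declares its own
# source URL and refresh rate; a loop polls the sources and stores the values.
# Metric names, URLs, CSV layout and the store() call are hypothetical.
import time
import urllib.request

METRICS = [
    {"name": "job_efficiency", "url": "https://vo-monitor.example.org/efficiency.csv",
     "refresh_s": 600, "critical": True},
    {"name": "free_space_tb", "url": "https://vo-monitor.example.org/space.csv",
     "refresh_s": 3600, "critical": False},
]


def store(metric_name, site, value, timestamp):
    """Placeholder for writing one (metric, site, value) sample to the repository."""
    print(metric_name, site, value, timestamp)


def collect_once(metric):
    raw = urllib.request.urlopen(metric["url"], timeout=30).read().decode("utf-8")
    now = int(time.time())
    for line in raw.splitlines():
        site, value = line.split(",")[:2]      # assumed "site,value" CSV layout
        store(metric["name"], site.strip(), value.strip(), now)


def collector_loop(metrics):
    last_run = {m["name"]: 0 for m in metrics}
    while True:
        now = time.time()
        for m in metrics:
            if now - last_run[m["name"]] >= m["refresh_s"]:
                collect_once(m)
                last_run[m["name"]] = now
        time.sleep(30)
```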


7 Integration of the VO-Specific Monitoring Systems

The VO-specific monitoring systems work in the scope of a single experiment. Very often different activities of a VO are monitored by different systems. There is no tool providing an overall view of the VO computing activities. On the other hand, a high-level cross-VO view is also missing.

This problem becomes more visible with the growing scale of the activities of the LHC VOs on the WLCG infrastructure and with the intensive, simultaneous use of the WLCG resources by all LHC VOs. Moreover, the administrators of sites serving several LHC VOs complained that it is not convenient to navigate multiple VO-specific monitoring systems in order to understand whether their site is responding well to the needs of all LHC experiments.

The recent development aims to integrate the existing VO-specific monitoring systems and to provide a high-level view combining metrics from each of them. In this section we describe an example of such an application built on top of the VO-specific monitoring systems.

7.1 Siteview

The Siteview application is based on the Experiment Dashboard framework and the GridMap [33] visualization system.

The GridMap application was developed in collaboration between CERN and HP. GridMap provides visualization well suited to distributed systems. The GridMap image represents a map consisting of groups of cells. Each cell can correspond to a single monitored instance or to a group of instances organized in a hierarchical structure. The color of the cell represents the status of a single instance or of a group of instances. The size of the cell can be driven by an important numeric monitoring metric. A mouse click on a cell turns it into a sub-map showing all instances of the corresponding group. There is a good match between the capability of GridMap to present the status of complicated hierarchical structures and the requirements for the visualization of the distributed WLCG infrastructure.

The Siteview integrates data from multiple monitoring systems of the four LHC experiments, namely the MonALISA repository of ALICE, the Dashboard for ATLAS DDM, the ATLAS ProdSys Dashboard, PhEDEx, the CMS Dashboard and Dirac. The repository of the common monitoring metrics is implemented in Oracle. The application provides not only a snapshot view but also the historical distribution of the common metrics. The purpose of the tool is to propagate information about the status of the most important VO activities at a site and to provide site administrators with a single entry point for transparent navigation of the VO-specific monitoring information. The Siteview map publishes pre-cooked links to the relevant information, so that even a person not familiar with the monitoring system that is the primary information source can easily find the corresponding data. The VO-specific monitoring systems publish monitoring data in CSV format via the HTTP protocol. The data is consumed by the Dashboard collectors and stored in the Oracle database. The UI based on GridMap visualization runs on top of the metric repository. The Siteview uses the same database structure as the SSB application described in Section 6.2. The Siteview collectors were implemented following the example of the collectors of the SSB application. Thanks to this, all the features implemented for the SSB are immediately available for the Siteview.
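The collection step can be pictured as fetching a small CSV feed from each VO-specific system and normalizing it into common metrics that drive a GridMap cell (color from the worst status, size from a numeric value). The snippet below is a sketch only; the feed URLs, CSV columns and status vocabulary are our assumptions, not the actual interfaces of the systems listed above.

```python
# Sketch of a Siteview-style collector: fetch a per-VO CSV feed, normalize it
# into common metrics, and derive the GridMap cell attributes for one site.
# Feed URLs, CSV columns and status values are illustrative assumptions.
import csv
import io
import urllib.request

VO_FEEDS = {
    "ATLAS": "https://atlas-monitor.example.org/siteview.csv",
    "CMS":   "https://cms-monitor.example.org/siteview.csv",
}

STATUS_COLOR = {"ok": "green", "degraded": "orange", "error": "red"}


def fetch_metrics(vo, url, site):
    """Return the common metrics for one site from one VO feed.

    Assumed CSV layout: site,activity,status,running_jobs
    """
    text = urllib.request.urlopen(url, timeout=30).read().decode("utf-8")
    rows = [r for r in csv.DictReader(io.StringIO(text)) if r["site"] == site]
    return [{"vo": vo, "activity": r["activity"], "status": r["status"],
             "running_jobs": int(r["running_jobs"])} for r in rows]


def gridmap_cell(metrics):
    """Derive cell color (worst status) and size (total running jobs)."""
    order = ["ok", "degraded", "error"]
    worst = max((m["status"] for m in metrics), key=order.index, default="ok")
    size = sum(m["running_jobs"] for m in metrics)
    return {"color": STATUS_COLOR[worst], "size": size}
```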

An example of a Siteview screenshot is shown in Fig. 5.

The interface shows the map split into four parts. The first one shows the overall status of the site from the perspective of the VOs. The second one is for the job processing, and the third and fourth ones are for the incoming and outgoing transfers. Every cell corresponds to a particular VO. On moving the mouse over a particular cell, a frame shows up. It contains the most important metrics and the links to the primary information sources. The example in Fig. 5 shows job processing metrics. On clicking a particular cell, the sub-map is displayed. It shows the distribution by job processing type for the job processing part or by channel for the transfer parts.


Fig. 5 Example of the site view

8 Collaboration with the User Community

As mentioned in the paper, close collaboration with the user community is essential for the success of the monitoring applications.

One of the principles followed during the development process of the Dashboard system is to involve potential users of a given application in designing the user interface and in validation of the early prototype. In order to understand the needs of the Dashboard users, members of the Dashboard team take part in the LHC computing shifts, attend computing meetings and conduct surveys asking for feedback about recent applications. Moreover, the usage of the Dashboard applications is monitored via Web statistics tools.

User input is taken into account for defining the content of the monitoring data collected in the data repository and the level of its aggregation, and for designing the user interfaces. Depending on the role of a particular group of users, the same monitoring information should be exposed in a different way.

Taking as an example the information about the data collections used by analysis jobs: in the Task Monitoring application described in Section 5.2, the only thing a physicist is interested in knowing is which of his actual tasks read a particular data collection. Site administrators would like to monitor how many users and jobs are accessing a given data collection at their site, for how long, and whether the success rate of user jobs correlates with a particular data collection. Computing teams responsible for data distribution are interested in understanding whether the replication of data collections is done in an optimal way in order to ensure load balancing of the analysis jobs between the various sites. All of these categories of users require a different presentation of the same monitoring data, different time ranges for monitoring the data evolution and various scopes.

User requirements regarding the flexibility which has to be provided by a monitoring application also strongly depend on the category of users. While the Task Monitoring user interface is rather stable and very few feature requests were submitted after validation of the application by the user community, applications used for computing operations constantly require quick changes. Therefore the Site Status Board application, which represents a sort of container that can keep various monitoring metrics and expose them in a customized form, was developed for computing operations.

9 Conclusions

The Experiment Dashboard system became an essential component of the LHC computing operations. The variety of its applications covers the full range of the LHC computing activities. The system is being developed in very close collaboration with the users. As a result, the Experiment Dashboard manages to respond well to the needs of the LHC experiments.

The Dashboard team has repeatedly found generic solutions during the development process which can be shared by all LHC VOs. Rather often, an application initially requested by a single experiment is adapted for the rest of the community at the request of other VOs. This not only saves development effort on the implementation of new requests, but also facilitates the maintenance and support of the existing Dashboard applications. Some of the Dashboard applications, for example the generic job monitoring, have also been used outside the LHC user community, namely by the VLEMED virtual organization.

The Dashboard framework provides all the necessary components for the development and management of the monitoring system and it is being used by other projects.

The large diversity of the VO-specific monitoring tools used by the LHC community creates certain problems: there is no global cross-VO picture of the LHC activities on the WLCG infrastructure. To overcome this limitation, recent development effort has focused on integrating the monitoring data coming from the multiple VO-specific monitoring systems and on presenting it in a consistent way. The Dashboard framework and the GridMap visualization system are used for this purpose.
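As a hedged illustration of such an integration step, the sketch below normalises records from two invented VO-specific feeds into a common schema; the field names are assumptions and are not those of the real VO monitoring systems.

# Hypothetical example of merging VO-specific monitoring records into one
# common schema so that a single cross-VO view can be built.
def normalise(vo, record):
    """Map a VO-specific record onto a common (vo, site, activity, value) form."""
    if vo == "CMS":
        return {"vo": vo, "site": record["SiteName"], "activity": "jobs", "value": record["NJobs"]}
    if vo == "ATLAS":
        return {"vo": vo, "site": record["computingsite"], "activity": "jobs", "value": record["njobs"]}
    raise ValueError(f"unknown VO feed: {vo}")

feeds = {
    "CMS":   [{"SiteName": "T1_X", "NJobs": 900}],
    "ATLAS": [{"computingsite": "T1_X", "njobs": 1200}],
}
common = [normalise(vo, r) for vo, records in feeds.items() for r in records]
print(common)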

The future evolution of the Experiment Dashboard is driven by the needs of the LHC experiments, applying common solutions wherever possible.

Acknowledgements The authors would like to stress that Dashboard development is the result of collaboration with many colleagues in other projects and development teams, such as MonALISA, SAM, Ganga, Condor-G and LB, and with colleagues in the LHC experiments, in particular the developers of the workload and data management systems and the members of the VO computing projects and task-force teams. The authors would like to express their gratitude to all those people for the fruitful collaboration. They would also like to thank the LHC physicists who provided feedback regarding the user analysis monitoring tools. Special thanks to Stefano Belforte and Andrea Sciaba for their valuable contributions and useful feedback.

This work is co-funded by the European Commission through the EGEE-III project (www.eu-egee.org), contract number INFSO-RI-222667, and the results produced made use of the EGEE Grid infrastructure.
