Status of PDC’07 and user analysis issues (from admin point of view)
L. Betev
August 28, 2007
GSI Darmstadt
The ALICE Grid
- Powered by AliEn
- Interfaces to gLite, ARC and (future) OSG WMS
- As of today - 65 entry points (62 sites), 4 continents
  - Africa (1), Asia (4), Europe (53), North America (4)
  - 21 countries, 1 consortium (NDGF)
  - 6 Tier-1 (MSS capacity) sites, 58 Tier-2
- All together - ~5000 CPUs (pledged), 1.5 PB disk, 1.5 PB tape
- Contribution range: from 4 to 1200 CPUs
  - PIII, PIV, Itanium, Xeon, AMD
- All Linux: Mandriva, Suse to Ubuntu, mostly SL3/4, no Gentoo + all possible kernel+gcc combinations
The ALICE Grid (2)
62 active sites
Operation
ALICE offline is:
- Hosting the central AliEn services: Grid catalogue, task queue, job handling, authentication, API services, user registration
- Organising (guided by the requirements of the PWGs) and running the production
- AliEn site services updates and operation (together with the regional experts)
- User analysis support
Sites are:
- Hosting the VO-boxes (interface to site services)
- Operating the local services (gLite and site fabric)
- Providing CPU and storage
This model
- Has been in operation with minor modifications for several years and is working quite well for production
- Requires minor modification to support a large user community - mostly in the area of user support
History of PDCs
- Exercise of the ALICE production model
  - Data production / storage / replication
  - Validation of AliRoot
  - Validation of Grid software and operation
  - User analysis (not yet an integral part of the PDC)
- Since April 2006 the PDC has been running continuously
PDC job history
Average of 1500 CPUs running continuously since April 2006
PDC job history - zoom on last 2 months
2900 jobs on average, saturating all available resources
Site performance
Typical operation:
- Up to 10% of the sites not in production at any given moment
- Half of these are undergoing scheduled upgrades
- The other half - Grid or local services failures
- T1s are in general better in stability than T2s
- Some T2s are much better than any of the T1s
Achieving better stability of the services at the computing centres is a top priority of all parties involved
The central services availability is better than 95%
Production status
Total 85,837,100 events as of 26/08/2007 24:00 hours
Sites contributions
Standard distribution: 50/50 T1/T2 contribution
Relative contribution - Germany
Standard distribution: 50/50 T1/T2 contribution
15% of total
Efficiencies/debugging
Workload management for production
- Under control and is near production quality
  - We keep saying that, but this time we really mean it
- Improvements (speed, stability) are expected with the new gLite version 3.1, still untested
Support and debugging
- The overall situation is much less fragile now
  - Substantial improvements in AliEn and monitoring are making the work of the experts supporting the operations easier
  - gLite services at the sites are (mostly) well understood and supported
- User support is still very much in need of improvement
  - The issues with user analysis are often unique and sometimes lead to development of new functionality
  - But at least the response time (if not the solution) is quick
General
The Grid is getting better
- Running conditions are improving
  - The Grid middleware in general and AliEn in particular are quite stable
  - After long and hard work by the developers
  - Even user analysis, much derided in the past few months, is finally not a painful exercise
- The operation is more streamlined now
  - Better understanding of running conditions and problems by the experts
We continue with the usual PDC’07 programme
- Simulation/reconstruction of MC events
- Validation of new middleware components
- User analysis
- And in addition the Full Dress Rehearsal (FDR)
User analysis issues - short list
Major issues - February/June 2007
- Jobs do not start, are lost, or have missing output
- Input data collections are difficult to handle and impossible to process at once
- Priorities are not set - a single user can ‘grab’ all resources
- Unclear definition of storage elements (Disk/MSS)
User analysis issues - short list (2)
What has been done
- Failover CE for user queue (Grid partition ‘Analysis’)
  - Since 20 June - 100% availability
- Pre-staging of data (available on spinning media) and creation of XML collections centrally
  - The availability of the pre-staged files is checked periodically
- More robust central services (see previous slides)
- Use of a dedicated SE for user files - this will be transparently extended to multiple SEs with quotas
- Priority mechanism (not the final version) put in place
  - We haven’t had reports of unfair use
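The slides do not describe how the priority mechanism works. As a generic illustration only (not the AliEn implementation), one simple way to stop a single user from grabbing all resources is to give the next free slot to the waiting job whose owner currently has the fewest running jobs. The function name `pick_next_job` and the data layout are hypothetical.

```python
from collections import Counter

def pick_next_job(waiting, running_users):
    """Pick the next job to start under a simple fair-share rule.

    waiting:       list of (user, job_id) tuples in submission order (hypothetical layout)
    running_users: list with one entry per running job, naming its owner
    Returns the chosen (user, job_id), or None if nothing is waiting.
    """
    if not waiting:
        return None
    load = Counter(running_users)  # running jobs per user; missing users count as 0
    # Prefer the owner with the fewest running jobs; break ties by submission order.
    _, job = min(enumerate(waiting), key=lambda item: (load[item[1][0]], item[0]))
    return job

# Example: alice already runs two jobs, so bob's job goes first.
waiting_jobs = [("alice", 1), ("alice", 2), ("bob", 3)]
chosen = pick_next_job(waiting_jobs, ["alice", "alice"])
```

A production scheduler would also weigh quotas and job age; this sketch only shows the balancing idea.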
Job completion chart
Standard distribution: 50/50 T1/T2 contribution
User jobs
User analysis issues - current
Storage availability and consistency
- Still very few working SEs - common storage solutions are not yet ‘production’ quality
- The effort is now concentrated on CASTOR2 with xrootd
- Sites (e.g. GSI) are installing large xrootd pools - these are tested and working
- With more SEs holding replicas of the data, the Grid will naturally become more stable
Availability of specific data sets
- Dependent on the storage capacity in operation
- Currently TPC RAW data is being replicated to GSI
- With CASTOR2+xrootd working, the number of events on spinning media will increase 20x
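The claim that more replicas make the Grid more stable can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that SE outages are independent and that every SE has the same availability; both are simplifying assumptions, and the numbers are illustrative, not measurements from the PDC.

```python
def data_availability(se_availability, replicas):
    """Probability that at least one replica of a file is reachable.

    Assumes each of the `replicas` SEs is up with probability
    `se_availability`, independently of the others.
    """
    if not 0.0 <= se_availability <= 1.0:
        raise ValueError("se_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("at least one replica is required")
    # The file is unreachable only if every SE holding it is down.
    return 1.0 - (1.0 - se_availability) ** replicas

# Illustrative values: SEs that are each up 90% of the time.
one_copy = data_availability(0.9, 1)
two_copies = data_availability(0.9, 2)
three_copies = data_availability(0.9, 3)
```

Under these assumptions a second replica cuts the unavailability from 10% to 1%, which is why additional working SEs matter more than small improvements to any single one.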
User analysis issues - current (2)
User applications
- Compatibility of user installation of ROOT, gcc version, OS - a locally compiled application will not necessarily run on the Grid
- All sites are installed with ‘lowest common denominator’ middleware and packages - currently SLC3, gcc v.3.2, while most users have gcc v.3.4
- There is no easy way out until the centres migrate to SL(C)4 and gcc v.3.4
- Meanwhile, the experts are looking into repackaging the Grid apps (most notably gshell)
- Currently the only solution is to always compile ROOT and the user application with the same compiler before submitting to the Grid
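A pre-submission check along these lines could catch the compiler mismatch described above. This is a minimal sketch: the helper names are hypothetical, and treating a matching major.minor gcc version as "the same compiler" is an assumption for illustration, not a guarantee of binary compatibility.

```python
import re

def major_minor(version):
    """Extract (major, minor) from a version string such as '3.4.6' or 'v.3.2'."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))

def same_compiler_series(build_gcc, site_gcc):
    """True if both version strings share the same major.minor gcc series."""
    return major_minor(build_gcc) == major_minor(site_gcc)

# Versions quoted on the slide: users mostly have gcc 3.4, sites run gcc 3.2.
user_vs_site = same_compiler_series("v.3.4", "v.3.2")
root_vs_app = same_compiler_series("3.4.6", "3.4")
```

With the versions quoted on the slide the user-versus-site check fails, which is the case where ROOT and the user application should be rebuilt with a matching compiler before submission.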