
Page 1: Service Challenge Report

Service Challenge Report
Federico Carminati
GDB – January 11, 2006

Page 2: Service Challenge Update

Goals of ALICE-SC3
• Verification of our distributed computing infrastructure
  – Do we have a working solution following our design as presented to the BS-WG?
• Production of meaningful physics
• Not so much a test of the complete computing model as described in the TDR
• In this sense, complementary to what the other experiments did

Page 3: Service Challenge Update

General running statistics
• Event sample (last two months of running); a quick sanity check of these figures follows this slide
  – 22,500 jobs completed (Pb+Pb and p+p)
  – Average duration 8 hours, 67,500 cycles of jobs
  – Total CPU work: 540 kSi2k hours
  – Total output: 20 TB (90% CASTOR2, 10% site SEs)
• Centres participating (22 total)
  – 4 T1s: CERN, CNAF, GridKa, CCIN2P3
  – 18 T2s: Bari (I), Clermont (FR), GSI (D), Houston (USA), ITEP (RUS), JINR (RUS), KNU (UKR), Muenster (D), NIHAM (RO), OSC (USA), PNPI (RUS), SPbSU (RUS), Prague (CZ), RMKI (HU), SARA (NL), Sejong (SK), Torino (I), UiB (NO)
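
The quoted statistics hang together numerically; here is a minimal sanity check in Python. The unit interpretations in the comments are assumptions (the slide does not define "cycles of jobs"), not statements from the source.

```python
# Sanity check of the quoted running statistics.
# Assumed interpretations: "cycles" are job-cycles; TB/GB are decimal.
jobs = 22_500          # completed jobs (Pb+Pb and p+p)
avg_hours = 8          # average job duration
cycles = 67_500        # "cycles of jobs" as quoted
output_tb = 20.0       # total output

print(cycles / jobs)            # 3.0 -> three cycles per job
print(cycles * avg_hours)       # 540,000 -> numerically consistent with
                                #   the quoted 540 k(Si2k) hours of CPU work
print(output_tb * 1e3 / jobs)   # ~0.89 -> roughly 0.9 GB of output per job
```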

Page 4: Service Challenge Update

General running statistics – II
• Jobs done, repartition per site (approximate per-site counts are worked out below)
  – T1s: CERN 19%, CNAF 17%, GridKa 31%, CCIN2P3 22%
    • Very even distribution among the T1s
  – T2s: total of 11%
    • Good stability at: Prague, Torino, NIHAM, Muenster, GSI, OSC
    • Some under-utilization of T2 resources – more centres were available, but could not install the Grid software needed to use them fully
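
For concreteness, the shares above can be converted into approximate job counts using the 22,500-job total from the previous slide; the exact per-site counts are not given in the slides, so these are derived figures only.

```python
# Approximate per-site job counts from the quoted percentages.
total_jobs = 22_500
shares = {
    "CERN": 0.19, "CNAF": 0.17, "GridKa": 0.31,
    "CCIN2P3": 0.22, "T2s (combined)": 0.11,
}
assert abs(sum(shares.values()) - 1.0) < 1e-9  # shares cover all jobs
for site, frac in shares.items():
    print(f"{site:>15}: ~{frac * total_jobs:,.0f} jobs")
# GridKa ~6,975, CCIN2P3 ~4,950, CERN ~4,275, CNAF ~3,825, T2s ~2,475
```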

Page 5: Service Challenge Update

Efficiency
• Event failures
  – 562 jobs with persistent AliRoot failures (after up to 3 retries): 2.5% of the total
  – Errors saving or downloading input files were non-persistent and due to temporary service malfunctions
  – All other error classes (application software area not visible, connectivity issues, black holes) are non-existent with the job-agent model – jobs are simply not pulled from the task queue (TQ); see the sketch after this list
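
(562 of 22,500 jobs is indeed ~2.5%.) The job-agent point is the architecturally interesting one: whole error classes disappear because a worker validates itself before taking work, so a broken node never dequeues a job. A minimal sketch of this pull model follows; all names and checks are illustrative, not the actual AliEn API.

```python
# Sketch of a pull-based job agent: validate the local environment first,
# and only then pull a job from the central task queue (TQ). A broken
# worker node thus becomes a no-op rather than a "black hole".
import os
import shutil

def environment_ok() -> bool:
    """Illustrative sanity checks a real agent might perform."""
    software_visible = os.path.isdir("/opt/alice")      # application area mounted?
    enough_disk = shutil.disk_usage("/tmp").free > 2e9  # scratch space available?
    return software_visible and enough_disk

def run_agent(task_queue):
    if not environment_ok():
        return                  # do nothing: the job stays safely in the TQ
    job = task_queue.pull()     # only a validated node takes work
    if job is not None:
        job.execute()
```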

Page 6: Service Challenge Update

System stress test
• Goals of the test
  – Central services behaviour
    • Many of the past problems (large number of proxies, overload of server machines, etc.) improved with AliEn v2-5 and through redistribution of the central services
  – Site services behaviour (VO-boxes, interaction with LCG)
    • Connection to central services, stability, job submission to the RB: improved with AliEn v2-5
  – CERN SE behaviour (CASTOR2)
    • Overflow of the xrootd tactical buffer: improved with additional protection in the migration scripts (an illustrative guard is sketched below)
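
The slides do not describe the added migration-script protection itself. Purely as an illustration of the kind of guard that prevents a disk buffer from overflowing while files await tape migration, here is a generic sketch; the threshold, paths and callback are all assumptions.

```python
# Illustrative only: hold off new writes into a tactical disk buffer once a
# high-water mark is reached, and let the migrator drain it first.
import shutil
import time

HIGH_WATER = 0.85    # assumed occupancy threshold, not from the source
POLL_SECONDS = 30

def buffer_occupancy(path: str) -> float:
    """Fraction of the buffer filesystem currently used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def wait_for_space(buffer_path: str, trigger_migration) -> None:
    """Block new writes until migration frees space below the threshold."""
    while buffer_occupancy(buffer_path) >= HIGH_WATER:
        trigger_migration()       # ask the migrator to drain the buffer
        time.sleep(POLL_SECONDS)  # re-check after the migrator has run
```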

Page 7: Service Challenge Update

System stress test
• General targets (derived figures are worked out below)
  – Number of concurrently running jobs: 2,500 over 24 hours (7,500 jobs total)
  – Storage: CASTOR2, 15K files (2 per job), each file an archive of 5 ROOT files, 7.5 TB total
• Special target
  – GridKa provides 1,200 job slots – a test of the VO-box
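
These targets are mutually consistent; a quick derivation, assuming the 8-hour average job duration quoted on the statistics slide:

```python
# Derived figures for the stress-test targets.
slots = 2_500          # concurrently running jobs
total_jobs = 7_500     # jobs per 24 hours
files_per_job = 2
total_tb = 7.5

print(total_jobs / slots)          # 3.0 turnovers per slot per day,
                                   #   matching 24 h / 8 h per job
print(total_jobs * files_per_job)  # 15,000 files, as quoted
print(total_tb * 1e3 / (total_jobs * files_per_job))  # 0.5 GB per file
```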

Page 8: Service Challenge Update

Running job profile
[Figure: number of concurrently running jobs over time, peaking at 2,450 jobs. Negative slope: see results (4).]

Page 9: Service Challenge Update

Results
• VO-box behaviour
  – No problems with the services running; no interventions necessary
  – Load profile on the VO-boxes: on average proportional to the number of jobs running at the site, nothing special
[Figures: VO-box load profiles at CERN and GridKa]

Page 10: Service Challenge Update

ALICE VO-specific and LCG software
• Positive
  – Stable set of central services for production (user authentication, catalogue, single task queue, submission and retrieval of jobs)
  – Well-established (and simple) installation methods for the ALICE-specific VO-box software
  – Good integration with the LCG VO-box stack
  – Demonstrated scalability and robustness of the VO-box model
  – Successful mass job submission through the LCG WMS
• Issues
  – Rapid updates of the MW make it problematic to include more computing centres on a stable basis
    • However, all centres in the SC3 plan were kept up to date
    • Essentially due to a limited number of experts, currently focused on software development
  – Not all services thoroughly tested, in particular LFC and FTS

Page 11: Service Challenge Update

Operations
• Positive
  – Four months of continuous MC generation and reconstruction of data
  – Physics content of the events requested by the ALICE PWGs
  – Up to 2,400 simultaneously running jobs, repartitioned over 4 T1s (CCIN2P3, CNAF, GridKa, CERN) and over 10 T2s, with a fair share of jobs over the T1s
  – Large number of events available for user analysis
  – Very good adjustment of the centres to changes of operation and, in general, stable operation over time
• Issues
  – Utilization of resources not flat, due to changes in the software and the required tune-in at each centre
  – Some communication problems with the centres on exactly what the VO is doing and how – being remedied now by the Task Force
  – Lack of full documentation is preventing more site experts from participating in the VO operation and support

Page 12: Service Challenge Update

Storage management
• Positive
  – Full migration of storage to CASTOR2; extended tests of writing/reading and stability of operation
  – Most limitations overcome with the new system
  – Fast response time of the experts to operational issues and quick resolution of problems
  – Extensive tests of xrootd as file transport protocol and tactical SE
• Issues
  – CASTOR2 still lacks functionality compared to CASTOR
  – Badly need uniform SRM functionality across platforms and back-ends, allowing for general (not only FTS-type) SE tests, especially for user data analysis
  – Too much "plumbing" still lying bare; higher-level services needed for FTS

Page 13: Service Challenge Update

Support
• Positive
  – Very good interaction with the LCG deployment team and site experts
  – The focused Task Force meeting, with participation of VO, LCG and computing-centre experts, has proven to be a very efficient discussion forum
• Issues
  – The relative immaturity of the software necessitates a high level of expertise for deployment and support
  – Not clear how to "bring together" the experience of the different Task Forces

Page 14: Service Challenge Update

Application software
• Positive
  – Extensive testing "on the Grid" and on several platforms (ia32, ia64, Opteron) of all basic components of the ALICE offline software (ROOT, AliRoot), allowing for code debugging and optimization
  – Less than 2% job failure rate due to application software problems
• Issues
  – Not all centres (especially T2s) can cope with the high demands of the applications in terms of memory/CPU utilization
    • This is to be urgently addressed in the hardware guidelines

Page 15: Service Challenge Update

The Future
• We (ALICE) have a working system
  – Or rather a reasonable path to it
• Computing centres have learned a lot, and the wood is starting to appear beyond the trees
• Now we need to test the computing model as described in the TDR
• And in particular the analysis, which is basically untested!
• SC4 is our next (and last) chance
• However, we are very worried about timing and stability

Page 16: Service Challenge Update

SC4
• Whatever is there at SC4 will be there for LHC startup
• …and conversely
• Planning needs to be done in a couple of weeks
  – Timeline
  – Prioritisation and "weighting" of tasks (F. Donno's list)
  – Efficient distribution of tasks
• Focusing of efforts is extremely important
  – EGEE, Deployment and the experiments must be working towards the same target and the same objectives on the same products
• We may still make it, but we need a "change in derivative"

Page 17: Service Challenge Update

SC4
• Contingency for ALICE is zero now
• We can only fix and optimise what is there; we have no time or effort for any change of plan
  – Hoping that no major design problem is lurking there
• We believe the same is true for everybody
  – See J. Shiers' presentation at this meeting: "it takes a long time to set up a production system"

Page 18: Service Challenge Update

Schedule – ALICE internal (the daily volumes implied by these rates are worked out after the list)
• January 2006
  – Rerun of SC3 disk-disk transfers (max 150 MB/s)
  – We should get ready to do this with the current data, triggered via AliEn jobs or scheduled transfers
• March 2006
  – T0-T1 "loop-back" tests at 2 × nominal rate (CERN)
  – We run our bulk production and send data back to CERN (when do we start?)
  – (We get ready with proof@caf)
• April 2006
  – T0-T1 disk-disk (nominal rates), disk-tape (50-75 MB/s)
  – First chance to push out data, reconstruction at CERN
  – (First tests with proof@caf)
• July 2006
  – T0-T1 disk-tape (nominal rates)
  – T1-T1, T1-T2, T2-T1 and other rates TBD according to the CTDRs
  – Second chance to push out the data
  – Reconstruction at CERN and remote centres
• September 2006
  – Scheduled analysis challenge
  – Unscheduled challenge (target T2s?)
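
To put the quoted rates in perspective, a sketch converting sustained rates into daily volumes (decimal units assumed, 1 TB = 10^6 MB):

```python
# Daily transfer volume implied by a sustained rate in MB/s.
def tb_per_day(mb_per_s: float) -> float:
    return mb_per_s * 86_400 / 1e6   # seconds per day, MB per TB

print(tb_per_day(150))   # ~13.0 TB/day: SC3 disk-disk rerun at max rate
print(tb_per_day(50))    # ~4.3 TB/day:  disk-tape, lower end
print(tb_per_day(75))    # ~6.5 TB/day:  disk-tape, upper end
```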

Page 19: Service Challenge Update

Conclusions
• Continuous tests of the ALICE computing model and adoption/tests of the central and distributed Grid services, as discussed and agreed by the Baseline Services group
• Much better understanding of the functioning of the entire software chain (capabilities/limitations)
  – Once more showed the need to concentrate on a set of well-defined services
• Zero time for integration of untested tools and software
  – Especially if they require substantial modification of the existing ones
• Must put all efforts this year into making the existing services "production quality", and especially into end-user analysis