1
ATLAS on Grid3/OSG
R. Gardner
December 16, 2004
2
ATLAS Applications
• Pythia Generation
• Geant4 simulation
• Pileup
• Digitization
• Reconstruction
3
ATLAS Users
• DC2 production team
– Managed production
– High priority
– 7 users
• User production
– Opportunistic production and reconstruction
– 3 users, growing
4
ATLAS DC2 on Grid3
• Production statistics on Grid3 (end of November 2004)
#  Job status  Capone Total
1  failed             33165
2  finished           90534
3  running              101
4  submitted             42
• Overall “success” rate: 74%
– Through September: 66%
– During the last 2 months: 78% (finished: 53163, failed: 14353)
• We improved our results since September
• Only 2-3 submit clients now (10-20 in September)
5
Job Success Rate on GRID3
           Passed  Failed  Success Rate
July         8799    6676           57%
August      17083    9448           64%
September   17283    7717           69%
October     26600    5186           84%
November    21869    5038           81%
Key factors in improved success rate:
• Experienced team using common submit hosts
• Quicker response to large-scale site/network/hardware failures
Can we improve more?
• Some shifts >95% success, others <50%
• Automatic throttle for failures? But still lose all running jobs. Do we care?
K. De
+ improvements to Capone/GCE
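The "automatic throttle for failures" raised above could take many forms. The following is only a minimal sketch of one possibility, assuming a sliding window over recent job outcomes; the class name `FailureThrottle` and its parameters (`window`, `max_failure_fraction`, `min_samples`) are hypothetical and are not part of Capone or any Grid3 tool.

```python
from collections import deque


class FailureThrottle:
    """Hypothetical throttle: hold new submissions when recent failures pile up."""

    def __init__(self, window=50, max_failure_fraction=0.5, min_samples=10):
        # Keep only the most recent `window` outcomes (True = finished, False = failed).
        self.outcomes = deque(maxlen=window)
        self.max_failure_fraction = max_failure_fraction
        self.min_samples = min_samples

    def record(self, finished):
        """Store the outcome of one completed job."""
        self.outcomes.append(bool(finished))

    def failure_fraction(self):
        """Fraction of failed jobs among the outcomes currently in the window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def allow_submission(self):
        """Allow submission until enough samples exist, then apply the threshold."""
        if len(self.outcomes) < self.min_samples:
            return True
        return self.failure_fraction() <= self.max_failure_fraction


# Example: a burst of failures closes the throttle, later successes reopen it.
throttle = FailureThrottle(window=10, max_failure_fraction=0.5, min_samples=4)
for _ in range(4):
    throttle.record(False)
print(throttle.allow_submission())
for _ in range(10):
    throttle.record(True)
print(throttle.allow_submission())
```

A throttle of this kind only holds back new submissions; jobs already running at a failing site would still be lost, which is the concern the slide raises.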
6
 #  CE Gatekeeper        Finished+Failed  Jobs Finished  Jobs Failed  Success Rate (%)
 1  BU_ATLAS_Tier2                 19395          16349         3046             84.29
 2  UTA_dpcc                       19214          14634         4580             76.16
 3  UC_ATLAS_Tier2                 13285          11196         2089             84.28
 4  BNL_ATLAS                      11261           8993         2268             79.86
 5  IU_ATLAS_Tier2                 10528           8403         2125             79.82
 6  UM_ATLAS                        9434           6054         3380             64.17
 7  BNL_ATLAS_BAK                   6061           4578         1483             75.53
 8  UBuffalo_CCR                    4654           3992          662             85.78
 9  PDSF                            5075           3590         1485             70.74
10  FNAL_CMS                        3857           2222         1635             57.61
11  CalTech_PG                      3136           2178          958             69.45
12  UCSanDiego_PG                   2828           2101          727             74.29
13  FNAL_CMS2                       2157           1506          651             69.82
14  SMU_Physics_Cluster             1462            969          493             66.28
15  BU_AGT_Tier2                     975            820          155             84.10
16  PSU_Grid3                        769            583          186             75.81
17  OU_OSCER                         843            575          268             68.21
18  UFlorida_PG                      946            451          495             47.67
19  Rice_Grid3                       569            370          199             65.03
20  UWMadison                        803            363          440             45.21
21  UNM_HPC                          502            347          155             69.12
22  OU_OSCER_LSF                     412            251          161             60.92
ATLAS ProdDB
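The success-rate column can be reproduced from the two job counts. A minimal sketch, assuming the rate is defined as finished / (finished + failed); the function name `success_rate` is illustrative, and the three rows are copied from the table above.

```python
def success_rate(finished, failed):
    """Return the percentage of completed jobs that finished successfully."""
    total = finished + failed
    if total == 0:
        raise ValueError("no completed jobs")
    return 100.0 * finished / total


# (finished, failed) counts copied from three rows of the per-site table.
sites = {
    "BU_ATLAS_Tier2": (16349, 3046),
    "UFlorida_PG": (451, 495),
    "UWMadison": (363, 440),
}

# Round to two decimals, matching the precision used in the table.
rates = {name: round(success_rate(*counts), 2) for name, counts in sites.items()}
for name, rate in rates.items():
    print(f"{name:<16} {rate:6.2f}")
```

With this definition the three computed values agree with the table: 84.29, 47.67, and 45.21.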
7
Detailed Job Failures (un-normalized)
Failure     Total, till Nov.  Total, till Sep.  Last 2 months
Submission               894               472            422
Execution                428               428              0
Post Run               10131              1147           8984
Stage-Out              10833              8037           2796
RLS                     1065               989             76
Capone                  3975              2725           1250
Windmill                 564                57            507
Other                   5225              5139             86
TOTAL                  33165             19303          13862
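The three columns are related by subtraction: each "Last 2 months" count is the total through November minus the total through September. A short sketch that recomputes this per failure category and ranks the categories by recent failures, using the numbers from the table; the dictionary layout is just an illustrative choice. The TOTAL row is left out because the listed categories do not add up to it exactly.

```python
# (total through November, total through September) per failure category,
# copied from the table above.
failures = {
    "Submission": (894, 472),
    "Execution": (428, 428),
    "Post Run": (10131, 1147),
    "Stage-Out": (10833, 8037),
    "RLS": (1065, 989),
    "Capone": (3975, 2725),
    "Windmill": (564, 57),
    "Other": (5225, 5139),
}

# Failures in the last 2 months = through November minus through September.
last_two_months = {name: nov - sep for name, (nov, sep) in failures.items()}

# Rank categories by recent failures, largest first.
ranked = sorted(last_two_months.items(), key=lambda item: item[1], reverse=True)
for name, count in ranked:
    print(f"{name:<10} {count:>5}")
```

Ranked this way, Post Run (8984) and Stage-Out (2796) account for most of the failures in the last two months.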
8
Status of GRID3 Jobs
                              evgen        simul         digi       pile-up
Dataset                      Done     %   Done     %   Done     %   Done     %
dc2.003003.B1_jets_180        100  100%  19998  100%  11899   60%  14833   74%
dc2.003028.A9_susy            400  100%  11409   71%   7992   50%
dc2.003034.J1_Pt_17_35          2  100%    400  100%    400  100%
dc2.003035.J2_Pt_35_70          2  100%    400  100%    400  100%
dc2.003036.J3_Pt_70_140         2  100%    400  100%    400  100%
dc2.003037.J4_Pt_140_280        2  100%    400  100%    400  100%
dc2.003038.J5_Pt_280_560        2  100%    400  100%    400  100%
dc2.003039.J6_Pt_560_1120       2  100%    400  100%    400  100%
dc2.003040.J7_Pt_1120_2240      1  100%    200  100%    200  100%
dc2.003041.J8_Pt_2240           1  100%    200  100%    200  100%
dc2.003043.B2_gamjet                      4000  100%   3990  100%
dc2.003054.B3_Bmumu                       4300   86%            0%
dc2.003080.B4_jets17                      9606   96%            0%
To Do – extra A9 simulation, some digitization and some B1 pile-up
Note – also waiting for some B3 and B4 input evgen files from LCG
K. De
9
ATLAS historical use
ACDC archive
10
ATLAS Jobs by site
ACDC archive
11
ATLAS Production - Number of Jobs - 30 November
[Figure: number of jobs versus days for LCG, NorduGrid, Grid3, and Total]
12
Grid3/OSG Resource Availability
• ATLAS expects to run continuous production from now through 2005
• This activity consists of:
– Completion of DC2
– Production for the Rome physics workshop in June
– User production via Capone clients
– Distributed analysis via ADA
• Expect the trend towards resource saturation to continue as more users are equipped with job submission tools
13
Some OSG Issues
• Managed storage is now the biggest problem facing continued DC2 production
– for both access and space management
• Authorization
– role-based, access rights, queue priorities
– policy infrastructure, publication
• Accounting service
– user-level: what resources have been used
– CPU, storage over an arbitrary time period
• Operations
– extend operations protocol between BNL Tier1 and iGOC/OSG operations activity
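To make the accounting request concrete, the sketch below shows the kind of query it implies: per-user CPU and storage totals over an arbitrary time window. Everything in it is an assumption for illustration only: the record layout, the user names, the numbers, and the function `usage_by_user` are hypothetical and do not describe any existing Grid3 or OSG service.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical usage records: (user, job start time, CPU hours, GB written).
# The values are made up purely to exercise the query.
RECORDS = [
    ("user_a", datetime(2004, 9, 20), 50.0, 10.0),
    ("user_a", datetime(2004, 10, 3), 120.0, 40.0),
    ("user_b", datetime(2004, 10, 15), 30.0, 5.0),
    ("user_a", datetime(2004, 11, 28), 80.0, 20.0),
]


def usage_by_user(records, start, end):
    """Sum CPU hours and GB written per user for start <= job time < end."""
    totals = defaultdict(lambda: {"cpu_hours": 0.0, "storage_gb": 0.0})
    for user, when, cpu_hours, storage_gb in records:
        if start <= when < end:
            totals[user]["cpu_hours"] += cpu_hours
            totals[user]["storage_gb"] += storage_gb
    return dict(totals)


# Example query: usage during October and November 2004.
report = usage_by_user(RECORDS, datetime(2004, 10, 1), datetime(2004, 12, 1))
for user, totals in sorted(report.items()):
    print(user, totals["cpu_hours"], totals["storage_gb"])
```

Because the window is passed as two timestamps, the same function covers a shift, a month, or a whole production period.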