U.S. ATLAS Tier-1 Site Report
Michael Ernst [email protected]
U.S. ATLAS Facilities Workshop – March 23, 2015
Tier-1 High Value Equipment Deployment Plan

• Capacity planning based on Pledges (23% author share) + 20% for US Physicists
• FY15 Equipment Deployment
  – Equipment (i.e. CPU, disk, central servers) replenished after 4–5 years of operation
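The planning rule above is simple arithmetic; a minimal sketch, where the ATLAS-wide requirement figure is a made-up illustrative number, not one from the slides:

```python
# Capacity-planning rule from the slide: provision the U.S. pledge
# (23% author share of the ATLAS-wide requirement) plus an extra 20%
# on top for U.S. physicists.

def tier1_capacity(atlas_requirement, author_share=0.23, us_extra=0.20):
    """Return the target capacity for the U.S. Tier-1."""
    pledge = atlas_requirement * author_share
    return pledge * (1.0 + us_extra)

# Example with a hypothetical ATLAS-wide CPU requirement of 100,000 units:
# pledge = 23,000, plus 20% -> ~27,600
print(tier1_capacity(100_000))
```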
Tier-1 Middleware Deployment (1/2)

• The middleware services below run on VMs
• 8 CEs; two of them accept both OSG and ATLAS jobs, the other six are dedicated to ATLAS jobs:
  – gridgk01 (BNL_ATLAS_1): GRAM CE, for all OSG VOs, OSG release 3.2.13
  – gridgk02 (BNL_ATLAS_2): HTCondor CE, for all OSG VOs, OSG release 3.2.16
  – gridgk03 (BNL_ATLAS_3): GRAM CE, OSG release 3.2.13
  – gridgk04 (BNL_ATLAS_4): GRAM CE, OSG release 3.2.13
  – gridgk05 (BNL_ATLAS_5): GRAM CE, OSG release 3.2.13
  – gridgk06 (BNL_ATLAS_6): GRAM CE, OSG release 3.2.13
  – gridgk07 (BNL_ATLAS_7): HTCondor CE, OSG release 3.2.18
  – gridgk08 (BNL_ATLAS_8): GRAM CE, OSG release 3.2.19
Tier-1 Middleware Deployment (2/2)

• We have two GUMS servers:
  – gums.racf.bnl.gov, GUMS 1.3.18
  – gumsdev.racf.bnl.gov, GUMS 1.3.18
  They are configured identically, although one is production and the other is development.
• We have one RSV monitoring host that runs RSV probes against all CEs and the dCache SE:
  – services02.usatlas.bnl.gov, OSG 3.2.12
• 3 redundant Condor-G submit hosts for ATLAS APF
https://www.racf.bnl.gov/docs/services/cvmfs/info
Previous, Ongoing and New ATLAS Cloud Efforts/Activities

John Hover

Previous/Ongoing
• Some smaller-scale commercial cloud R&D (Amazon, Google) performed at BNL and CERN
• Currently running at medium scale (1000 jobs) on ~10 academic clouds using the University of Victoria Cloud Scheduler
• Heavy investment in/conversion to OpenStack at CERN; the majority of Central Services now run on the Agile Infrastructure
• OpenStack cluster at Brookhaven National Lab (720 cores)

New
• Collaboration with AWS and ESnet on large-scale exploration of AWS resources for ATLAS Production and Analysis (~30k concurrent jobs in Nov)
• Aiming to scale up to ~100k concurrent jobs running in 3 AWS regions, with 100G AWS–ESnet network peerings
• Use of S3 storage AWS-internally and between AWS and ATLAS object storage instances
[Diagram: ATLAS production data distribution services on Amazon Web Services, with the us-east-1 region as the example]
• BeStMan SE distributed among 3 EC2 VMs: 1 SRM and 2 GridFTP servers
• Data paths: SRM/GridFTP protocol data transmission; S3/HTTP(S) direct access via FTS or APIs; S3/HTTP(S) via S3FS
• s3fs (FUSE-based file system): 3 buckets mapped into 3 mount points per VM/region — ATLASUSERDISK, ATLASPRODDISK, ATLASDATADISK
“Impedance mismatch” between commercial and scientific computing
ATLAS Autonomous Multicore Provisioning
Dynamic Allocation with Condor
William Strecker-Kellogg
ATLAS Tree Structure

• Use hierarchical group-quotas in Condor
  – Leaf nodes in the hierarchy get jobs submitted to them and correspond 1:1 with PanDA queues
  – Surplus resources from underutilized queues are automatically allocated to other, busier queues
• Quotas determine the steady-state allocation when all queues are busy
  – The quota of a parent group is the sum of its children's quotas

(see next slide for diagram)
ATLAS Tree Structure (quotas in parentheses)

<root>
├── atlas (12000)
│   ├── analysis (2000)
│   │   ├── short (1000)
│   │   └── long (1000)
│   └── prod (10000)
│       ├── himem (1000)
│       ├── single (3500)
│       └── mcore (5500)
└── grid (40)
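The group hierarchy and its quota rule (parent quota = sum of children's quotas) can be modeled in a few lines of Python; this is an illustrative sketch, not code from the site:

```python
# Minimal model of the hierarchical group-quota tree from the slides.
# Quotas are normalized in units of CPUs; a parent group's quota is
# derived as the sum of its children's quotas.

class Group:
    def __init__(self, name, quota=None, children=()):
        self.name = name
        self.children = list(children)
        # Leaf groups carry explicit quotas; parents derive theirs.
        self.quota = quota if quota is not None else sum(c.quota for c in self.children)

root = Group("<root>", children=[
    Group("atlas", children=[
        Group("analysis", children=[
            Group("short", 1000), Group("long", 1000)]),
        Group("prod", children=[
            Group("himem", 1000), Group("single", 3500), Group("mcore", 5500)]),
    ]),
    Group("grid", 40),
])

atlas = root.children[0]
print(atlas.quota)  # 2000 + 10000 = 12000, matching the diagram
```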
Surplus Sharing

• Surplus sharing is controlled by a boolean accept_surplus flag on each queue
  – Quotas/surplus are normalized in units of CPUs
• Groups with the flag can share with their siblings
  – Parent groups with the flag allow surplus to "flow down" the tree from their siblings to their children
  – Parent groups without the accept_surplus flag constrain surplus-sharing to among their children
Surplus Sharing

• Scenario: analysis has a quota of 2000 and no accept_surplus; short and long have a quota of 1000 each and accept_surplus on
  – short=1600, long=400 … possible
  – short=1500, long=700 … impossible (violates the analysis quota)
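The feasibility rule in this scenario reduces to one constraint, sketched here for illustration:

```python
# Surplus-sharing scenario from the slide: short and long (quota 1000
# each, accept_surplus on) may exceed their own quotas by borrowing
# from each other, but because their parent "analysis" has no
# accept_surplus flag, their combined usage can never exceed the
# analysis quota of 2000.

ANALYSIS_QUOTA = 2000

def feasible(short, long_):
    """Children may trade surplus, but the parent's cap still holds."""
    return short + long_ <= ANALYSIS_QUOTA

print(feasible(1600, 400))   # True  -- the "possible" case
print(feasible(1500, 700))   # False -- violates the analysis quota
```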
Partitionable Slots

• Each batch node is configured to be partitioned into arbitrary slices of CPUs
  – Condor terminology: partitionable slots are automatically sliced into dynamic slots
• Multicore jobs are thus accommodated with no administrative effort
  – The farm is filled depth-first (the default is breadth-first) to reduce fragmentation
  – Only minimal (~1–2%) defragmentation is necessary
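A toy simulation illustrates why depth-first filling reduces fragmentation; the node counts and job numbers below are invented for the example:

```python
# Depth-first vs breadth-first filling of 8-core nodes with 1-core
# jobs. Depth-first packs the fullest node first, leaving whole nodes
# free for 8-core jobs; breadth-first spreads jobs out and fragments
# every node.

CORES = 8

def place(free, jobs, depth_first):
    """free: list of free-core counts per node; place `jobs` 1-core jobs."""
    for _ in range(jobs):
        candidates = [i for i, f in enumerate(free) if f > 0]
        if not candidates:
            break
        # depth-first: node with the fewest free cores (but > 0) first;
        # breadth-first: node with the most free cores first.
        key = (lambda i: free[i]) if depth_first else (lambda i: -free[i])
        free[min(candidates, key=key)] -= 1
    return free

depth = place([CORES] * 4, 10, depth_first=True)
breadth = place([CORES] * 4, 10, depth_first=False)
print(sum(f >= CORES for f in depth))    # 2 whole nodes left for 8-core jobs
print(sum(f >= CORES for f in breadth))  # 0 -- every node is fragmented
```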
Where's the problem?

• Everything works perfectly when all jobs are single-core
• However… multicore jobs cannot compete fairly for surplus resources
  – Negotiation is greedy: if 7 slots are free, they won't match an 8-core job but will match 7 single-core jobs in the same cycle
• If any multicore queues compete for surplus with single-core queues, the multicore queues will always lose
• A solution outside Condor is needed
  – The ultimate goal is to maximize farm utilization
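The greedy-matching starvation can be shown with a few lines (a schematic model of the negotiation cycle, not Condor's actual negotiator):

```python
# With 7 free cores, a greedy matcher skips the 8-core job and hands
# the cores to single-core jobs in the same cycle -- the multicore job
# starves even though it is first in line.

def negotiate(free_cores, jobs):
    """jobs: list of per-job core requests, considered greedily in order."""
    matched = []
    for req in jobs:
        if req <= free_cores:
            free_cores -= req
            matched.append(req)
    return matched

# An 8-core job queued ahead of plenty of single-core jobs, 7 cores free:
print(negotiate(7, [8] + [1] * 10))  # [1, 1, 1, 1, 1, 1, 1]
```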
Dynamic Allocation

• A script watches PanDA queues for demand
  – Queues that have few or no pending jobs are considered empty
  – Short spikes are smoothed out in the demand calculation
• The script is aware of Condor's group structure
  – It builds the tree dynamically from a database
  – This facilitates altering the group hierarchy without rewriting the script
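One way to smooth out short spikes, sketched here with an invented window size and threshold (the slides do not specify the actual algorithm):

```python
# A queue counts as "busy" only if pending-job demand persists across
# a whole sampling window, so a single short spike is smoothed away.

from collections import deque

class DemandTracker:
    def __init__(self, window=6, threshold=10):
        self.samples = deque(maxlen=window)   # recent pending-job counts
        self.threshold = threshold            # below this ~ "empty"

    def add(self, pending):
        self.samples.append(pending)

    def busy(self):
        # Require sustained demand over the entire window.
        return (len(self.samples) == self.samples.maxlen
                and min(self.samples) >= self.threshold)

t = DemandTracker()
for pending in [0, 0, 500, 0, 0, 0]:  # one short spike...
    t.add(pending)
print(t.busy())                       # False -- the spike is smoothed out
```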
Dynamic Allocation

• The script determines which queues are able to accept_surplus
  – Based on comparing the "weight" of queues
• Weight is defined as the size of a job in the queue (number of cores)
  – Able to cope with any combination of demands
  – Prevents starvation by allowing surplus into the "heaviest" queues first
  – Avoids single-core and multicore queues competing for the same resources
  – Can shift the balance between entire sub-trees in the hierarchy (e.g. analysis <--> production)
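The heaviest-first policy can be sketched as follows; the queue names echo the hierarchy in the slides, but the data structure and selection logic are illustrative assumptions, not the site's actual script:

```python
# Among queues with demand, surplus is granted to the "heaviest"
# queue (largest per-job core count) first, so multicore queues are
# never outcompeted by single-core ones in the same cycle.

def assign_surplus(queues):
    """queues: {name: {"weight": cores_per_job, "demand": bool}}.
    Returns the set of queues whose accept_surplus flag gets turned on."""
    demanding = [q for q, info in queues.items() if info["demand"]]
    if not demanding:
        return set()
    heaviest = max(demanding, key=lambda q: queues[q]["weight"])
    return {heaviest}

queues = {
    "mcore":  {"weight": 8, "demand": True},
    "single": {"weight": 1, "demand": True},
    "himem":  {"weight": 1, "demand": False},
}
print(assign_surplus(queues))  # {'mcore'} -- surplus goes multicore-first
```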
Results
Results
• Dips in serial production jobs (magenta) are filled in by multicore jobs (pink)– Some inefficiency remains due to fragmentation
• There is some irreducible average wait-time for 8 cores on a single machine to become free
• Results look promising, will even allow opportunistic workload to backfill if all ATLAS queues drain– Currently impossible as Condor doesn’t support
preemption of dynamic slots… the Condor team is close to providing a solution
OSG Opportunistic Usage at the Tier-1 Center

Bo Jayatilaka
Simone Campana - ATLAS SW&C Week
Tier-1 Production Network Connectivity

• BNL is connected to ESnet at 200G
• The Tier-1 facility is connected to ESnet at 200G via the BNL Science DMZ
• 30G OPN production circuit
• 10G OPN backup circuit
• 40G General IP Services
• 100G LHCONE production circuit
• All circuits can "burst" to the maximum of 200G, depending on available bandwidth
BNL WAN Connectivity in 2013

[Diagram: OPN, R&E + Virtual Circuits, and LHCONE links, with 2.5 Gbps and 3.5 Gbps flows shown; from ATLAS Software and Computing Week - October 24, 2013]
BNL Perimeter and Science DMZ
Current Implementation
[Diagram: current implementation — 12 PB of disk storage and the worker nodes (WNs)]
LAN Connectivity: WN ↔ T1 Disk Storage

• All WNs connected at 1 Gbps
• Typical bandwidth 10–20 MB/s, peaks at 50 MB/s
• Analysis queues configured for direct read
BNL in ATLAS Distributed Data Management (Apr – Dec 2012)

Data transfer activities between BNL and other ATLAS Sites:
• Monthly average transfer rate up to 800 MB/s; daily peaks have been observed 5 times higher

[Plot: ATLAS Distributed Data Management Dashboard (MB/s), BNL in navy blue; Data Export (BNL to ATLAS T1s & T2s, 13.5 PB) and Data Import (2 PB)]
CERN/T1 -> BNL Transfer Performance

• Regular ATLAS Production + Test Traffic
• Observations (all in the context of ATLAS)
  – Never exceeded ~50 Gbit/s
  – CERN (ATLAS EOS) -> BNL limited at ~1.5 GB/s
• Achieved >60 Gbit/s between 2 hosts at CERN and BNL
FTS 3 Service at BNL
From BNL to T1s
From T1s to BNL
From T1s and T2s to BNL in 2012 - 2013