
Page 1: Condor at Brookhaven

Condor at Brookhaven

Xin Zhao, Antonio Chan
Brookhaven National Lab

CondorWeek 2009
Tuesday, April 21

Page 2: Condor at Brookhaven

Outline

• RACF background
• RACF Condor batch system
• USATLAS grid job submission using Condor-G

Page 3: Condor at Brookhaven

RACF

• Brookhaven (BNL) is a multi-disciplinary DOE lab.
• The RHIC and ATLAS Computing Facility (RACF) provides computing support for BNL activities in HEP, NP, Astrophysics, etc.
  – RHIC Tier0
  – USATLAS Tier1
• Large installation – 7000+ CPUs, 5+ PB of storage, 6 robotic silos with a capacity of 49,000+ tapes.
• Storage and computing to grow by a factor of ~5 by 2012.

Page 4: Condor at Brookhaven

New Data Center Rising

• The new data center will increase floor space by a factor of ~2 in summer 2009.

Page 5: Condor at Brookhaven

BNL Condor Batch System

• Introduced in 2003 to replace LSF.
• Steep learning curve – much help from Condor staff.
• Extremely successful implementation.
• Complex use of job slots (formerly VMs) to determine job priority (queues), eviction, suspension, and back-filling policies (a minimal policy sketch follows).
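To make the slot-based policy idea concrete, here is a minimal condor_config sketch of the kind of per-slot expressions involved; the slot split, the JobType attribute, and the thresholds are hypothetical illustrations, not RACF's actual settings.

# Hypothetical condor_config fragment illustrating per-slot policy
# (illustration only, not the actual RACF configuration).
# "JobType" is an assumed custom job attribute.
IsShortJob = (TARGET.JobType =?= "short")

# Slots 1-2 take any job; remaining slots only back-fill short jobs.
START   = (SlotID <= 2) || $(IsShortJob)

# Suspend a back-filled short job while the machine load is high.
SUSPEND = $(IsShortJob) && (LoadAvg > 1.0)

# Evict a job that has been suspended for more than an hour.
PREEMPT = (Activity == "Suspended") && ((CurrentTime - EnteredCurrentActivity) > 3600)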

Page 6: Condor at Brookhaven

Condor Queues

• Originally designed for vertical scalability
  – Complex queue priority configuration per core
  – Maintainable on older hardware with fewer cores
• Changed to horizontal scalability in 2008
  – More and more multi-core hardware now
  – Simplified queue priority configuration per core
  – Reduces administrative overhead

Page 7: Condor at Brookhaven

Condor Policy for ATLAS (old)

Page 8: Condor at Brookhaven

ATLAS Condor configuration (old)

Page 9: Condor at Brookhaven

Condor Policy @ BNL

Page 10: Condor at Brookhaven

ATLAS Condor configuration (new)

Page 11: Condor at Brookhaven

Condor Queue Usage

Page 12: Condor at Brookhaven

Job Slot Occupancy (RACF)

• Left-hand plot is for 01/2007 to 06/2007.
• Right-hand plot is for 06/2007 to 05/2008.
• Occupancy remained at 94% between the two periods.

Page 13: Condor at Brookhaven

Job Statistics (2008)

• Condor usage by RHIC experiments increased by 50% (in number of jobs) and by 41% (in CPU time) since 2007.
• PHENIX executed ~50% of its jobs in the general queue.
• General queue jobs amounted to 37% of all RHIC Condor jobs during this period.
• General queue efficiency increased from 87% to 94% since 2007.

Page 14: Condor at Brookhaven

Near-Term Plans

• Continue integration of Condor with Xen virtual systems.

• OS upgrade to 64-bit SL5.x – any issues with Condor?

• Condor upgrade from 6.8.5 to the stable 7.2.x series.
• Short on manpower – open Condor admin position at BNL. If interested, please talk to Tony Chan.

Page 15: Condor at Brookhaven

Condor-G Grid job submission

• BNL, as USATLAS Tier1, provides support to the ATLAS PanDA production system.

PanDA Job Flow

Page 16: Condor at Brookhaven

• One critical service is maintaining PanDA autopilot submission using Condor-G (a sample submit file is sketched below)
  – Very large number (~15,000) of concurrent pilot jobs as a single user
  – Need to maintain a very high submission rate
• Autopilot attempts to always keep a set number of pending jobs in every queue at every remote USATLAS production site
  – Three Condor-G submit hosts in production
    • Quad-core Intel Xeon E5430 @ 2.66GHz, 16 GB memory, and two 750 GB SATA drives (mirrored disks)
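As a reference point, a minimal grid-universe submit description for a single pilot might look like the sketch below; the gatekeeper contact string and file names are placeholders, not the actual autopilot configuration.

# Hypothetical Condor-G submit file for a single pilot job
# (gatekeeper and file names are placeholders).
universe       = grid
grid_resource  = gt2 gatekeeper.example.edu/jobmanager-condor
executable     = pilot.sh
output         = pilot.$(Cluster).$(Process).out
error          = pilot.$(Cluster).$(Process).err
log            = pilot.log
queue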

Page 17: Condor at Brookhaven

• We work closely with the Condor team to tune Condor-G for better performance. Many improvements have been suggested and implemented by the Condor team.

Weekly OSG Gratia Job Count Report for USATLAS VO

Page 18: Condor at Brookhaven

New Features and Tuning of Condor-G submission

(not a complete list)

Page 19: Condor at Brookhaven

• The gridmanager publishes resource ClassAds to the collector, so users can easily query the grid job submission status for all remote resources:

$> condor_status -grid

Name                  Job Limit  Running  Submit Limit  In Progress
gt2 atlas.bu.edu:211       2500      376           200            0
gt2 gridgk04.racf.bn       2500        1           200            0
gt2 heroatlas.fas.ha       2500      100           200            0
gt2 osgserv01.slac.s       2500      611           200            0
gt2 osgx0.hep.uiuc.e       2500        5           200            0
gt2 tier2-01.ochep.o       2500      191           200            0
gt2 uct2-grid6.mwt2.       2500     1153           200            0
gt2 uct3-edge7.uchic       2500        0           200            0

Page 20: Condor at Brookhaven

• Nonessential jobs
  – Condor assumes every job is important; it carefully holds and retries
    • A pile-up of held jobs often clogs Condor-G and prevents it from submitting new jobs
  – A new job attribute, Nonessential, is introduced (submit-file sketch below)
    • Nonessential jobs will be aborted instead of being put on hold
  – Well suited for "pilot" jobs
    • Pilots are the job sandbox, not the real job payload; pilots themselves are not as essential as real jobs
    • The job payload connects to the PanDA server through its own channel, so the PanDA server knows its status and can abort it directly if needed
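In submit-description terms, marking a pilot as nonessential might look like the sketch below; the exact spelling of the attribute in the submit language is an assumption here, and the gatekeeper is a placeholder.

# Sketch: a pilot job flagged as nonessential, so Condor-G aborts it
# on failure instead of putting it on hold (attribute spelling assumed).
universe       = grid
grid_resource  = gt2 gatekeeper.example.edu/jobmanager-condor
executable     = pilot.sh
nonessential   = true
queue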

Page 21: Condor at Brookhaven

• GRID_MONITOR_DISABLE_TIME
  – New configurable Condor-G parameter
    • Controls how long Condor-G waits, after a grid monitor failure, before submitting a new grid monitor job
  – The old default value of 60 minutes is too long
    • New job submission quite often pauses during the wait time, so submission cannot be sustained at a high rate
  – The new value is 5 minutes (config sketch below)
    • A much better submission rate is seen in production
  – Condor-G developers plan to trace the underlying grid monitor failures, in the Globus context
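On the submit host this is a one-line condor_config change; assuming the parameter is expressed in seconds, the 5-minute setting would look like:

# condor_config on the Condor-G submit host
# (sketch; assumes the value is in seconds).
GRID_MONITOR_DISABLE_TIME = 300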

Page 22: Condor at Brookhaven

• Separate throttle for limiting jobmanagers based on their role
  – Job submission won't compete with job stage-out/removal
• Globus bug fix
  – The GRAM client (inside the GAHP) stops receiving connections from remote jobmanagers for job status updates
  – We ran a cron job to periodically kill the GAHP server to clear up the connection issue, at the cost of a slower job submission rate (workaround sketch below)
  – The new Condor-G binary compiles against newer Globus libraries; so far so good, but we need more time to verify
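The cron workaround might have looked like the entry below; the GAHP binary name and the interval are assumptions, and it relies on Condor-G restarting the GAHP after it is killed.

# Hypothetical crontab entry on the submit host: kill the Globus GAHP
# every few hours to clear stuck GRAM status connections.
# Binary name and interval are assumptions.
0 */4 * * * /usr/bin/pkill -f gahp_server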

Page 23: Condor at Brookhaven

• Some best practices in Condor-G submission
  – Reduce the frequency of voms-proxy renewal on the submit host
    • Condor-G aggressively pushes out new proxies to all jobs
    • Frequent renewal of the voms-proxy on the submit hosts slows down job submission
  – Avoid hard-killing jobs (-forcex) from the client side (see the example below)
    • Reduces job debris on the remote gatekeepers
    • On the other hand, on the remote gatekeepers, we need to clean up debris more aggressively
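Concretely, a graceful condor_rm lets Condor-G clean the job up at the remote gatekeeper, while -forcex skips that step; the job ID below is just a placeholder.

$> condor_rm 1234.0            # graceful removal: remote cleanup happens
$> condor_rm -forcex 1234.0    # hard kill: skips remote cleanup, leaves debris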

Page 24: Condor at Brookhaven

Near-Term Plans

• Continue the good collaboration with the Condor team for better performance of Condor/Condor-G in our production environment.