
Page 1: Condor at Brookhaven

Condor at Brookhaven

Xin Zhao, Antonio Chan
Brookhaven National Lab

CondorWeek 2009
Tuesday, April 21

Page 2: Condor at Brookhaven

Outline

• RACF background
• RACF Condor batch system
• USATLAS grid job submission using Condor-G

Page 3: Condor at Brookhaven

RACF

• Brookhaven (BNL) is a multi-disciplinary DOE lab.
• The RHIC and ATLAS Computing Facility (RACF) provides computing support for BNL activities in HEP, NP, Astrophysics, etc.
  – RHIC Tier0
  – USATLAS Tier1
• Large installation – 7000+ CPUs, 5+ PB of storage, 6 robotic silos with a capacity of 49,000+ tapes.
• Storage and computing to grow by a factor of ~5 by 2012.

Page 4: Condor at Brookhaven

New Data Center Rising

• The new data center will increase floor space by a factor of ~2 in summer 2009.

Page 5: Condor at Brookhaven

BNL Condor Batch System

• Introduced in 2003 to replace LSF.
• Steep learning curve – much help from Condor staff.
• Extremely successful implementation.
• Complex use of job slots (formerly VMs) to determine job priority (queues), eviction, suspension, and back-filling policies (a minimal policy sketch follows).
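To make the slot-based policy idea concrete, here is a minimal condor_config sketch of the kind of per-slot expressions involved; the slot split, the JobType attribute, and the thresholds are hypothetical illustrations, not RACF's actual settings.

# Hypothetical condor_config fragment illustrating per-slot policy
# (illustration only, not the actual RACF configuration).
# "JobType" is an assumed custom job attribute.
IsShortJob = (TARGET.JobType =?= "short")

# Slots 1-2 take any job; remaining slots only back-fill short jobs.
START   = (SlotID <= 2) || $(IsShortJob)

# Suspend a back-filled short job while the machine load is high.
SUSPEND = $(IsShortJob) && (LoadAvg > 1.0)

# Evict a job that has been suspended for more than an hour.
PREEMPT = (Activity == "Suspended") && ((CurrentTime - EnteredCurrentActivity) > 3600)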

Page 6: Condor at Brookhaven

Condor Queues

• Originally designed for vertical scalability
  – Complex queue priority configuration per core
  – Maintainable on older hardware with fewer cores
• Changed to horizontal scalability in 2008
  – More and more multi-core hardware now
  – Simplified queue priority configuration per core
  – Reduces administrative overhead

Page 7: Condor at Brookhaven

Condor Policy for ATLAS (old)

Page 8: Condor at Brookhaven

ATLAS Condor configuration (old)

Page 9: Condor at Brookhaven

Condor Policy @ BNL

Page 10: Condor at Brookhaven

ATLAS Condor configuration (new)

Page 11: Condor at Brookhaven

Condor Queue Usage

Page 12: Condor at Brookhaven

Job Slot Occupancy (RACF)

• Left-hand plot is for 01/2007 to 06/2007.
• Right-hand plot is for 06/2007 to 05/2008.
• Occupancy remained at 94% between the two periods.

Page 13: Condor at Brookhaven

Job Statistics (2008)

• Condor usage by RHIC experiments increased by 50% (in number of jobs) and by 41% (in CPU time) since 2007.
• PHENIX executed ~50% of its jobs in the general queue.
• General queue jobs amounted to 37% of all RHIC Condor jobs during this period.
• General queue efficiency increased from 87% to 94% since 2007.

Page 14: Condor at Brookhaven

Near-Term Plans

• Continue integration of Condor with Xen virtual systems.

• OS upgrade to 64-bit SL5.x – any issues with Condor?

• Condor upgrade from 6.8.5 to the stable 7.2.x series.
• Short on manpower – open Condor admin position at BNL. If interested, please talk to Tony Chan.

Page 15: Condor at Brookhaven

Condor-G Grid job submission

• BNL, as USATLAS Tier1, provides support to the ATLAS PanDA production system.

PanDA Job Flow

Page 16: Condor at Brookhaven

• One critical service is maintaining PanDA autopilot submission using Condor-G (a sample submit file is sketched below)
  – Very large number (~15,000) of concurrent pilot jobs as a single user
  – Need to maintain a very high submission rate
• Autopilot attempts to always keep a set number of pending jobs in every queue at every remote USATLAS production site
  – Three Condor-G submit hosts in production
    • Quad-core Intel Xeon E5430 @ 2.66GHz, 16 GB memory, and two 750 GB SATA drives (mirrored disks)
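As a reference point, a minimal grid-universe submit description for a single pilot might look like the sketch below; the gatekeeper contact string and file names are placeholders, not the actual autopilot configuration.

# Hypothetical Condor-G submit file for a single pilot job
# (gatekeeper and file names are placeholders).
universe       = grid
grid_resource  = gt2 gatekeeper.example.edu/jobmanager-condor
executable     = pilot.sh
output         = pilot.$(Cluster).$(Process).out
error          = pilot.$(Cluster).$(Process).err
log            = pilot.log
queue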

Page 17: Condor at Brookhaven

• We work closely with the Condor team to tune Condor-G for better performance. Many improvements have been suggested and implemented by the Condor team.

Weekly OSG Gratia Job Count Report for USATLAS VO

Page 18: Condor at Brookhaven

New Features and Tuning of Condor-G submission

(not a complete list)

Page 19: Condor at Brookhaven

• The gridmanager publishes resource ClassAds to the collector, so users can easily query the grid job submission status for all remote resources:

$> condor_status -grid

Name                  Job Limit  Running  Submit Limit  In Progress
gt2 atlas.bu.edu:211       2500      376           200            0
gt2 gridgk04.racf.bn       2500        1           200            0
gt2 heroatlas.fas.ha       2500      100           200            0
gt2 osgserv01.slac.s       2500      611           200            0
gt2 osgx0.hep.uiuc.e       2500        5           200            0
gt2 tier2-01.ochep.o       2500      191           200            0
gt2 uct2-grid6.mwt2.       2500     1153           200            0
gt2 uct3-edge7.uchic       2500        0           200            0

Page 20: Condor at Brookhaven

• Nonessential jobs
  – Condor assumes every job is important; it carefully holds and retries
    • A pile-up of held jobs often clogs Condor-G and prevents it from submitting new jobs
  – A new job attribute, Nonessential, is introduced (submit-file sketch below)
    • Nonessential jobs will be aborted instead of being put on hold
  – Well suited for "pilot" jobs
    • Pilots are the job sandbox, not the real job payload; pilots themselves are not as essential as real jobs
    • The job payload connects to the PanDA server through its own channel, so the PanDA server knows its status and can abort it directly if needed
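In submit-description terms, marking a pilot as nonessential might look like the sketch below; the exact spelling of the attribute in the submit language is an assumption here, and the gatekeeper is a placeholder.

# Sketch: a pilot job flagged as nonessential, so Condor-G aborts it
# on failure instead of putting it on hold (attribute spelling assumed).
universe       = grid
grid_resource  = gt2 gatekeeper.example.edu/jobmanager-condor
executable     = pilot.sh
nonessential   = true
queue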

Page 21: Condor at Brookhaven

• GRID_MONITOR_DISABLE_TIME
  – New configurable Condor-G parameter
    • Controls how long Condor-G waits, after a grid monitor failure, before submitting a new grid monitor job
  – The old default value of 60 minutes is too long
    • New job submission quite often pauses during the wait time, so submission cannot be sustained at a high rate
  – The new value is 5 minutes (config sketch below)
    • A much better submission rate is seen in production
  – Condor-G developers plan to trace the underlying grid monitor failures, in the Globus context
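On the submit host this is a one-line condor_config change; assuming the parameter is expressed in seconds, the 5-minute setting would look like:

# condor_config on the Condor-G submit host
# (sketch; assumes the value is in seconds).
GRID_MONITOR_DISABLE_TIME = 300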

Page 22: Condor at Brookhaven

• Separate throttle for limiting jobmanagers based on their role
  – Job submission won't compete with job stage-out/removal
• Globus bug fix
  – The GRAM client (inside the GAHP) stops receiving connections from remote jobmanagers for job status updates
  – We ran a cron job to periodically kill the GAHP server to clear up the connection issue, at the cost of a slower job submission rate (workaround sketch below)
  – The new Condor-G binary compiles against newer Globus libraries; so far so good, but we need more time to verify
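The cron workaround might have looked like the entry below; the GAHP binary name and the interval are assumptions, and it relies on Condor-G restarting the GAHP after it is killed.

# Hypothetical crontab entry on the submit host: kill the Globus GAHP
# every few hours to clear stuck GRAM status connections.
# Binary name and interval are assumptions.
0 */4 * * * /usr/bin/pkill -f gahp_server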

Page 23: Condor at Brookhaven

• Some best practices in Condor-G submission
  – Reduce the frequency of voms-proxy renewal on the submit host
    • Condor-G aggressively pushes out new proxies to all jobs
    • Frequent renewal of the voms-proxy on the submit hosts slows down job submission
  – Avoid hard-killing jobs (-forcex) from the client side (see the example below)
    • Reduces job debris on the remote gatekeepers
    • On the other hand, on the remote gatekeepers, we need to clean up debris more aggressively
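Concretely, a graceful condor_rm lets Condor-G clean the job up at the remote gatekeeper, while -forcex skips that step; the job ID below is just a placeholder.

$> condor_rm 1234.0            # graceful removal: remote cleanup happens
$> condor_rm -forcex 1234.0    # hard kill: skips remote cleanup, leaves debris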

Page 24: Condor at Brookhaven

Near-Term Plans

• Continue the good collaboration with the Condor team for better performance of Condor/Condor-G in our production environment.