
Page 1

Batch Software at JLAB

Ian Bird, Jefferson Lab

CHEP2000, 7-11 February 2000

Page 2

Introduction

• Environment
  – Farms
  – Data flows
  – Software

• Batch systems
  – JLAB software
  – LSF vs. PBS

• Scheduler

• Tape software
  – File pre-staging/caching

Page 3

Environment

• Computing facilities were designed to:
  – Handle a data rate of close to 1 TB/day
  – 1st level reconstruction only (2 passes)
    • Match the average data rate
  – Some local analysis but mainly export of vastly reduced summary DSTs
• Originally estimated requirements:
  – ~ 1000 SI95
  – 3 TB online disk
  – 300 TB tape storage
  – 8 Redwood drives

Page 4

Environment - real

• After 1 year of production running of CLAS (largest experiment)
  – Detector is far cleaner than anticipated, which means:
    • Data volume is less, ~ 500 GB/day
    • Data rate is 2.5x anticipated (2.5 kHz)
    • Fraction of good events larger
    • DST sizes are same as raw data (!)
  – Per-event processing time is much longer than original estimates
  – Most analysis is done locally – no-one is really interested in huge data exports
• Other experiments also have large data rates (for short periods)

Page 5

Computing implications

• CPU requirement is far greater
  – Current farm is 2650 SI95 and will double this year
• Farm has a big mixture of work
  – Not all production – “small” analysis jobs too
  – We make heavy use of LSF hierarchical scheduling
• Data access demands are enormous
  – DSTs are huge, many people, frequent accesses
  – Analysis jobs want many files
• Tape access became a bottleneck
  – Farm can no longer be satisfied

Page 6

JLab Farm Layout

[Diagram – recoverable labels:]
• Farm systems: dual PII 300 MHz (qty. 10, 18 GB FWD disk), dual PII 400 MHz (qty. 20, 18 GB UWS disk), dual PII 450 MHz (qty. 20, 18 GB UWS disk); Plan - FY 2000 adds dual PIII 500 MHz (qty. 25) and dual PIII 650 MHz (qty. 25), 18 GB UWS disk
• Mass storage servers: quad SUN E4000 and quad SUN E3000, with STK Redwood tape drives and (Plan - FY 2000) STK 9840 tape drives; stage disks of 400 GB, 150 GB and 200 GB
• Work file servers: two MetaStor SH7400 file servers, 3 TB UWD work space each
• Cache file servers (Plan - FY 2000): dual Sun Ultra2 nodes, 400 GB UWD each (four shown)
• Interconnect: Gigabit Ethernet, Fast Ethernet, SCSI2 FWD and SCSI2 UWD/S links; Cisco Cat 5500 and Cisco 2900 switches

Page 7

Other farms

• Batch farm – 180 nodes -> 250

• Lattice QCD
  – 20-node Alpha (Linux) cluster
  – Parallel application development
  – Plans (proposal) for a large 256-node cluster
    • Part of a larger collaboration
    • Group wants a “meta-facility”
      – Jobs run on the least-loaded cluster (wide-area scheduling)

Page 8

Additional requirements

• Ability to handle and schedule parallel jobs (MPI)

• Allow collaborators to “clone” the batch systems and software
  – Allow inter-site job submission
  – LQCD is particularly interested in this

• Remote data access

Page 9

Components

• Batch software
  – Interface to the underlying batch system
• Tape software
  – Interface to OSM, overcome limitations
• Data caching strategies
  – Tape staging
  – Data caching
  – File servers

Page 10

Batch software

• A layer over the batch management system
  – Allows replacement of the batch system: LSF, PBS (DQS)
  – Constant user interface no matter what the underlying system is
  – Batch farm can be managed by the management system (e.g. LSF)
  – Build in a security infrastructure (e.g. GSI)
    • Particularly to allow remote access securely

Page 11

Batch system - schematic

[Diagram showing the components of the system: user processes (submission, query, statistics), a submission interface and a query interface, a database, the job submission system, the batch control system (LSF, PBS, DQS, etc.), and the batch processors. A minimal code sketch of this layer follows.]
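The sketch below is a minimal illustration of this layer, not the actual JLab implementation: a single submission interface that delegates to whichever batch manager is configured underneath. All class names (BatchBackend, LsfBackend, PbsBackend, JobServer) and the job-specification format are assumptions; the database recording and query interface shown in the schematic are omitted.

```java
import java.util.HashMap;
import java.util.Map;

interface BatchBackend {                        // one implementation per batch manager
    String submit(Map<String, String> jobSpec); // returns the backend's job id
    String status(String jobId);
}

class LsfBackend implements BatchBackend {
    public String submit(Map<String, String> jobSpec) {
        // In the real layer this would build and run an LSF 'bsub' command.
        return "lsf-" + jobSpec.get("name");
    }
    public String status(String jobId) { return "PEND"; }
}

class PbsBackend implements BatchBackend {
    public String submit(Map<String, String> jobSpec) {
        // In the real layer this would build and run a PBS 'qsub' command.
        return "pbs-" + jobSpec.get("name");
    }
    public String status(String jobId) { return "Q"; }
}

public class JobServer {
    private final BatchBackend backend;   // chosen by site configuration, not by the user

    JobServer(BatchBackend backend) { this.backend = backend; }

    String submit(String name, String queue) {
        Map<String, String> spec = new HashMap<>();
        spec.put("name", name);
        spec.put("queue", queue);
        return backend.submit(spec);      // identical user interface for LSF or PBS
    }

    public static void main(String[] args) {
        JobServer server = new JobServer(new LsfBackend());
        System.out.println(server.submit("clas-pass1", "production"));
    }
}
```

Swapping LsfBackend for PbsBackend changes nothing on the user side, which is the point of the layer.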

Page 12

Existing batch software

• Has been running for 2 years
  – Uses LSF
  – Multiple jobs – parameterized jobs (LSF now has job arrays; PBS does not – see the sketch below)
  – Client is trivial to install on any machine with a JRE – no need to install LSF, PBS, etc.
    • Eases licensing issues
    • Simple software distribution
    • Remote access
  – Standardized statistics and bookkeeping outside of LSF
    • MySQL based
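As a rough sketch of the parameterized-job idea (not the actual JLab client), the expansion can be done client-side, so one logical job over many input files becomes many backend submissions even on a system without job arrays. The class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ParameterizedJob {
    /** Expand one logical job over N input files into N individual submissions. */
    static List<String> expand(String baseName, List<String> inputFiles,
                               Function<String, String> submit) {
        List<String> jobIds = new ArrayList<>();
        int index = 0;
        for (String file : inputFiles) {
            // Each instance gets a distinct name; the input file would travel in
            // the job specification in the real layer.
            String jobName = baseName + "." + index++ + ":" + file;
            jobIds.add(submit.apply(jobName));   // ids recorded in the bookkeeping database
        }
        return jobIds;
    }

    public static void main(String[] args) {
        List<String> ids = expand("clas-pass1",
                List.of("run1001.dat", "run1002.dat"),
                jobName -> "lsf-" + jobName);    // stand-in for the real submit call
        System.out.println(ids);
    }
}
```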

Page 13

Existing software cont.

• Farm can be managed by LSF
  – Queues, hosts, scheduler, etc.
• Rewrite in progress to:
  – Add a PBS interface (and DQS?)
  – Security infrastructure to permit authenticated remote access
  – Clean up

Page 14

PBS as alternative to LSF

• PBS (Portable Batch System – NASA)
  – Actively developed
  – Open, freely available
  – Handles MPI (PVM)
  – User interface very familiar to NQS/DQS users
  – Problem (for us) was the lack of a good scheduler
    • PBS provides only a trivial scheduler, but
    • Provides a mechanism to plug in another
    • We were using hierarchical scheduling in LSF

Page 15

PBS scheduler

• Multiple stages (6), which can be used or not as required, in arbitrary order (sketched below)
  – Match making – matches requirements to system resources
  – System priority (e.g. data available)
  – Queue selection (which queue runs next)
  – User priority
  – User share: which user runs next, based on user and group allocations and usage
  – Job age
• The scheduler has been provided to the PBS developers for comments – and is under test
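To illustrate the staged approach (the real scheduler plugs into PBS's scheduler interface and is not reproduced here), the sketch below composes a job priority from independent, configurable stages; it is written in Java only for consistency with the other sketches, and all type names are hypothetical. Because the stages are independent, a site can enable only the ones it needs and tune their relative weights.

```java
import java.util.Comparator;
import java.util.List;

/** A pending job as the scheduler sees it (hypothetical fields). */
record Job(String id, String user, boolean dataStaged, long submitTimeSec) {}

/** One scheduling stage; each configured stage contributes to a job's priority. */
interface SchedulerStage {
    double score(Job job);
}

/** System-priority stage: favour jobs whose input data is already staged. */
class DataAvailableStage implements SchedulerStage {
    public double score(Job job) { return job.dataStaged() ? 1000.0 : 0.0; }
}

/** Job-age stage: favour older jobs, with a cap on the ageing bonus. */
class JobAgeStage implements SchedulerStage {
    public double score(Job job) {
        long ageSec = System.currentTimeMillis() / 1000 - job.submitTimeSec();
        return Math.min(ageSec / 3600.0, 100.0);
    }
}

public class StagedScheduler {
    private final List<SchedulerStage> stages;   // the subset and order are a site choice

    StagedScheduler(List<SchedulerStage> stages) { this.stages = stages; }

    double priority(Job job) {
        return stages.stream().mapToDouble(s -> s.score(job)).sum();
    }

    Job selectNext(List<Job> pending) {
        // A user-share stage would use job.user() against group allocations; omitted here.
        return pending.stream()
                .max(Comparator.comparingDouble(this::priority))
                .orElse(null);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000;
        StagedScheduler sched = new StagedScheduler(
                List.of(new DataAvailableStage(), new JobAgeStage()));
        List<Job> pending = List.of(
                new Job("j1", "alice", false, now - 7200),   // older, data not yet staged
                new Job("j2", "bob", true, now - 600));      // newer, data already staged
        System.out.println(sched.selectNext(pending).id());  // prints j2
    }
}
```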

Page 16

Mass storage

• Silo – 300 TB Redwood capacity
  – 8 Redwood drives
  – 5 (+5) 9840 drives
  – Managed by OSM
• Bottleneck:
  – Limited to a single data mover
  – That node has no capacity for more drives
• 1 TB tape staging RAID disk
• 5 TB of NFS work areas/caching space

Page 17

Solving tape access problems

• Add new drives – 9840s
  – Requires a 2nd OSM instance
    • Transparent to the user
• Eventual replacement of OSM
  – Transparent to the user
• File pre-staging to the farm
• Distributed data caching (not NFS)
• Tools to allow user optimization
• Charge for (prioritize) mounts

Page 18

OSM

• OSM has several limitations (and is no longer supported)
  – Single mover node is the most serious
• No replacement possible yet
• Local tapeserver software solves many of these problems for us
  – Simple remote clients (Java based) – do not need OSM except on the server

Page 19

Tape access software

• Simple put/get interface
  – Handles multiple files, directories, etc.
• Can have several OSM instances, but a unique file catalog, transparent to the user
  – System fails over between servers (sketched below)
    • Only way to bring the 9840s on line
• Data transfer is a network (socket) copy in Java
• Allows a scheduling/user allocation algorithm to be added to tape access
• Will permit “transparent” replacement of OSM
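A minimal Java sketch of the put/get idea with failover between tape server instances follows. The names (TapeServer, TapeMover, StubMover) are hypothetical and the catalogue and transfer details are elided; the real client performs a network (socket) copy and needs OSM only on the server side.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

/** One tape mover / OSM instance (hypothetical interface). */
interface TapeMover {
    void get(String tapeFile, String localPath) throws IOException;
    void put(String localPath, String tapeFile) throws IOException;
    boolean isUp();
}

public class TapeServer {
    private final Map<String, String> catalogue;   // logical file name -> tape file
    private final List<TapeMover> movers;          // e.g. Redwood instance, 9840 instance

    TapeServer(Map<String, String> catalogue, List<TapeMover> movers) {
        this.catalogue = catalogue;
        this.movers = movers;
    }

    /** Fetch a file by its catalogue name, failing over between server instances. */
    public void get(String logicalName, String localPath) throws IOException {
        String tapeFile = catalogue.get(logicalName);
        if (tapeFile == null) throw new IOException("not in catalogue: " + logicalName);
        IOException last = null;
        for (TapeMover mover : movers) {
            if (!mover.isUp()) continue;
            try {
                mover.get(tapeFile, localPath);    // network (socket) copy
                return;
            } catch (IOException e) {
                last = e;                          // try the next instance
            }
        }
        throw last != null ? last : new IOException("no tape server available");
    }

    /** Trivial stand-in mover for the demo below; real movers talk to OSM. */
    static class StubMover implements TapeMover {
        private final boolean up;
        StubMover(boolean up) { this.up = up; }
        public void get(String tapeFile, String localPath) {
            System.out.println("copied " + tapeFile + " -> " + localPath);
        }
        public void put(String localPath, String tapeFile) { }
        public boolean isUp() { return up; }
    }

    public static void main(String[] args) throws IOException {
        TapeServer ts = new TapeServer(
                Map.of("run1001.dat", "VOL042/run1001.dat"),
                List.of(new StubMover(false), new StubMover(true)));  // first instance down
        ts.get("run1001.dat", "/scratch/run1001.dat");                // served by the second
    }
}
```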

Page 20

Data pre-fetching & caching

• Currently
  – Tape → stage disk → network copy to farm node local disk
  – Tape → stage disk → NFS cache → farm
    • But this can cause NFS server problems
• Plan:
  – Dual Solaris nodes with
    • ~ 350 GB disk (RAID 0)
    • Gigabit Ethernet
    • Provides a large cache for farm input
  – Stage out entire tapes to cache (see the sketch below)
    • Cheaper than staging space, better performance than NFS
    • Scalable as the farm grows
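As a small illustration of the "stage out entire tapes" plan (assumed names, not the production software), keying the cache by tape volume means one mount can serve every job that needs any file on that volume, and farm nodes check the cache before requesting a tape.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TapeCache {
    private final Map<String, Set<String>> cachedTapes = new HashMap<>(); // volume -> files

    /** Called by the pre-stager: record that every file on the volume is on cache disk. */
    void stageTape(String volume, Set<String> filesOnTape) {
        // The real system would drive the tape server here and copy to the cache node.
        cachedTapes.put(volume, new HashSet<>(filesOnTape));
    }

    /** Farm nodes ask the cache first; a miss falls back to a tape request. */
    boolean has(String volume, String file) {
        return cachedTapes.getOrDefault(volume, Set.of()).contains(file);
    }

    public static void main(String[] args) {
        TapeCache cache = new TapeCache();
        cache.stageTape("VOL042", Set.of("run1001.dat", "run1002.dat"));
        System.out.println(cache.has("VOL042", "run1002.dat"));  // true: no second mount
    }
}
```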

Page 21

JLab Farm Layout

[Diagram repeated from Page 6.]

Page 22

File pre-staging

• Scheduling for pre-staging is done by the job server software (sketched below)
  – Splits/groups jobs by tape (could be done by the user)
  – Makes a single tape request
  – Holds jobs while files are staged
  – Implemented by batch jobs that release the held jobs
  – Released jobs with data available get high priority
  – Reduces job slots blocked by jobs waiting for data
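The grouping step can be sketched as follows (hypothetical types, not the actual job server code): held jobs are bucketed by the tape volume holding their input, one staging request is issued per tape, and the held jobs are released at high priority once staging completes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A held farm job and the tape volume holding its input (hypothetical fields). */
record FarmJob(String jobId, String inputFile, String tapeVolume) {}

public class PreStager {
    /** Group held jobs by tape so each tape is mounted once. */
    static Map<String, List<FarmJob>> groupByTape(List<FarmJob> heldJobs) {
        Map<String, List<FarmJob>> byTape = new HashMap<>();
        for (FarmJob job : heldJobs) {
            byTape.computeIfAbsent(job.tapeVolume(), v -> new ArrayList<>()).add(job);
        }
        return byTape;
    }

    public static void main(String[] args) {
        List<FarmJob> held = List.of(
                new FarmJob("j1", "run1001.dat", "VOL042"),
                new FarmJob("j2", "run1002.dat", "VOL042"),
                new FarmJob("j3", "run2001.dat", "VOL077"));

        groupByTape(held).forEach((tape, jobs) -> {
            // One staging request per tape; in the real system this is itself a batch
            // job that, on completion, releases the held jobs at high priority.
            System.out.println("stage " + tape + ", then release " +
                    jobs.stream().map(FarmJob::jobId).toList());
        });
    }
}
```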

Page 23

Conclusions

• PBS is a sophisticated and viable alternative to LSF

• Interface layer permits
  – Use of the same jobs on different systems
  – User migration
  – Adding features to the batch system