Page 1: RAL Tier 1/A Status

Martin Bly

RAL CSF Tier 1/A

RAL Tier 1/A Status

HEPiX-HEPNT

NIKHEF, May 2003

Page 2: RAL Tier 1/A Status

CPU Farm – Existing Hardware

• 108 dual processor systems (450MHz, 600MHz and 1GHz)

– Up to 1GB RAM

– Desktop towers on warehouse shelves

• 156 dual processor 1400MHz PIII

– 133MHz FSB, 1GB RAM each

– 1U rackmount, remote power switching

– RedHat 7.2

Page 3: RAL Tier 1/A Status

New Hardware – Spring 2003 +

• 80 dual processor 1U rackmount units

– 2 x 2.66GHz P4 Xeons @ 533MHz FSB

– Hyper-threading

– 2048MB memory

– 2 x 1Gb/s NICs (o/b)

– RedHat 7.3

– 3 racks, remote power switching

• Next delivery expected Summer 2003

Page 4: RAL Tier 1/A Status

Operating Systems

• Operating Systems:

– RedHat 6.2 service will close at the end of May

– RedHat 7.2 service has been in production for BaBar for 6 months

– New RedHat 7.3 service now available for LHC/other experiments

– Testing/benchmarking on new Xeon systems

• Increasing demands for security updates becoming problematic.

Page 5: RAL Tier 1/A Status

Disk Farm – Existing Hardware

• 2002: 26 servers, each with 2 external RAID arrays, 1.7TB disk per server, RAID 5:

– Excellent performance, well balanced system

– Problems with a bad batch of Maxtor drives: many failures and a high error rate. All 620 drives have now been replaced by Maxtor.

– Still outstanding problems with Accusys controller failing to eject bad drives from RAID set.

Page 6: RAL Tier 1/A Status

Disk Farm – Spring 2003 +

• Recent upgrade to disk farm:

– 11 dual P4 Xeon servers (2.4GHz, 1024MB RAM, PCI-X), each with 2 Infortrend IFT-6300 arrays via Ultra160 SCSI

– 12 Maxtor 200GB DiamondMax Plus 9 drives per array, RAID 5.

• Not yet in production, but a few snags:

– Originally tendered Maxtor Maxline Plus II drive was found not to exist!

– Infortrend array has 2TB limit per RAID set; pushing for a firmware update.

– 11+1 spare better than 2 x 6 – 5Gb over 11 systems.

• Nick White ([email protected]) for more info.
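The capacity arithmetic behind the 2TB snag can be sketched as follows. This is an illustration under stated assumptions, not the site's own calculation: it assumes the "2TB limit" is the common 2 TiB ceiling (2**32 sectors of 512 bytes) and that drive sizes are decimal gigabytes. The function and layout names are made up for the example.

```python
# Figures taken from the slide.
DRIVE_GB = 200          # decimal gigabytes per drive
DRIVES_PER_ARRAY = 12

# Assumption: the per-RAID-set limit is 2 TiB (2**32 sectors x 512 bytes).
LIMIT_BYTES = 2**32 * 512


def raid5_usable_bytes(n_drives, drive_gb=DRIVE_GB):
    """Usable bytes in one RAID 5 set: one drive's worth is used for parity."""
    if n_drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (n_drives - 1) * drive_gb * 10**9


def summarise(set_sizes):
    """Return (total usable bytes, True if every set fits under the limit)."""
    per_set = [raid5_usable_bytes(n) for n in set_sizes]
    return sum(per_set), all(size <= LIMIT_BYTES for size in per_set)


# Hypothetical ways to lay out the 12 drives in one array.
layouts = {
    "one 12-drive set": [12],
    "11 drives + 1 hot spare": [11],
    "2 x 6 drives": [6, 6],
}

for name, sizes in layouts.items():
    total, fits = summarise(sizes)
    print(f"{name}: {total / 10**9:.0f} GB usable, within limit: {fits}")
```

Under these assumptions a single 12-drive set comes out slightly over the limit, while 11+1 spare and 2 x 6 give the same usable space, with the first layout also keeping a hot spare.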

Page 7: RAL Tier 1/A Status

New Projects

• Basic fabric performance monitoring (ganglia)

• Resource CPU accounting (based on PBS accounts/mysql)

• New CA in production

• New batch scheduler (MAUI)

• Deploy new helpdesk (May)

Page 8: RAL Tier 1/A Status

Ganglia

• Urgently needed live performance and utilisation monitoring:

– RAL Ganglia Monitoring: http://ganglia.gridpp.rl.ac.uk/

• Scalable solution based on multicast

• Very rapidly deployable, reasonable support on all Tier1A hardware

• See: http://ganglia.sourceforge.net/
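As an illustration of how the monitoring data might be consumed, here is a minimal sketch that parses the kind of XML a Ganglia gmond daemon reports and derives load per CPU for each host. The sample document, host names and the `load_per_cpu` helper are made up for this example, and the element and attribute names (HOST, METRIC, NAME, VAL) are assumed to match the gmond output of the deployed version.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of gmond's XML report.
# Host names and values are invented for illustration.
SAMPLE = """<GANGLIA_XML VERSION="2.5.3" SOURCE="gmond">
 <CLUSTER NAME="example-farm" LOCALTIME="1052000000">
  <HOST NAME="node001" IP="10.0.0.1" REPORTED="1052000000">
   <METRIC NAME="load_one" VAL="1.95" TYPE="float" UNITS=""/>
   <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
  </HOST>
  <HOST NAME="node002" IP="10.0.0.2" REPORTED="1052000000">
   <METRIC NAME="load_one" VAL="0.10" TYPE="float" UNITS=""/>
   <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
  </HOST>
 </CLUSTER>
</GANGLIA_XML>"""


def load_per_cpu(xml_text):
    """Map host name to one-minute load divided by CPU count."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        # Skip hosts that did not report both metrics.
        if "load_one" in metrics and "cpu_num" in metrics:
            cpus = int(metrics["cpu_num"])
            if cpus > 0:
                result[host.get("NAME")] = float(metrics["load_one"]) / cpus
    return result


for name, load in sorted(load_per_cpu(SAMPLE).items()):
    print(f"{name}: {load:.2f} load per CPU")
```

On a live system the XML would come from the daemon rather than a string constant; that part is left out here because the connection details are site-specific.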

Page 9: RAL Tier 1/A Status

PBS Accounting Software

• Need to keep track of system CPU and disk usage.

• Home grown PBS accounting package (Derek Ross):

– Upload PBS and disk stats into MySQL

– Process with Perl DBI script

– Serve via Apache

• http://www.gridpp.rl.ac.uk/stats

• Contact Derek ([email protected]) for more info.
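The processing step can be sketched roughly as below. This is not the RAL package (which uses Perl DBI and MySQL). It is a Python stand-in that assumes the usual PBS accounting record layout of `date;type;jobid;key=value ...`, with CPU time in `resources_used.cput` on job-end ("E") records. The sample log, the group names and the function names are invented for illustration.

```python
from collections import defaultdict

# Invented sample records in the assumed PBS accounting format.
SAMPLE_LOG = """\
05/01/2003 10:00:00;S;101.pbs.example;user=alice group=babar queue=prod
05/01/2003 12:00:00;E;101.pbs.example;user=alice group=babar queue=prod resources_used.cput=01:30:00 resources_used.walltime=02:00:00
05/01/2003 13:00:00;E;102.pbs.example;user=bob group=lhcb queue=prod resources_used.cput=00:45:00 resources_used.walltime=01:00:00
05/01/2003 14:00:00;E;103.pbs.example;user=carol group=babar queue=prod resources_used.cput=10:00:00 resources_used.walltime=10:30:00
"""


def hms_to_seconds(text):
    """Convert an HH:MM:SS duration to seconds (hours may exceed 24)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_seconds_by_group(log_text):
    """Sum CPU seconds per group over job-end ("E") records only."""
    totals = defaultdict(int)
    for line in log_text.splitlines():
        parts = line.split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            continue  # skip start records and malformed lines
        fields = dict(
            item.split("=", 1) for item in parts[3].split() if "=" in item
        )
        cput = fields.get("resources_used.cput")
        if cput is None:
            continue
        totals[fields.get("group", "unknown")] += hms_to_seconds(cput)
    return dict(totals)


for group, seconds in sorted(cpu_seconds_by_group(SAMPLE_LOG).items()):
    print(f"{group}: {seconds / 3600:.2f} CPU hours")
```

In a real deployment the totals would be written to a database table and served from there; the in-memory dictionary keeps the sketch self-contained.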

Page 10: RAL Tier 1/A Status

MAUI / PBS

• Maui scheduler has been in production for last 4 months.

• Allows extremely flexible scheduling with many features. But…

– Not all of it works; we have done much work with developers for fixes.

– Major problem: MAUI schedules on wall clock time, not CPU time. Had to bodge it!!
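To make the wall clock versus CPU time distinction concrete, here is a small sketch (unrelated to the actual workaround, which the slides do not describe). It times two made-up jobs with Python's standard clocks: one that mostly waits and one that computes.

```python
import time


def measure(func):
    """Run func once and return (wall-clock seconds, CPU seconds) it used."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def idle_job():
    """Stands in for a job blocked on I/O: wall time passes, little CPU is used."""
    time.sleep(0.2)


def busy_job():
    """Stands in for a compute-bound job: CPU time tracks wall time closely."""
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


for job in (idle_job, busy_job):
    wall, cpu = measure(job)
    print(f"{job.__name__}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

A scheduler that limits or charges by wall clock time treats both jobs by elapsed time alone, which penalises jobs that spend much of their run waiting.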

Page 11: RAL Tier 1/A Status

New Helpdesk Software

• Old helpdesk email based/unfriendly.

• With additional staff, urgently need to deploy new solution.

• Expect new system to be based on free software, probably Request Tracker

• Hope that deployed system will also meet needs of Testbed and may also satisfy Tier 2 sites.

• Expect deployment by end of May.

• http://requestracker.gridpp.rl.ac.uk

Page 12: RAL Tier 1/A Status

Outstanding issues / worries

• We have to run many distinct services.

– Fermi Linux

– RH 6.2/7.2/7.3…

– EDG testbeds, LCG …

• Farm management is getting very complex. We need better tools and automation.

• Security is becoming a big concern again.

