

RAL Tier1/A Report

HepSysMan - July 2004

Martin Bly / Andrew Sansum


Overview

• Hardware

• Network

• Experiences / Challenges

• Management issues


Tier1 in GRIDPP2 (2004-2007)

• The Tier-1 Centre will provide GRIDPP2 with a large computing resource of a scale and quality that can be categorised as an LCG Regional Computing Centre

• January 2004 – GRIDPP2 confirmed RAL as host for the Tier1 Service
  – GRIDPP2 to commence September 2004

• Tier1 hardware budget: £2.3M over 3 years

• Staff: increase from 12.1 to 16.5 by September


Current Tier1 Hardware

• CPU
  – 350 dual-processor Intel PIII and Xeon servers, mainly rack mounted
  – About 400 KSI2K
  – RedHat 7.3
  – P2/450 tower units decommissioned April 04
  – RH72 and Solaris batch services to be phased out this year

• Disk service – mainly “standard” configuration
  – Dual-processor server
  – Dual-channel SCSI interconnect
  – External IDE/SCSI RAID arrays (Accusys and Infortrend)
  – ATA drives (mainly Maxtor)
  – About 80TB disk
  – Cheap and (fairly) cheerful

• Tape service
  – STK Powderhorn 9310 silo with 8 x 9940B drives


New Hardware

• 256 x dual Xeon HT 2.8GHz (533MHz FSB): 8 racks
  – 2GB RAM (32 nodes with 4GB RAM), 120GB HDD, 1Gb NIC

• 20 disk servers, each with two 4TB IDE/SCSI arrays: 5 racks
  – Infortrend EonStore A16U-G1A units, each with 16 x WD 250GB SATA HDD – 4TB/array raw capacity
  – Servers: dual Xeon HT 2.8GHz (533MHz FSB), 2GB RAM, dual 120GB SATA system disks, dual 1Gb/s NIC
  – 160TB raw, ~140TB available (RAID5) – see the arithmetic below

• Delivered June 15th, now running commissioning tests
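For reference, the raw capacity follows directly from the drive counts above:

  20 servers x 2 arrays/server x 16 drives/array x 250 GB/drive = 160 TB raw

Assuming one drive's worth of parity per 16-drive RAID 5 set (an assumption about the array layout, not stated on the slide), roughly 10 TB goes to parity, with filesystem and formatting overheads accounting for most of the remaining gap down to the ~140 TB quoted as available.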


Next Procurement

• Need in production by January 2005
  – Original schedule of December delivery seems late
  – Will have to start very soon
  – Less chance for testing / new technology

• Exact proportions not agreed, but …
  – 400 KSI2K (300-400 CPUs)
  – 160TB disk
  – 120TB tape??
  – Network infrastructure?
  – Core servers (H/A??)
  – RedHat?

• Long range plan needs reviewing – also need long range experiment requirements so as to plan environment updates


CPU Capacity

[Chart: Tier1 CPU capacity in SpecInt2000 and number of CPUs, 2002-2007]


Tier1 Disk Capacity (TB)

[Chart: Tier1 disk and tape capacity in TB, 2002-2007]


High Impact Systems

• Looking at replacement hardware for high impact systems:
  – /home/csf, /rutherford file systems
  – MySQL servers
  – AFS cell
  – Front end / UI hosts
  – Data movers
  – NIS master, mail server

• Replacing mix of Solaris, Tru64 Unix and AIX servers with Linux – consolidation of expertise

• Migrate AFS to OpenAFS and then K5


Network

[Diagram: current network – site router and firewall connecting to SuperJanet and the rest of site; production subnet and test subnet (servers and workers) on separate production and test VLANs within the site routable network; test network (e.g. MBNG) attached to a server]


Network

[Diagram: Tier1 network layout – site router and firewall connecting to SuperJanet and the rest of site; dedicated Tier1 network carrying the production and test VLANs (servers and workers); test network (e.g. MBNG) attached to a server]


UKlight

• Connection to RAL in September

• Funded to end 2005, after which it probably merges with SuperJanet 5

• 2.5Gb/s now, 10Gb/s from 2006

• Effectively a dedicated light path to CERN

• Probably not for Tier1 production, but suitable for LCG Data Challenges etc, building experience for the SuperJanet upgrade

• UKLight -> Starlight


Forthcoming Challenges

• Simplify service – less “duplication”
• Improve storage management
• Deploy new fabric management
• RedHat Enterprise 3 upgrade
• Network upgrade/reconfigure????
• Another procurement/install
• Meet challenge of LCG – professionalism
• LCG Data Challenges
• …


Clean up Spaghetti Diagram

• How to phase out “Classic” service ..

• Simplify interfaces: fewer Grids – “More is not always better”


Storage: Plus and Minus

• ATA and SATA drives – 2.5% failure per annum: OK
• External RAID arrays – good architecture, choose well
• SCSI interconnect – surprisingly unreliable: change
• Ext2 file system – OK, but need a journal: XFS?
• Linux O/S – move to Enterprise 3
• NFS/Xrootd/http/gridftp/bbftp/srb/… – must have SRM
• No SAN – need SAN (Fibre or iSCSI …)
• No management layer – need virtualisation/dCache…
• No HSM – ????


Benchmarking

• Work by George Prassas on various systems, including a 3ware/SATA RAID5 system

• Tuning gains extra performance on RH variants

• Performance of RHEL3 NFS servers and disk I/O not special despite tuning, compared with RH73

• Considering buying the SPEC suite to benchmark everything


Fabric Management

• Currently run:
  – Kickstart – cascading config files, implementing PXE (see the sketch below)
  – SURE exception monitoring
  – Automate – automatic interventions

• Running out of steam with old systems …
  – “Only” 800 systems – but many, many flavours
  – Evaluating Quattor – no obvious alternatives – probably deploy
  – Less convinced by Lemon – a bit early – running Nagios in parallel
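Purely by way of illustration, “cascading” configuration can be thought of as merging site-wide, cluster and per-node fragments, most specific last. The directory layout, fragment names and override convention below are assumptions for the sketch, not the Tier1 kickstart implementation:

#!/usr/bin/env python
# Illustrative only: merge site -> cluster -> node config fragments into one
# package list for a kickstart %post step.  Paths are hypothetical.
import os

FRAGMENT_DIRS = ["site", "cluster/batch", "nodes/lcg0123"]  # most specific last

def read_fragment(path):
    """Return the package names listed in a fragment file, one per line."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.startswith("#")]

def cascade(filename="packages.cfg"):
    """Later (more specific) fragments extend or override earlier ones."""
    packages = []
    for d in FRAGMENT_DIRS:
        for pkg in read_fragment(os.path.join(d, filename)):
            if pkg.startswith("-"):          # a leading '-' drops a package
                packages = [p for p in packages if p != pkg[1:]]
            elif pkg not in packages:
                packages.append(pkg)
    return packages

if __name__ == "__main__":
    print("\n".join(cascade()))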


Yum / Yumit

• Kickstart scripts now use Yum to bootstrap systems to latest updates

• Post-install config now uses Yum wherever possible for local additions

• Yumit (see the sketch below):
  – Nodes use Yum to check their status every night and report to a central database
  – Web interface to show farm status
  – Easy to see which nodes need updating

• Machine ownership tagging, port monitoring project
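A minimal sketch of the kind of nightly check described above. This is not the Yumit code itself: the record fields and the print-instead-of-database reporting are assumptions for illustration; in the real service each node's result goes to the central database behind the web interface.

#!/usr/bin/env python
# Illustrative sketch only, not the actual Yumit client.  Runs
# 'yum check-update', counts pending package updates and builds the kind
# of record a node might send to a central database each night.
import datetime
import socket
import subprocess

def pending_updates():
    # 'yum check-update' exits 100 when updates are pending, 0 when none.
    proc = subprocess.run(["yum", "-q", "check-update"],
                          capture_output=True, text=True)
    if proc.returncode != 100:
        return []
    # Rough parse: update lines look like "<package> <version> <repo>".
    return [line.split()[0]
            for line in proc.stdout.splitlines()
            if len(line.split()) == 3]

def nightly_report():
    packages = pending_updates()
    record = {
        "node": socket.gethostname(),
        "checked_at": datetime.datetime.utcnow().isoformat(),
        "pending": len(packages),
        "packages": " ".join(sorted(packages)),
    }
    # The real service inserts a row into the central database read by the
    # web interface; here we simply print the record.
    print(record)

if __name__ == "__main__":
    nightly_report()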


Futures

• Storage architectures
  – iSCSI, Fibre, dCache
  – Need to be more sophisticated to allow reallocation of available space

• CPUs
  – Xeon, Opteron, Itanium, Intel 64-bit x86 architecture

• Network
  – Higher speed interconnect, iSCSI


Conclusions

• After several years of relative stability, must start re-engineering many Tier1 components

• Must start to rationalise – support a limited set of interfaces, operating systems, testbeds … simplify so we can do less, better

• LCG becoming a big driver
  – Service commitments
  – Increase resilience and availability
  – Data challenges and move to steady state

• Major reality check in 2007!