CASPUR Site Report

Andrei Maslennikov Group Leader - Systems

RAL, April 1999

A.Maslennikov - HEPiX - RAL 99 2

To be covered briefly:

• Central computers
• Other nodes
• Network
• Distributed storage
• Tape-related systems
• CASPUR and HEP
• Gentes/Ateneo project
• Short-term plans

Central computers

• Alpha SMP Cluster 4100 - 28 processors - DU 4.0d
  - interactive (front-end): 1 x 400 MHz / 2 GB
  - parallel batch (LSF): 4 x 400 MHz / 1 GB + 2 x 600 MHz / 2 GB
  - 1999: 20 more EV6 processors (or an upgrade to EV6), 32-proc “wildfire”?

• Sun SMP - 22 processors - Solaris 2.6
  - interactive + parallel batch (LSF): 1 x 3500 / 336 MHz / 2 GB (8 processors)
  - parallel batch (LSF): 1 x 4500 / 336 MHz / 3.6 GB (14 processors)
  - 1999: waiting for new SMP models

• IBM SP2 - 32 processors - AIX 4.3.2++/PSSP 2.4++
  - interactive: 4 thin nodes (390)
  - serial batch (LSF): 12 thin nodes
  - parallel batch + interactive (EASY): 16 thin nodes
  - 1999: waiting for SP3 offer (need SMP nodes with 4-16 processors)
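
The processor totals quoted for the three central systems can be cross-checked against the per-partition figures. The following is only an illustrative Python sketch: the slide does not state CPUs per node for the Alpha or SP2 systems, so 4 CPUs per AlphaServer 4100 node and 1 CPU per SP2 thin node are assumptions, and the names `PARTITIONS` and `total_cpus` are hypothetical.

```python
# Illustrative tally of the central-computer partitions listed on this slide.
# ASSUMPTIONS (not stated on the slide): each AlphaServer 4100 node has 4 CPUs
# and each SP2 thin node has 1 CPU. The Sun CPU counts are taken from the slide.

# Each entry: (role, number of nodes, CPUs per node)
PARTITIONS = {
    "Alpha SMP Cluster 4100": [
        ("interactive (front-end)", 1, 4),
        ("parallel batch (LSF), 400 MHz", 4, 4),
        ("parallel batch (LSF), 600 MHz", 2, 4),
    ],
    "Sun SMP": [
        ("interactive + parallel batch (LSF)", 1, 8),
        ("parallel batch (LSF)", 1, 14),
    ],
    "IBM SP2": [
        ("interactive", 4, 1),
        ("serial batch (LSF)", 12, 1),
        ("parallel batch + interactive (EASY)", 16, 1),
    ],
}


def total_cpus(system):
    """Sum nodes * CPUs-per-node over all partitions of one system."""
    return sum(nodes * cpus for _role, nodes, cpus in PARTITIONS[system])


if __name__ == "__main__":
    for name in PARTITIONS:
        print(f"{name}: {total_cpus(name)} processors")
```

Under these assumptions the sums agree with the 28, 22 and 32 processors quoted on the slide.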

Other nodes

• Some 200 UNIX nodes under our direct supervision (all UNIX flavours, single nodes and clusters).

• Around 100 PCs running Windows and Linux.

• Worth mentioning:
  - Linux Beowulf Cluster (10 PPro 200 + 4 PII 400), running MPI with the GAMMA protocol on Digital FE cards
  - Graphics nodes: 2 Alpha 533au(2) with 4D51T and 4D60T cards with 64 MB of texture memory
  - 2 Power-3 dual-processor AIX nodes

Network

• In 1998 our LAN became fully switched; we currently have around 100 100BaseT switch ports.

• Switch hardware: several Cabletron and Compaq switches interconnected via Gigabit Ethernet; we also use virtual LANs

• Principal nodes are on FDDI (22 DEC GigaSwitch ports)

• Planning to try Gigabit Ethernet at host level; a few GE cards are already under test on Sun and Linux

Distributed Storage

• A TCP/IP-less datastore with true data sharing across platforms is not yet available, so we are still investing in both NFS and AFS solutions.

• NFS is mainly used as a store for large data files, and as an element of the Staging System.

• AFS is used for home directories and as a store for collections of various ready-to-run software. We currently run 6 cells with some 300 GB online, also over WAN.

NFS: one more Filer

• Current NFS Server: F540 Network Appliance Filer with 150 GB of formatted RAID space on FE and FDDI.

• Just ordered: another Filer (F760 / 600 MHz / 1 GB) with 300 GB of RAID disk and GE/FDDI network interfaces
  - 3 times more NFS ops/sec than the F540
  - allows for clustering (better scalability)

AFS: news since last report

• Purchased AFS Source Code. This allowed us to compile AFS on Solaris/Intel (thanks to Rainer Toebbicke/CERN, who proved that this is possible).

• University of Rome-3 went Solaris/Intel also for DB (3 servers).

• Abdus Salam Centre for Theoretical Physics joined our AFS License.

• Upgraded central servers (now 3 Alpha 500au on FE and FDDI). They have proved to be very stable and performant.

• We go Fibre Channel!
  - Just ordered 280 GB of RAID-5/FC from Artecon
  - Dual active-active controllers
  - Gadzoox hubs and HBAs from Genroco
  - This system will replace most of the on-site AFS disks.

Tape access

• During 1998, all services that use the tape robotics operated seamlessly: AFS and ADSM backups, staging.

• Some 80 GB were deep-archived via the Staging System. With the F540 Filer we stage at 4+ Mbytes/sec, almost at the limit of the Timberline tape.

• In 1999 we plan to replace the STK Silo with a 9840 library:
  - doubles the tape speed
  - BABAR-compliant
  - smaller maintenance fees
  - frees physical space in the computer centre.
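
To put the staging figures in perspective, the time needed to move the archived volume can be estimated from the quoted rate. This is a rough back-of-the-envelope sketch in Python, not a measurement: it assumes decimal units (1 GB = 1000 MB), a sustained rate of exactly 4 Mbytes/sec (the slide says "4+"), and that staging stays tape-limited so that doubling the tape speed doubles the staging rate. The function name `transfer_hours` is hypothetical.

```python
# Rough estimate of how long it takes to stage a given volume to tape.
# ASSUMPTIONS: decimal units (1 GB = 1000 MB); a sustained 4 MB/s today;
# staging is tape-limited, so a drive twice as fast gives 8 MB/s.


def transfer_hours(volume_gb, rate_mb_per_s):
    """Hours needed to move volume_gb gigabytes at rate_mb_per_s MB/s."""
    seconds = volume_gb * 1000 / rate_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    current = transfer_hours(80, 4.0)  # rate quoted for the F540 + Timberline
    doubled = transfer_hours(80, 8.0)  # assumed rate after the tape speed doubles
    print(f"80 GB at 4 MB/s: {current:.1f} hours")
    print(f"80 GB at 8 MB/s: {doubled:.1f} hours")
```

Under these assumptions, the 80 GB archived so far corresponds to roughly five and a half hours of continuous streaming at the current rate, and about half that at the doubled rate.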

CASPUR and HEP

• Geographical AFS system support for INFN

• Regular ASIS mirroring over WAN to 17 INFN Sections across Italy

• Linux system support for INFN:
  - Linux tree maintenance
  - AFS-enabled bootable Linux CDs at the latest patchlevel.

• Software collaboration with CERN (ASIS, Linux, AFS).

• Regional Centre for BABAR: full-scale system support.

Gentes/Ateneo project

Scope: provide a turnkey computing environment for a generic research organization / university department.

• Fully Intel-based
• Desktop on Linux and/or WNT
• Just 4 Intel machines make up the core:
  - Entry Point Linux host with a firewall
  - AFS fileserver on Solaris
  - Management Linux host with YARD DBMS and https tools
  - General Services (mail, web, print, efax, ppp, majordomo etc.) on a single Linux (SMP) machine
• WNT/Linux AFS-based integration: single password, common filestore, YARD ODBC
• Client installation: cloning with Norton Ghost
• Progressing well. First presentation: June 1999.

Some short-term plans

• Compile AFS 3.5 Server on Solaris/Intel
  - will improve performance for en masse serving of small files

• Test FC on Linux (QLogic card)
  - first, to provide RAID space for the mail spool
  - next, to take a look at Global File System (with Seagate disks)

• Test FC on AIX
  - CASPUR will probably be asked to propose a set of high availability services for PCM; IBM DFS with FC RAID might make a good combination.

• Try LoadLeveler on Solaris
  - LSF is becoming too expensive (they charge per CPU)