![Page 1: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/1.jpg)
Site Report: The RHIC Computing Facility
HEPIX – Amsterdam
May 19-23, 2003
A. Chan
RHIC Computing Facility
Brookhaven National Laboratory
![Page 2: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/2.jpg)
Outline
Background
Mass Storage
Central Disk Storage
Linux Farms
Software Development
Monitoring
Security
Other Services
Summary
![Page 3: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/3.jpg)
Background
Brookhaven National Lab (BNL) is a U.S. government-funded, multi-disciplinary research laboratory
RCF formed in the mid-1990s to address computing needs of RHIC experiments
Became U.S. Tier 1 Center for ATLAS in the late 1990s
RCF is a multi-purpose facility (NHEP and HEP)
![Page 4: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/4.jpg)
Background (continued)
Currently 25 staff members (need more)
RHIC first collisions in 2000, now in year 3 of operations
5 RHIC experiments (BRAHMS, PHENIX, PHOBOS, PP2PP and STAR)
![Page 5: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/5.jpg)
Mass Storage
4 StorageTek tape silos managed via HPSS (9940A and 9940B)
Peak raw data rate to silos 350 MB/s (can do better)
Peak data rate to/from Linux Farm 180 MB/s (can do better)
Experiments have accumulated 618 TB of raw data (capacity for 5x more)
5 staff members oversee Mass Storage operations
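To put the quoted rates in context, the short sketch below estimates how long it would take to move the accumulated raw data at each peak rate. It only restates the figures on this slide; the decimal unit conversion (1 TB = 10**6 MB), the assumption of a sustained rate, and the function name `days_to_move` are illustrative choices, not part of the report.

```python
# Figures quoted on the Mass Storage slide.
RAW_DATA_TB = 618              # accumulated raw data
PEAK_RATE_TO_SILOS_MB_S = 350  # peak raw data rate to the tape silos
PEAK_RATE_TO_FARM_MB_S = 180   # peak data rate to/from the Linux Farm

MB_PER_TB = 1_000_000          # assumption: decimal units
SECONDS_PER_DAY = 86_400


def days_to_move(volume_tb: float, rate_mb_s: float) -> float:
    """Days needed to move volume_tb at a sustained rate of rate_mb_s."""
    seconds = volume_tb * MB_PER_TB / rate_mb_s
    return seconds / SECONDS_PER_DAY


print(f"{days_to_move(RAW_DATA_TB, PEAK_RATE_TO_SILOS_MB_S):.1f} days at the silo peak rate")
print(f"{days_to_move(RAW_DATA_TB, PEAK_RATE_TO_FARM_MB_S):.1f} days at the Linux Farm peak rate")
```

Under these assumptions the full data set corresponds to roughly 20 days of continuous transfer at the silo peak rate and roughly 40 days at the Linux Farm peak rate.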
![Page 6: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/6.jpg)
The Mass Storage System (1)
![Page 7: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/7.jpg)
The Mass Storage System (2)
![Page 8: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/8.jpg)
Central Disk Storage
24 Sun E450 servers running Solaris 8
140 TB of disks managed by Sun servers via Veritas
Fast access to processed (DST) data via NFS (back-up in HPSS)
Aggregate 600 MB/s data rate to/from Sun servers on average
5 staff members oversee Central Disk Storage operations
![Page 9: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/9.jpg)
Central Disk Storage (1)
![Page 10: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/10.jpg)
Central Disk Storage (2)
![Page 11: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/11.jpg)
Linux Farms
Provide the majority of CPU power in the RCF
Used for mass processing of RHIC data
Listed as 3rd largest cluster according to http://www.clusters500.org
5 staff members oversee all Linux Farm operations
![Page 12: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/12.jpg)
Linux Farm Hardware
Built with commercially available Intel-based servers
1097 rack-mounted, dual CPU servers
917,728 SpecInt2000
Reliable (0.0052 hardware failures/month-machine, about 6 failures/month at current size)
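The "about 6 failures/month" figure follows directly from the per-machine rate and the server count. The sketch below reproduces that arithmetic and also derives the average SpecInt2000 rating per server; the function names are illustrative, and treating failures as independent with a uniform rate across machines is an assumption.

```python
# Figures quoted on the Linux Farm Hardware slide.
SERVERS = 1097                   # rack-mounted, dual CPU servers
TOTAL_SPECINT2000 = 917_728      # aggregate rating of the farm
FAILURES_PER_MACHINE_MONTH = 0.0052


def expected_failures_per_month(servers: int, rate: float) -> float:
    """Expected hardware failures per month for the whole farm.

    Assumes every machine fails independently at the same rate.
    """
    return servers * rate


def specint_per_server(total: int, servers: int) -> float:
    """Average SpecInt2000 rating per dual-CPU server."""
    return total / servers


print(f"{expected_failures_per_month(SERVERS, FAILURES_PER_MACHINE_MONTH):.1f} failures/month")
print(f"{specint_per_server(TOTAL_SPECINT2000, SERVERS):.0f} SpecInt2000 per server")
```

The product is about 5.7 failures per month, consistent with the rounded figure on the slide, and the average rating works out to roughly 837 SpecInt2000 per server.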
![Page 13: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/13.jpg)
The growth of the Linux Farm
[Figure: Linux Farm capacity in KSpecInt2000 (axis 0 to 1000) by year, 1999 to 2003]
![Page 14: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/14.jpg)
The Linux Farm in the RCF (1)
![Page 15: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/15.jpg)
The Linux Farm in the RCF (2)
![Page 16: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/16.jpg)
Linux Farm Software
RedHat 7.2 (RHIC) and 7.3 (ATLAS)
Image installed with Kickstart
Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel)
Support for network file systems (AFS, NFS)
![Page 17: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/17.jpg)
Linux Farm Software (continued)
Support for LSF and RCF-designed batch software
System administration software to monitor & control hardware, software and infrastructure
GRID-like software (Ganglia, Condor, GLOBUS, etc.)
Scalability an important operational requirement
![Page 18: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/18.jpg)
Batch Jobs in the Linux Farm (1)
[Figure: CRS Batch Job Statistics. Total jobs submitted per month (axis 0 to 35,000) by year, 1999 to 2003, for BRAHMS, PHENIX, PHOBOS and STAR]
![Page 19: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/19.jpg)
Batch Jobs in the Linux Farm (2)
[Figure: CRS Batch Job Statistics. Efficiency (axis 0 to 100) by year, 1999 to 2003, for BRAHMS, PHENIX, PHOBOS and STAR]
![Page 20: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/20.jpg)
Software Development
GRID-like services for RHIC and ATLAS
GRID monitoring tools
GRID user management issues
4 staff members involved
![Page 21: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/21.jpg)
The USATLAS GRID Testbed
[Diagram: BNL US ATLAS Grid Configuration (from a talk at CHEP 03, La Jolla, 4/15/03). Labels: Internet; Submit Grid Jobs; Globus client; Grid Job Requests; Gatekeeper; Job manager; LSF Server1; LSF Server2; Disks; 2TB; 30MB/S; HPSS; atlas00; afs04,05; amds04; gridftp server; GridFtp; Globus Replica catalog; GIIS Server; Grid Status]
![Page 22: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/22.jpg)
GRID Monitoring
[Diagram: Monitoring Framework. Labels: Monitoring Database (ODBC+MYSQL) or RRD; DB Info. Providers; Data Collectors; Aggregate Service Index (GIIS); Grid-View (Web Server); Grid-info-search; four Information Providers (GRIS); four Sensors; Server; HPSS; Network; Computing Nodes]
![Page 23: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/23.jpg)
GRID User Management (1)
[Diagram: GUMS: A scalable Grid User Management System. Labels: Virtual Organization; User info (twice); UNM]
![Page 24: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/24.jpg)
GRID User Management (2)
[Schematic Diagram. Labels: VO User Registry Database; Regional Registration Authority?; Local Registration Authority; VO #2 Database; VO #3 …; Site User Info Database; Local Policy; Local Account Management; grid-mapfile; Site; Push; Pull; Push]
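The schematic ends in a site-level grid-mapfile, the Globus file in which each line maps a quoted certificate distinguished name (DN) to a local account. As a rough illustration of that last step only, the sketch below builds grid-mapfile text from a list of VO users and a local policy table. It is not the GUMS implementation: the `VOUser` class, the `build_grid_mapfile` function, the policy dictionary and the example DNs and account names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VOUser:
    """One entry pulled from a VO user registry (hypothetical structure)."""
    dn: str  # certificate distinguished name
    vo: str  # virtual organization the user belongs to


def build_grid_mapfile(users: list[VOUser], vo_to_account: dict[str, str]) -> str:
    """Return grid-mapfile text: one '"DN" account' line per mapped user.

    vo_to_account stands in for the site's local policy (an assumption):
    users whose VO has no entry are simply left unmapped.
    """
    lines = []
    for user in sorted(users, key=lambda u: u.dn):
        account = vo_to_account.get(user.vo)
        if account is None:
            continue  # local policy does not admit this VO
        lines.append(f'"{user.dn}" {account}')
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Example data is invented for illustration.
    users = [
        VOUser("/O=Example/OU=People/CN=Bob Beta", "atlas"),
        VOUser("/O=Example/OU=People/CN=Alice Alpha", "star"),
        VOUser("/O=Example/OU=People/CN=Carol Gamma", "unknown-vo"),
    ]
    policy = {"atlas": "usatlas1", "star": "starusr"}
    print(build_grid_mapfile(users, policy), end="")
```

Regenerating the whole file from registry data, rather than editing it by hand, is one way to keep the mapping consistent as the number of users grows.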
![Page 25: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/25.jpg)
Monitoring
Mix of open-source, RCF-designed and vendor-provided monitoring software
Persistency and fault-tolerant features
Near real-time information
Scalability requirements
![Page 26: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/26.jpg)
Mass Storage Monitoring
![Page 27: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/27.jpg)
Central Data Storage Monitoring
![Page 28: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/28.jpg)
Linux Farm Monitoring
![Page 29: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/29.jpg)
Batch Job Control & Monitoring
![Page 30: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/30.jpg)
Infrastructure Monitoring
![Page 31: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/31.jpg)
Security
Firewall to minimize unauthorized access
Most servers closed to direct, external access
User access through security-enhanced gateway systems
Security in the GRID-environment a big challenge
![Page 32: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/32.jpg)
Security at the RCF
![Page 33: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/33.jpg)
Other Services
Limited printer support
Off-site data transfer services (bbftp, rftp, etc.)
Nightly backups of critical file systems
![Page 34: Site Report: The RHIC Computing Facility](https://reader030.vdocuments.mx/reader030/viewer/2022032804/56812af8550346895d8edf27/html5/thumbnails/34.jpg)
Summary
Implementation of GRID-like services increasing
Hardware & software scalability more important as RCF grows
Security in the GRID era remains an important issue