Jefferson Lab Site Report
Kelvin Edwards
Thomas Jefferson National Accelerator Facility
Newport News, Virginia USA
757-269-7770
http://cc.jlab.org
HEPiX – October, 2004
Central Computing
• Email
  – Distracted by SPAM problem
  – Evaluated and purchased MXLogic
    • Offsite solution
    • Filters virus/spam before getting to Lab
  – Upgraded our email hardware
• Windows builds
  – Purchased MS Enterprise Agreement
  – Developed an automatic build process
  – Upgrading all of our systems to Windows XP
  – Still evaluating SP2, problems with CAD, etc.
File Server Storage
• Adaptec 2200S RAID and Linux XFS
  – Linux kernel 2.6 and Adaptec firmware (build 7244)
    • It doesn't work (I/O errors, etc.)
  – RedHat EL3 WS kernel works fine, but no XFS support
  – Tested ext3 performance
    • Unacceptable (20MB/s read, 34MB/s write)
    • XFS performance (approx 100MB/s read/write)
  – Dropped back to prior Adaptec BIOS and 2.6 kernel works fine
File Server Storage (cont)
• Purchased 2 StorageTek B280 systems
  – 14 TB of disk space
  – 4 Sun V210 head units
  – Stable, but slow, NFS performance
    • Aggregate -- 6MB/s write, 63MB/s read
    • Each node -- 0.13MB/s write, 1.4MB/s read average
File Server Storage (cont)
• Evaluating 10TB Panasas system
  – Tested 2 protocols (directFLOW and NFS)
  – No directFLOW problems
  – NFS finally stable at version 2.1.4c
  – Good performance with either
    • Aggregate -- 160-185MB/s write, 100-180MB/s read
    • Each node -- 3.5-5MB/s write, 2.5-4.5MB/s read
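Note: the aggregate NFS figures quoted for the StorageTek B280 and Panasas systems can be compared directly. The snippet below is only a rough arithmetic sketch; the dictionary layout and the `speedup` helper are illustrative names, and it assumes the two sets of numbers were measured under comparable load, which the slides do not state.

```python
# Aggregate throughput quoted on the slides, in MB/s, as (low, high) ranges.
storagetek = {"write": (6.0, 6.0), "read": (63.0, 63.0)}
panasas = {"write": (160.0, 185.0), "read": (100.0, 180.0)}


def speedup(new, old):
    """Return the (low, high) ratio of new to old aggregate throughput."""
    return (new[0] / old[1], new[1] / old[0])


for op in ("write", "read"):
    low, high = speedup(panasas[op], storagetek[op])
    print(f"{op}: Panasas is {low:.1f}x to {high:.1f}x the StorageTek aggregate")
```

On these figures the gap is much larger for writes (roughly 27x to 31x) than for reads (roughly 1.6x to 2.9x).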
Jasmine Changes
• Jasmine is JLab's mass storage system (disk+tape) that stores ~1PB and can routinely move 20TB/day.
• Disk cache system recently rewritten for performance and reliability
  – I/O load spread out over pool of many disk servers
  – Files belong to file groups (per experiment) with quotas
  – Quotas may be exceeded if there is enough disk space; allows more flexible use of disk
  – Files deleted from servers in a modified LRU fashion
  – Files may be pinned until used by the batch farm
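Note: the disk cache policy described above (soft per-group quotas, LRU-style deletion, pinning) can be illustrated with a toy model. This is not Jasmine's code. The class and method names are hypothetical, and the slides do not say how the LRU is "modified"; the sketch assumes that over-quota groups lose their oldest unpinned files first.

```python
from collections import OrderedDict


class DiskCache:
    """Toy quota-aware LRU disk cache (hypothetical, not Jasmine's implementation)."""

    def __init__(self, capacity, quotas):
        self.capacity = capacity    # total space available
        self.quotas = quotas        # group -> soft quota
        self.files = OrderedDict()  # name -> (group, size), least recently used first
        self.pinned = set()         # names that must not be deleted

    def used(self, group=None):
        """Space used by one group, or by all groups if group is None."""
        return sum(size for g, size in self.files.values() if group in (None, g))

    def touch(self, name):
        """Mark a file as most recently used."""
        self.files.move_to_end(name)

    def pin(self, name):
        self.pinned.add(name)

    def unpin(self, name):
        self.pinned.discard(name)

    def _victim(self):
        # Assumed "modified LRU": prefer the oldest unpinned file whose group
        # is over quota, otherwise fall back to the oldest unpinned file.
        candidates = [n for n in self.files if n not in self.pinned]
        for name in candidates:
            group = self.files[name][0]
            if self.used(group) > self.quotas.get(group, 0):
                return name
        return candidates[0] if candidates else None

    def add(self, name, group, size, pin=False):
        # Quotas are soft: files are deleted only when the disk itself is full.
        while self.used() + size > self.capacity:
            victim = self._victim()
            if victim is None:
                raise RuntimeError("no unpinned files left to delete")
            del self.files[victim]
        self.files[name] = (group, size)
        if pin:
            self.pinned.add(name)


cache = DiskCache(capacity=100, quotas={"hallA": 50, "hallB": 50})
cache.add("a1", "hallA", 40, pin=True)  # pinned, e.g. until a farm job reads it
cache.add("a2", "hallA", 40)            # hallA now over quota, allowed while space is free
cache.add("b1", "hallB", 40)            # disk full: a2 is deleted, a1 survives
print(list(cache.files))
```

Checking quotas only at deletion time is what lets a group exceed its quota while free space exists.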
Jasmine Changes (2)
• New programmatic interfaces for
  – Batch Farm (Auger)
  – Other services that need to move files (SRM, DAQ, LQCD disk cache)
• More reliance on MySQL database; concurrency and load are challenging
• Writing 9940B tapes
• Experiment data rates now ~30MB/sec
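Note: the two rates quoted for Jasmine (20TB/day moved routinely, ~30MB/sec from experiments) use different units. A small conversion sketch puts them on the same scale; it assumes decimal units (1 TB = 1,000,000 MB), which the slides do not specify, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 1_000_000  # assumption: decimal units


def tb_per_day_to_mb_per_s(tb_per_day):
    """Convert a sustained TB/day rate to MB/s."""
    return tb_per_day * MB_PER_TB / SECONDS_PER_DAY


def mb_per_s_to_tb_per_day(mb_per_s):
    """Convert a sustained MB/s rate to TB/day."""
    return mb_per_s * SECONDS_PER_DAY / MB_PER_TB


print(f"20 TB/day is about {tb_per_day_to_mb_per_s(20):.0f} MB/s sustained")
print(f"30 MB/s is about {mb_per_s_to_tb_per_day(30):.2f} TB/day")
```

Under that assumption, 20TB/day is about 231MB/s sustained, and a 30MB/sec experiment stream is about 2.59TB/day if it runs around the clock.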
Auger Changes
• Auger is JLab's batch farm management system.
• Uses LSF to run jobs; keeps accounting in a database for web or command line presentation.
• Users can submit thousands of jobs using a compact job description that includes file retrieval and storage.
• Interfaces with Jasmine to stage files to disk before the job runs on the farm, to keep CPUs busy
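Note: the idea of one compact description expanding into many jobs, each released only once its input file is on disk, can be sketched as follows. The slides do not show Auger's actual job description format or its interface to Jasmine, so every name here (`JobDescription`, `Job`, `expand`, `stage`, `runnable`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobDescription:
    """Hypothetical compact description: one command applied to many input files."""
    command: str
    input_files: list
    output_dir: str


@dataclass
class Job:
    command: str
    input_file: str
    output_dir: str
    staged: bool = False  # True once the input file is on cache disk


def expand(desc):
    """Expand one description into one job per input file."""
    return [Job(desc.command, f, desc.output_dir) for f in desc.input_files]


def stage(job, disk_cache):
    """Stand-in for asking the mass storage system to copy a file from tape to disk."""
    disk_cache.add(job.input_file)
    job.staged = True


def runnable(jobs):
    """Only jobs whose input is already on disk go to the batch system."""
    return [j for j in jobs if j.staged]


desc = JobDescription("analyze", [f"run{i:04d}.dat" for i in range(3000)], "/out")
jobs = expand(desc)
disk_cache = set()
for job in jobs[:10]:
    stage(job, disk_cache)
print(len(jobs), "jobs described,", len(runnable(jobs)), "ready to run")
```

Holding back unstaged jobs is what keeps CPUs from sitting idle while a tape is being read.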
Jasmine & Auger Web Interface
• Java Server Pages
Projects
• Email upgrade
  – Still evaluating software/hardware
• Desktop systems
  – MacOS-X
  – Linux, Unix
  – Windows
• Power/Cooling issues
  – Reached limit of current Computer Room
  – New Computer Center to open in Jan 2006
  – Increased power requirements for 800 MHz FSB systems
    • 1.3A to 2.1A (single CPU)
    • 1.6A to 2.8A (dual CPU)
  – Shutdown problems with non-ACPI enabled systems
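Note: the per-system current figures above translate into a relative increase and into fewer systems per circuit. The sketch below is arithmetic only. The 20A budget is a hypothetical figure chosen for illustration, not a JLab number, and no derating is applied.

```python
import math

# Per-system current draw quoted on the slide, in amps: (before, after).
draw = {"single CPU": (1.3, 2.1), "dual CPU": (1.6, 2.8)}

BUDGET_AMPS = 20.0  # hypothetical per-circuit budget, not from the slides


def percent_increase(before, after):
    """Relative increase in current draw, in percent."""
    return (after - before) / before * 100


def systems_per_circuit(amps_per_system, budget_amps):
    """Whole number of systems that fit within a current budget."""
    return math.floor(budget_amps / amps_per_system)


for kind, (before, after) in draw.items():
    print(
        f"{kind}: +{percent_increase(before, after):.0f}% current, "
        f"{systems_per_circuit(before, BUDGET_AMPS)} -> "
        f"{systems_per_circuit(after, BUDGET_AMPS)} systems per {BUDGET_AMPS:.0f}A"
    )
```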