CERN Computer Centre Tier SC4 Planning – FZK, October 20th 2005 – harry.renshall@cern.ch
TRANSCRIPT
See https://uimon.cern.ch/twiki/bin/view/LCG/ServiceChallengeFourProgress – the Twiki shows work in progress:
– Service Level Definition – what is required
– Technical Factors – components, capacity and constraints
– LCG Service Co-ordination Meeting status
– The set of activities required to deliver the building blocks on which SC4 can be built
This leads to our (evolving) hardware configurations for grid servers, operational procedures and staffing.
We hope it will prove useful to other sites.
Using existing buildings
Physical location – B513
» Main Computer Room, ~1,500m2 & 1.5kW/m2, built for mainframes in 1970, upgraded for LHC PC clusters 2003-2005.
» Second ~1,200m2 room created in the basement in 2003 as additional space for LHC clusters and to allow ongoing operations during the main room upgrade. Cooling limited to 500W/m2.
» Contains half of tape robotics (less heat/m2).
– Tape Robot building ~50m from B513
» Constructed in 2001 to avoid loss of all CERN data due to an incident in B513. Contains half of tape robotics.
Capacity today
– 2,000 kSI2K batch – 1,100 worker nodes
– Adding 2,000 kSI2K in December
10 STK tape silos of 6,000 slots
– 5 interconnected silos in each of two separate buildings
– Physics data split between them
– About half of slots now occupied (after media migration)
– 50 9940B tape drives – 30 MB/sec
– 200GB capacity cartridges – 6PB total
About 2PB raw disk storage – older servers are used mirrored, newer as RAID.
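As a quick sanity check of the "Capacity today" figures, the sketch below recomputes the tape numbers and the usable fraction of the 2PB raw disk; the RAID overhead factor is an assumption, since the slide does not say which RAID level the newer servers use.

```python
# Quick arithmetic check of the "Capacity today" figures (a sketch, not an
# official capacity model). The RAID overhead is an assumed value.

silos = 10                    # STK silos (5 in each of two buildings)
slots_per_silo = 6000
occupied_fraction = 0.5       # "about half of slots now occupied"
cartridge_gb = 200

tape_pb = silos * slots_per_silo * cartridge_gb / 1e6   # total slot capacity
used_pb = tape_pb * occupied_fraction
print(f"Tape slot capacity: {tape_pb:.0f} PB, of which ~{used_pb:.0f} PB occupied")
# -> 12 PB of slots; the quoted 6PB total matches the roughly half that is filled

raw_disk_pb = 2.0
mirrored_usable = raw_disk_pb * 0.5           # mirroring halves usable space
raid_usable = raw_disk_pb * (1 - 1 / 8)       # ASSUMPTION: e.g. 8-disk RAID-5 sets
print(f"Usable disk: {mirrored_usable:.1f}-{raid_usable:.2f} PB depending on layout")
```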
Activities
» Physics computing services
  Interactive cluster – lxplus
  Batch computing – lxbatch
  Data recording, storage and management
  Grid computing infrastructure
» Laboratory computing infrastructure
  Campus networks – general purpose and technical
  Home directory, email & web servers (10k+ users)
  Administrative computing servers
Physics Computing Requirements
25,000 kSI2K in 2008, rising to 56,000 kSI2K in 2010
– 2,500-3,000 boxes (multicore, blade … ?)
– 500kW-600kW @ 200W/box
– 2.5MW @ 0.1W/SI2K
6,800TB online disk in 2008, 11,800TB in 2010
– 1,200-1,500 boxes
– 600kW-750kW
15PB of data per year
– 30,000 500GB cartridges/year
– Five 6,000-slot robots/year
Sustained data recording at up to 2GB/s
– Over 250 tape drives and associated servers
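A back-of-the-envelope check of these requirement figures; the sketch only re-derives the slide's own 2008 numbers from the stated unit values.

```python
# Re-derive the 2008 requirement figures from the stated unit values
# (a sanity check, not an official sizing exercise).

cpu_si2k_2008 = 25_000_000          # 25,000 kSI2K
watts_per_si2k = 0.1
print(f"CPU power at 0.1 W/SI2K: {cpu_si2k_2008 * watts_per_si2k / 1e6:.1f} MW")   # 2.5 MW

boxes = (2500, 3000)
watts_per_box = 200
low, high = (b * watts_per_box / 1e3 for b in boxes)
print(f"CPU power at 200 W/box: {low:.0f}-{high:.0f} kW")                          # 500-600 kW

data_pb_per_year = 15
cartridge_gb = 500
cartridges = data_pb_per_year * 1e6 / cartridge_gb
print(f"Cartridges/year: {cartridges:,.0f}")                                       # 30,000
print(f"Robots/year (6,000 slots each): {cartridges / 6000:.0f}")                  # 5
```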
Tape plans
By end 2005 we will have 40 high-duty-cycle new-model tape drives and matching robotics from each of IBM (3592B) and another vendor, for evaluation.
Drive data rates are expected to approach 100MB/sec
Cartridge sizes are expected to approach 500GB
Cartridge costs are the canonical US$120, so about 25 cts/GB (compared with 60 cts/GB today).
For LHC startup operations we plan on 200 drives with these characteristics.
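The sketch below re-checks the quoted media cost per GB and, as a rough aside, the aggregate nominal bandwidth of 200 such drives; real throughput will be lower once duty cycle, repositioning and read traffic are taken into account.

```python
# Media cost and nominal aggregate drive bandwidth for the planned startup
# configuration (a rough check; effective rates will be lower in practice).

cartridge_cost_usd = 120
cartridge_gb = 500
print(f"Media cost: {100 * cartridge_cost_usd / cartridge_gb:.0f} cts/GB")   # ~24 cts/GB

drives = 200
mb_per_sec = 100                       # "expected to approach 100MB/sec"
aggregate_gb_s = drives * mb_per_sec / 1000
print(f"Nominal aggregate: {aggregate_gb_s:.0f} GB/s vs 2 GB/s sustained recording target")
```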
Grid operations servers
Hardware matched to QoS requirements
– Today mostly on ad-hoc older disk servers/farm PCs
– Migrate immediately critical/high services to more reliable but simple mid-range servers
– Evaluate high-availability solutions to be deployed by SC4 startup, looking at:
» FC SAN multiple host/disk interconnects
» HA Linux (automatic failover)
» Logical volume replication
» Application-level replication
» Ready-to-go spare hardware for less critical services (with simple operational procedures)
– Objective is to reach 24x7 availability levels (a conceptual failover sketch follows below).
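As a minimal conceptual illustration of the "automatic failover" idea above, the sketch below polls a primary service and decides when to promote a spare. It is not the HA Linux or replication setup actually under evaluation; the hostnames, port and thresholds are hypothetical.

```python
# Conceptual sketch of an automatic-failover health check in the spirit of the
# "HA Linux (automatic failover)" option above. This is NOT the configuration
# under evaluation; hostnames, port and thresholds are hypothetical examples.
import socket
import time

PRIMARY = ("gridsrv-primary.example.cern.ch", 8443)   # hypothetical service endpoint
SPARE = ("gridsrv-spare.example.cern.ch", 8443)       # ready-to-go spare
FAILURES_BEFORE_FAILOVER = 3
CHECK_INTERVAL_S = 10


def is_alive(host, port, timeout=5.0):
    """Return True if a TCP connection to the service can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def monitor():
    failures = 0
    while True:
        if is_alive(*PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                # A real setup would move the service IP / storage to the spare
                # node; here we only report the decision.
                print(f"Primary down {failures} checks in a row: fail over to {SPARE[0]}")
                break
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    monitor()
```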
Mid-range server building block
– Dual 2.8GHz Xeon, 2GB memory, 4 hot-swap 250GB disks
Current Oracle RAC cluster building blocks
– Fibre-channel disks/switches infrastructure
Who
– Contract Shift Operators: 1 person 24x7
– Technician-level System Administration Team
» 10 team members, plus 3 people for machine room operations, plus an engineer-level manager. 24x7 on-call.
– Engineer-level teams for Physics computing
» System & hardware support: approx. 10 FTE
» Service support: approx. 10 FTE
» ELFms software: 3 FTE plus students and collaborators
~30FTE-years total investment since 2001