CERN IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it HEPiX Report Helge Meinhard, David Gutierrez, Jérôme Belleman / CERN-IT Technical Forum/Computing Seminar 16 November 2012


CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it

HEPiX Report

Helge Meinhard, David Gutierrez, Jérôme Belleman / CERN-IT

Technical Forum/Computing Seminar
16 November 2012

Outline

• Meeting organisation; site reports; computing; miscellaneous (Helge Meinhard)

• Security and networking; storage (David Gutierrez)

• IT infrastructure; grids, clouds and virtualisation (Jérôme Belleman)

HEPiX report – Helge.Meinhard at cern.ch – 16-Nov-2012

HEPiX

• Global organisation of service managers and support staff providing computing facilities for HEP

• Covering infrastructure and all platforms of interest (Unix/Linux, Windows, Grid, …)

• Aim: Present recent work and future plans, share experience, advise managers

• Meetings ~ 2 / y (spring in Europe, autumn typically in North America)


HEPiX Autumn 2012 (1)

• Held 15 – 19 October at the Institute of High Energy Physics (IHEP) of the Chinese Academy of Sciences, Beijing, People’s Republic of China
  – 1300 staff, 400 students, 120M $ budget
  – BEPC II accelerator, BES III detector; members in Belle II, CMS, ATLAS; neutrino experiments
  – Particle astrophysics, theory, synchrotron lab
  – Tier 2 centre in LCG for ATLAS and CMS

• Excellent local organisation
  – Gang Chen and his team made the meeting run very smoothly
  – Network including Wifi, video conferencing (Vidyo – 4 remote presentations), … all working like a charm
  – Beijing: Growing and changing at an incredible speed
    • Cars have almost entirely replaced bicycles…

• Sponsored by Huawei and Western Digital

HEPiX Autumn 2012 (2)

• Format: Pre-defined tracks with conveners and invited speakers per track
  – Rich, interesting and packed agenda
    • Contrary to last time, Silverman’s law applied once more – agenda was full, but not overcrowded
  – Judging by number of submitted abstracts, good balance between tracks: IT infrastructure (12 talks), network and security (11 talks), computing (8 talks), grids/clouds/virtualisation (7 talks), storage and file systems (7 talks), miscellaneous (4 talks) … plus one BoF session (on batch systems) and 11 site reports

• Full details and slides: http://indico.cern.ch/conferenceDisplay.py?confId=199025

• Trip report by Alan Silverman available, too: http://cdsweb.cern.ch/record/1485643


HEPiX Autumn 2012 (3)

• 67 registered participants, of which 9/10 from CERN
  – CERN: Belleman, Cass, Fedorko, Grzywaszewski, Gutierrez, Lopienski, Meinhard, Salter, (Silverman,) Traylen
  – 20 from Asia, 39 from Europe, 6 from USA, 2 from Australia
  – Plus some more colleagues from IHEP

• Representing 27 institutes, 2 sponsors
  – 9 from Asia, 15 from Europe, 2 from USA, 1 from Australia
  – 2 worldwide sponsor companies

• Compare with Prague (spring 2012): 97 participants, of which 12/13 from CERN; Vancouver (autumn 2011): 98 participants, of which 10/11 from CERN
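
The attendance figures above can be cross-checked with a few lines of Python. This is only a sketch: the dictionary and function names are made up for illustration, and where the slide gives a pair such as "9/10 from CERN" the lower value is used, which is an assumption.

```python
# Attendance figures as quoted on the slide. For "9/10"-style pairs the
# lower CERN count is used (an assumption made for this sketch).
participants_by_region = {"Asia": 20, "Europe": 39, "USA": 6, "Australia": 2}
institutes_by_region = {"Asia": 9, "Europe": 15, "USA": 2, "Australia": 1}

# Meeting name -> (registered participants, of which from CERN)
meetings = {
    "Beijing (autumn 2012)": (67, 9),
    "Prague (spring 2012)": (97, 12),
    "Vancouver (autumn 2011)": (98, 10),
}


def cern_share(total: int, from_cern: int) -> float:
    """Return the CERN share of participants in percent."""
    return 100.0 * from_cern / total


if __name__ == "__main__":
    # The regional breakdowns should add up to the quoted totals (67 and 27).
    print("participants:", sum(participants_by_region.values()))
    print("institutes:", sum(institutes_by_region.values()))
    for name, (total, from_cern) in meetings.items():
        print(f"{name}: {cern_share(total, from_cern):.1f} % from CERN")
```

The regional breakdowns add up to the quoted 67 participants and 27 institutes, and the CERN share stays in the range of roughly 10 to 13 % across the three meetings.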


HEPiX Autumn 2012 (4)

• 60 talks, of which 13 from CERN
  – Compare with Prague: 74 talks, of which 22 from CERN
  – Compare with Vancouver: 55 talks, of which 15 from CERN

• Next meetings:
  – Spring 2013: CNAF, Bologna, Italy, 15 – 19 April
    • Batch systems; energy efficiency; network monitoring?; Windows 8 etc.?
  – Autumn 2013: U Michigan, Ann Arbor, US, 28 October – 01 November
  – Spring 2014: Interest by LAPP Annecy (to be confirmed)


Site reports (1): Hardware

• Only few details this time

• CPU servers: same trends
  – 12...48 core dual-CPU servers, 2...4 GB/core. Typical chassis: 2U Twin2; some A-brand blades (one failing blade has taken entire chassis down)

• Disk storage
  – External disk enclosures gaining popularity
    • 4U trays with 48…60 drives becoming popular
    • No positive indication that SAS nearline is taking up
  – A-brand’s extension disk tray has got firmware…
  – IBM Sonas


Site reports (2): Hardware (cont’d)

• Tapes
  – An increasing number of sites mentioned T10kC in production
  – LTO popular, many sites investigating (or moving to) LTO5; some migration from LTO to T10kC

• HPC
  – IB still popular; two large clusters at GSI, one for computing, one for parallel file system (Lustre)
  – 10GE ramping up

• Odds and ends: Suppliers going bust


Site reports (3): Software

• Storage
  – CVMFS now a standard service – minor issues only
  – Increasing interest in NFS interfaces for dCache and DPM
  – Lustre mentioned often – works well with controlled use cases and new, homogeneous hardware, but issues with some use cases and older hardware
  – Enstore/dCache: small file aggregation in production at FNAL

• OS
  – GSI moving from flat to hierarchical Windows domain (domain controllers on VMs); LAL has completed the move of its Windows domain to the IN2P3 “forest”

• Mail/calendaring services
  – Migrations from Exchange 2003 and/or Lotus to Exchange 2010 (FNAL: 3’000 accounts total)
  – DESY considering alternative solutions to replace Exchange 2003: Open-Xchange, Zimbra, Zarafa


Site reports (4): Software (cont’d)

• Virtualisation
  – Most sites experimenting with KVM
  – Some use of VMware (and complaints about cost level…) and of Hyper-V
  – Australia: migrating from KVM to Citrix
  – Most sites run critical services on VMs

• Clouds
  – OpenStack
  – OpenNebula

• Miscellaneous: DokuWiki, Redmine, git

• Configuration management
  – Puppet seems to be clear winner, still on the rise
  – Chef, Quattor used as well
  – Declining interest in cfengine (3)


Site reports (5): Software (cont’d), Infrastructure

• Monitoring
  – Some sites migrating from Nagios to Icinga, one site considering Zenoss
  – Ganglia used frequently for performance monitoring
  – PerfSONAR being deployed everywhere

• Infrastructure
  – A number of upgrade projects (IHEP from 800 kW to 1’800 kW)
  – GSI: Cube prototype working fine even at 32 deg outside
  – RAL: switch gear in power supply line being replaced, higher risk until end November
  – LAL: Major chiller failure
  – FNAL: During hot summer, had to throttle down major services
  – DESY: During power supply maintenance, batteries on full load – some exploded, acid on the floor… resulting in extended power outage
  – DESY: 20 kUSD network line card destroyed by concrete dust due to drilling


Computing: Batch systems (1)

• 8 talks, BoF session

• Site reports: Torque/Maui for small to medium size installations; PBSpro; GridEngine; Slurm (mentioned 3 times)

• FNAL: HTCondor since 2002 for part of their facilities (CDF)
  – Many features added on FNAL’s request
  – Main scalability concern is condor_schedd; single-threaded, supports up to 30 k simultaneous jobs now, goal is 150 k

• CERN: LSF: large installation, heterogeneous user base, 400 k jobs per day
  – Issues: slow response to queries and submissions, slow dispatching, fairshare scheduling, setup complex, poorly dynamic, limited scalability
  – Targeting 12’000 physical nodes, 300’000 job slots
  – Currently looking at Slurm, GE, Condor, LSF8
  – Recent work on monitoring and accounting
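
To put the CERN figures above into perspective, the following back-of-the-envelope sketch converts 400 k jobs per day into an average rate and derives the job slots per node implied by the targets. The names are illustrative only, and spreading the jobs evenly over 24 hours is an assumption; real submission patterns are bursty, so peak rates will be higher.

```python
# Back-of-the-envelope figures derived from the CERN LSF numbers on the slide.
# Assumption: jobs are spread evenly over the day (real load is bursty).
JOBS_PER_DAY = 400_000    # quoted current throughput
TARGET_NODES = 12_000     # targeted physical nodes
TARGET_SLOTS = 300_000    # targeted job slots

SECONDS_PER_DAY = 24 * 60 * 60


def average_jobs_per_second(jobs_per_day: int) -> float:
    """Average job rate in jobs per second for a given daily throughput."""
    return jobs_per_day / SECONDS_PER_DAY


def slots_per_node(slots: int, nodes: int) -> float:
    """Average number of job slots per physical node."""
    return slots / nodes


if __name__ == "__main__":
    print(f"{average_jobs_per_second(JOBS_PER_DAY):.1f} jobs/s on average")
    print(f"{slots_per_node(TARGET_SLOTS, TARGET_NODES):.0f} job slots per node")
```

Under these assumptions the scheduler has to handle about 4.6 jobs per second on average, and the targets correspond to 25 job slots per physical node.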


Computing: Batch systems (2)

• KIT: 1000 nodes, split into two PBS instances due to PBS limitations
  – Tested Torque/Maui, GE; selected Univa GE
  – Migration started in July, to finish in December
  – GE: learning curve… but stable, flexible, with good support

• IN2P3-CC: Migration to Oracle GE completed in December 2011
  – A lot of interfacing done by IN2P3
  – Shadow master abandoned due to instabilities
  – Difficult to get job information; no native grid support
  – Oracle support not brilliant; difficult to get in contact with developers; no road map for GE; only serious bugs got fixed
  – Getting in touch with Univa


Computing: Batch systems (3)

• DESY Zeuthen: Added Certificate Security Protocol support to UGE

• NDGF: Slurm experience: very positive, easier and more stable than predecessors (Torque/Maui)
  – Defaults often not adequate, tuning needed

• INFN Bari: Testing Slurm
  – Tested a long list of functionalities
  – Scheduling powerful, but can be improved by using MOAB or LSF scheduler
  – No RPM; no way to transfer output back to submission host
  – Rather steep learning curve
  – Tests with 6’000 cores and 100’000 jobs all successful, very moderate load on master
  – Grid integration (Cream CE) progressing


Miscellaneous

• CERN mobile Web site
