
More than Three Years of Compute Farm

Benigno Gobbo
Benigno.gobbo@cern.ch
Consiglio di Sezione, 15 March 2004

Info: http://www.ts.infn.it/acid
      [email protected] (at ts.infn.it)


Requirements

COMPASS: high statistics, medium event complexity
  - ~10^10 events/year
  - ~10 “good” tracks/event
  - More than 200 tracking planes in a non-uniform magnetic field
  - Particle identification: RICH, calorimeters, …

Non-trivial event reconstruction
  - Production time: ~0.5 s per event on a 1 GHz PIII CPU

Data storage, production and analysis model
  - Raw data stored at CERN (~300 TB/year)
  - Production at CERN: up to 400 reserved batch queues (CPUs)
  - Monte Carlo production and data analysis at home labs

Need for compute farms at home laboratories
  - Also due to the usual CERN request for computing redistribution: 33% at CERN, 67% outside
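As a rough cross-check of the figures on this slide, the sketch below multiplies the quoted event rate by the quoted reconstruction time. It assumes that the ~0.5 s figure is per event, that each reserved batch queue corresponds to one fully used 1 GHz PIII CPU, and that there is no I/O or scheduling overhead; the function name is illustrative and does not come from the slides.

```python
SECONDS_PER_DAY = 86_400


def production_days(events_per_year: float, seconds_per_event: float, n_cpus: int) -> float:
    """Wall-clock days needed to reconstruct one year of data on n_cpus CPUs.

    Assumes perfect parallelism and no overhead (an idealisation).
    """
    cpu_seconds = events_per_year * seconds_per_event
    return cpu_seconds / n_cpus / SECONDS_PER_DAY


# Figures quoted on the slide: ~10^10 events/year, ~0.5 s per event, up to 400 CPUs.
total_cpu_days = production_days(1e10, 0.5, 1)
days_on_400 = production_days(1e10, 0.5, 400)

print(f"{total_cpu_days / 365:.1f} CPU-years for one year of data")
print(f"{days_on_400:.1f} days of wall-clock time on 400 CPUs")
```

Under these assumptions one year of data costs somewhat under 160 CPU-years, or about 145 days on 400 CPUs, which gives an idea of why production stays at CERN while Monte Carlo and analysis move to the home labs.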


A different Computing Model

1998. Definition of a Computing Model for the post-LEP era
  - January 1998. A Task Force was established at CERN (1)
      - To achieve: agreement with the time scale and requirements of experiments, flexibility of environment, constraints from the commercial software in use, realistic assessment of costs, …
  - April 1998. Conclusions (Recommendations): Hybrid Architecture
      - using PCs for computation (preferred: Windows NT, “tolerated”: Linux)
      - using, at present, RISC systems for I/O (legacy Unix)

1999. Evolution of the model
  - Significant Linux improvements: now stable and better performing than Win NT
  - Development of “low price + good enough quality” IDE-disk-based PC servers

COMPASS definitive choice:
  - PCs for both server and computation machines
  - (RedHat) Linux OS


The Farm History

Sep. 2000. Approved (and above all “sponsored”!) by CSN I
  - Financed over two years: 200M ITL in 2000, 124k € in 2001

Oct. 2000. Definition of a schema for the farm “initial setup”
  - The farm has to be as compatible as possible with the CERN one
      - But not CERN-dependent
  - The “initial setup” must guarantee a “production environment”
      - Enough disk space (for data storage and MC production)
      - Enough CPU power (i.e. PC clients)
  - It must be scalable to the final configuration without (major) modifications
  - It must fit within the approved financing


History: first steps

Nov. 2000. “Initial setup” decided, orders submitted
  - 1 PC server with large EIDE disk space (14 x 75 GB EIDE disks)
      - Configured as RAID1 (mirroring), it provided 0.5 TB of (cheap) disk storage
      - The machine was assembled by ELONEX following a CERN R&D
  - 1 Sun server with external SCSI disks (8 x 73 GB)
      - Configured as RAID5, it gave 0.47 TB of more reliable disk storage
      - Different OS (Solaris) and architecture (SPARC): allows better testing and debugging of software
  - 1 PC supervision server
      - Nothing special: just a white-box PC with better components. Used as a supervisor or master in monitoring or client-server software
  - 12 PC clients
      - Value white-box PCs, to stay within the available budget
  - All machines are dual-processor to improve the performance/cost ratio
      - Well… the Sun was bought as single-processor (it was so expensive…) and upgraded subsequently
  - Network switch (36 100BaseT + 3 1000BaseSX ports)
  - KVM switches, rack, shelves, monitor, keyboard, etc.
  - UPS and cooling system (thanks to A. Mansutti & S. Rizzarelli)
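The disk figures above follow from the usual RAID capacity rules. The sketch below is a minimal illustration under stated assumptions: it computes nominal capacities only, ignores filesystem and controller overhead, and treats each server as a single array of identical disks. The function name is hypothetical.

```python
def usable_gb(n_disks: int, disk_gb: float, level: str) -> float:
    """Nominal usable capacity in GB for one array of identical disks."""
    if level == "raid0":
        return n_disks * disk_gb          # striping: all raw space, no redundancy
    if level == "raid1":
        return n_disks * disk_gb / 2      # mirroring: half of the raw space
    if level == "raid5":
        if n_disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        return (n_disks - 1) * disk_gb    # one disk's worth of space holds parity
    raise ValueError(f"unsupported RAID level: {level}")


eide_gb = usable_gb(14, 75, "raid1")   # EIDE PC server: 14 x 75 GB, mirrored
sun_gb = usable_gb(8, 73, "raid5")     # Sun server: 8 x 73 GB external SCSI disks

print(eide_gb / 1000, "TB nominal on the EIDE server")
print(sun_gb / 1000, "TB nominal on the Sun server")
```

The nominal values are 0.525 TB and 0.511 TB. The slide quotes 0.5 TB and 0.47 TB; the difference is presumably formatting overhead or binary units, which this sketch does not model.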


History. Feb. 2001: “First setup” in production

First Linux Compute Farm locally installed and completely managed by INFN personnel


History: the final setup

Sep. 2001. Start of the farm upgrade to the final setup
  - 1 more EIDE PC server (with 20 x 80 GB EIDE disks)
      - Configured as RAID1: 0.75 TB
  - Upgrade of the previous EIDE server with 6 additional 80 GB disks
      - Now it provides 0.72 TB (RAID1)
  - Upgrade of the Sun to dual processor
  - STK tape library: 20 slots (can be upgraded to 40), 2 IBM Ultrium drives (can have 4 drives)
      - It can store up to 4 TB of data. Drive transfer rate up to 30 MB/s
  - 1 Dell PC tape server, with 6 x 73 GB SCSI disks configured as RAID 0 (striping)
      - To be used with the tape library to form an HSM system
  - 19 PC clients
      - White-box machines, dual 1 GHz PIII
  - 12-port 1000BaseSX switch
  - KVM switches, etc.
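The tape library figures can be turned into two quick derived numbers: the capacity per cartridge implied by "20 slots, up to 4 TB", and a best-case time to fill the library. This is only a sketch: it assumes decimal units, that both drives stream continuously at the quoted peak rate, and no mount or positioning time. The function names are illustrative.

```python
def implied_cartridge_gb(library_tb: float, slots: int) -> float:
    """Capacity per cartridge implied by the total library capacity."""
    return library_tb * 1000 / slots


def hours_to_fill(library_tb: float, drives: int, mb_per_s: float) -> float:
    """Best-case hours to write the whole library with all drives at peak rate."""
    total_mb = library_tb * 1_000_000
    return total_mb / (drives * mb_per_s) / 3600


per_cartridge = implied_cartridge_gb(4, 20)   # 20 slots, up to 4 TB
fill_hours = hours_to_fill(4, 2, 30)          # 2 drives, up to 30 MB/s each

print(f"{per_cartridge:.0f} GB per cartridge")
print(f"{fill_hours:.1f} hours to fill the library at peak rate")
```

With the slide's numbers this gives 200 GB per cartridge and about 18.5 hours. Since both quoted values are "up to" figures, real throughput would be lower.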


History: the 2002 “Final Setup”

  - 11 old clients: MSI 694D Pro, dual PIII 800 MHz, 2 x 20 GB ATA disk, 512 MB RAM
  - 19 new clients: Abit VP6, dual PIII 1000 MHz, 2 x 40 GB ATA disk, 512 MB RAM
  - Network: 3com 4900, 3com 3900, KVM switch
  - Server (SGE, DHCP, BB, …): Asus CUR-DLS, dual PIII 800 MHz, 2 x 36 GB SCSI disk, 512 MB RAM, GA620 G gigabit
  - EIDE disk server: Intel L440 GX+, dual PIII 700 MHz, 2 x 15 GB ATA disk, 14 x 75 GB ATA disk, 6 x 80 GB ATA disk, GA620 G gigabit
  - EIDE disk server: Intel STL2, dual PIII 866 MHz, 2 x 20 GB ATA disk, 20 x 80 GB ATA disk, GA620 G gigabit
  - Tape library: STK L40, 20 slots, 2 x IBM Ultrium
  - Tape/disk server: Dell PowerEdge 4400, dual Xeon 1 GHz, 2 x 36 GB SCSI RAID1, 6 x 73 GB SCSI RAID0
  - SCSI disk server: Sun Blade 1000, dual SparcIII 750 MHz, 18 GB SCSI FC disk, 8 x 73 GB SCSI RAID5
  - CRD-5440


History: up to now and in the near future

2002 - 2003. Upgrades
  - Additional EIDE PC server with 20 x 200 GB disks
      - Powerful machine (dual Xeon). 4 RAID5 partitions providing 3 TB of disk space
  - PC server for Oracle/DB with 12 x 200 GB disks
      - To contain the event database
  - HP PC server with 6 x 142 GB SCSI disks
  - STK tape library upgrade from 20 to 40 slots
      - Now it can store up to 8 TB of data

2004. Financed
  - Ultrium2 tape drive for the STK tape library
      - Up to 400 GB/cartridge, up to 70 MB/s transfer rate
  - ~10 PC clients
      - Rack-mount dual Xeon machines

  - EIDE disk server: Intel SE7500CW2, dual Xeon 2 GHz, 1 GB RAM, 2 x 40 GB + 20 x 200 GB ATA, Netgear GA 621
  - Oracle server: SuperMicro X5DP8-G2, dual Xeon 2.4 GHz, 2 GB RAM, 2 x 20 GB + 12 x 200 GB ATA, 3com 3C996-SX
  - HP Proliant ML530G2: dual Xeon 2.8 GHz, 2 GB RAM, 2 x 36 + 6 x 146.8 SCSI, gigabit
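The "4 RAID5 partitions, 3 TB" figure for the new EIDE server can be checked with a short calculation. The slide does not say how the 20 data disks are grouped; the sketch below assumes, purely for illustration, four equal RAID5 arrays of five disks each, and computes nominal capacity without filesystem overhead. The function name is hypothetical.

```python
def raid5_groups_usable_gb(n_disks: int, disk_gb: float, n_groups: int) -> float:
    """Nominal usable GB when n_disks are split evenly into n_groups RAID5 arrays."""
    if n_disks % n_groups != 0:
        raise ValueError("disks must divide evenly among the groups")
    per_group = n_disks // n_groups
    if per_group < 3:
        raise ValueError("each RAID5 array needs at least 3 disks")
    # Each array gives up one disk's worth of space to parity.
    return n_groups * (per_group - 1) * disk_gb


# Assumed layout: 20 x 200 GB disks as 4 arrays of 5 disks.
new_server_gb = raid5_groups_usable_gb(20, 200, 4)
print(new_server_gb / 1000, "TB nominal")
```

Under that assumed layout the nominal capacity is 3.2 TB, which is consistent with the roughly 3 TB quoted on the slide.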


ACID Farm w.r.t. CERN farm: Hardware

The choices (1) (2)
  - Clients. No alternatives due to the cost difference: use PCs. But…
      - At CERN hardware upgrade periods are short → use “old”, good-quality (e.g. Intel chipsets), well Linux-tested (certified) hardware
      - Here hardware lifetime is longer → use “recent” hardware (as it becomes “dated” really fast), of middle quality (e.g. VIA chipset, for cost reasons), maybe not yet completely Linux-certified
  - The EIDE disk server shows a great performance/cost ratio
      - Not completely tested at the beginning, but it looked good, and the difference in cost with respect to SCSI-based servers (a factor of three) looked too attractive
  - The Sun
      - At CERN too there is a SUNDEV cluster made available for code quality checking. In addition, some services still run on Suns for stability or commercial software requirement reasons


ACID Farm w.r.t. CERN Farm: Software

Requirements and solutions (1) (2)
  - As compatible as possible
      - Programs should run without recompilation → use the same kernel and compilers
      - Users should find a similar environment → use the same Linux distribution
      - Use CERN patches if they help
  - As independent as possible
      - Do not use too-CERN-specific tools like SUE (hard to port, not so useful)
      - Use official distributions (RedHat) and not CERN “adapted” ones
      - Do not use CERN patches if they do not help
      - Use INFN-Trieste (e.g. LinuxUpdate [L.Strizzolo, T.Macorini], local CUPS implementation [L.Strizzolo]) or INFN solutions whenever available
  - Choose something else if nothing is available, or simply if there is something better around:
      - CERN batch solution too expensive (LSF), nothing interesting at INFN level → use SGE: free, good, supported
      - Monitoring: BigBrother is free and looks nice (1) (2) (3) (4)
      - Software documentation too: found Doxygen; it is so good that it was subsequently adopted by CERN


ACID w.r.t. CERN Farm: Commercial Software

We try to avoid it, if possible (it costs money and it is a source of trouble)
  - The CERN attempt to go for “commercial-only software” failed dramatically!
      - In general: too difficult to interface to the HEP environment
      - In general: it never completely fits HEP requirements
      - In general: not able to follow the fast evolution of Linux and GNU software (e.g. compilers: we are forced to use quite outdated and now unsupported gcc compilers. Objectivity/DB needed gcc 2.95.2, ORACLE needs gcc 2.95.3 or 2.96; the current gcc version is 3.3)
      - Expensive or with unsatisfactory support (and, in any case, no source code available, so no way to fix problems by ourselves)

So, the current idea is to use commercial software only where there are no alternatives
  - Basically only the DBMS (Objectivity/DB 6 before, ORACLE 9i after): too difficult to develop an HEP-specific DBMS. Well, free DBMSs are available too (e.g. MySQL), but it is too dangerous to follow a solution different from the CERN one on this subject…


ACID w.r.t. CERN Farm: HEP Linux, what is going on

Recent (~2003) RedHat change of philosophy
  - Free distribution: “Fedora Project”
      - Free distribution with a release period of 4-6 months (too fast for HEP needs) and just 3 months of support/patching of the previous release (too short for HEP needs)
  - Commercial distribution: “Enterprise”
      - Commercial distribution with 5 years of support of the previous release, but too expensive!

HEP reactions
  - Mandate to the 3 big HEP labs to negotiate with RedHat, but in the end…
  - FNAL
      - Rebuild RHEL from source (legal if done without violating RedHat copyrights!): LTS 3.0.1 (now also available cleared of FNAL specifics and renamed HEPL). FNAL would like to collaborate with other HEP labs
  - SLAC
      - Negotiated with RedHat “via” DOE. RHEL will be used for one year. And after, who knows?
  - CERN
      - Like FNAL (CEL3 rebuild) as the main line. But some RHEL3-WS (~200) are being bought. CEL3 is now under certification (to be finalized by 2Q2004 or so).


ACID w.r.t. CERN Farm: software, what will change

Keep CERN compatibility. Will it be easier? Expensive?
  - Good
      - The CERN port will be less specific (no more SUE, etc.)
      - No more “alternative gcc” compilers (if possible)
      - But with additional “wanted” packages (PINE, …) no longer available from the RedHat distribution to avoid license violations.
      - ACID could probably use the CERN distribution without major problems (to be checked) instead of using the RedHat distribution plus add-ons.
  - And bad
      - The port will be supported for 1-2 years. And after?
      - The RHEL option is still present. That could mean extra costs for software (now we use RHEL (AS2.1) just on the ORACLE server machine). In that case an I.N.F.N.-wide license would be a better solution. Or we could try to use FNAL HEPL. We will see…


Farm management: manpower costs (SW)

Distribution upgrade
  - It is a major task, as a local certification is needed too
      - All applications need to be tested
      - All nodes need to be re-installed from scratch
      - In general it requires more than a month of preparation time
      - Not too frequent: once every few years (~2)

Software installation
  - Complexity and test-debug period depend on the package
      - Can be heavy work (e.g. CASTOR/HSM porting: many months of work)
      - From time to time, upgrades/updates are needed

Patching
  - In general simple but quite frequent (security patches)
      - Can take a lot of time (e.g. as we use a locally patched kernel, we need a complete kernel recompilation after every official patch)
      - And the risk of trouble after a patch is not negligible, in particular after kernel updates


Farm management: manpower costs (HW)

New hardware
  - Purchase
      - Product choice, requests for offers, “CONSIP”, … Very time-consuming and generally boring
  - Installation and/or integration
      - In general not complex, but in some cases it needs time

Maintenance
  - Many parts of the farm are no longer covered by warranty nor under outsourced maintenance
      - Broken parts (disks, boards, …) need to be replaced by hand. That takes a lot of time (1)
      - An example: MicroStar 694D Pro mainboards mount bad-quality electrolytic capacitors (from TAYEH). Out of 11 boards, 7 had failures due to leakage from those capacitors. The intervention requires a complete PC dismount, board removal, capacitor replacement and re-mount. On two boards the capacitor failure damaged downstream electronics: in those cases mainboard replacement was necessary.
  - Power loss (HW failures were many times due to overheating).
      - Quite (better: too) frequent in AREA. No cooling for long periods, with consequent overheating of the machines (in addition, as I always said, the T02 room is definitely too small compared to the hardware installed inside; this will fortunately change soon).


The good and the bad

  - As said: the first Linux Compute Farm installed and managed in an INFN lab
  - First COMPASS home-lab farm in production
  - One of the first CASTOR/HSM installations outside CERN
      - and probably the first one in production
  - First “in production” ORACLE database replica of part of the (COMPASS) events outside CERN
  - Heavily used by the COMPASS-Trieste group
      - Data analysis, Monte Carlo production, RICH software development and analysis, …
  - “Borrowed” for the work of other Trieste groups (LEP, …)

  - It is an “in production” apparatus
      - Interventions have to be immediate, quick (& NOT dirty)
      - It requires continuous monitoring: i.e. someone always has to be present “nearby T02”
  - It always “evolves” (software updates, hardware upgrades) and that requires manpower
  - It is fragile: the probability of failures is high
      - Parts of the software need to be updated and checked very frequently (even every day or so)
      - It is difficult to have a day without the need for intervention somewhere inside the farm


What next

A new project: the “Farm di Sezione”
  - To (try to) merge all local farms into a kind of single entity.
  - It is again something relatively new inside INFN sites
  - It involves Gruppo Calcolo and people from several experiments, from existing farms (ALICE and COMPASS) and new ones
  - Discussion started: to find common requirements and evaluate incompatibilities
  - A place was found: T02 → T01+T02
  - Cooling is being upgraded
  - Some hardware was already acquired
  - R&D will start soon (compatibility tests between the environments of the different present farms, etc.)
  - Consequences on the ACIDs: too early to say anything, we will see…


Acknowledgements and Conclusions

Thanks to
  - R. Birsa
      - Sun management
      - Help in software installation and debugging (e.g. CASTOR would never have been installed without his accurate work on it)
  - V. Duic
      - Data (DB) import, job parallelization tools
  - All people of Gruppo Calcolo
      - Requests for offers
      - Consultancy
      - “Linux Update”

To conclude
  - This farm shows that at INFN-Trieste there is non-negligible IT knowledge (compared to other INFN sites)
  - Computing is becoming more and more relevant in HEP experiments. It will probably be dominant (for good and for bad) at LHC
  - Unfortunately INFN does NOT look so pioneering in that field…