Building Large Scale Fabrics – A Summary
Marcel Kunze, FZK
ACAT 2002, Moscow
Observation
- Everybody seems to need unprecedented amounts of CPU, disk, and network bandwidth
- Trend towards PC-based computing fabrics and commodity hardware
Reports covered in this summary:
- LCG (CERN), L. Robertson
- CDF (Fermilab), M. Neubauer
- D0 (Fermilab), I. Terekhov
- Belle (KEK), P. Krokovny
- HERA-B (DESY), J. Hernandez
- LIGO, P. Shawhan
- Virgo, D. Busculic
- AMS, A. Klimentov

Considerable cost savings with respect to a RISC-based farm, which offers not enough "bang for the buck" (M. Neubauer).
AMS02 Benchmarks
Execution time of the AMS "standard" job compared to CPU clock 1), normalized to the Intel PII 450 MHz reference:

Brand, CPU, Memory                              | OS / Compiler           | "Sim" | "Rec"
------------------------------------------------+-------------------------+-------+------
Intel PII dual-CPU 450 MHz, 512 MB RAM          | RH Linux 6.2 / gcc 2.95 | 1     | 1
Intel PIII dual-CPU 933 MHz, 512 MB RAM         | RH Linux 6.2 / gcc 2.95 | 0.54  | 0.54
Compaq quad α-ev67 600 MHz, 2 GB RAM            | RH Linux 6.2 / gcc 2.95 | 0.58  | 0.59
AMD Athlon 1.2 GHz, 256 MB RAM                  | RH Linux 6.2 / gcc 2.95 | 0.39  | 0.34
Intel Pentium IV 1.5 GHz, 256 MB RAM            | RH Linux 6.2 / gcc 2.95 | 0.44  | 0.58
Compaq dual-CPU PIV Xeon 1.7 GHz, 2 GB RAM      | RH Linux 6.2 / gcc 2.95 | 0.32  | 0.39
Compaq dual α-ev68 866 MHz, 2 GB RAM            | Tru64 Unix / cxx 6.2    | 0.23  | 0.25
Elonex Intel dual-CPU PIV Xeon 2 GHz, 1 GB RAM  | RH Linux 7.2 / gcc 2.95 | 0.29  | 0.35
AMD Athlon 1800MP dual-CPU 1.53 GHz, 1 GB RAM   | RH Linux 7.2 / gcc 2.95 | 0.24  | 0.23
8-CPU Sun Fire 880, 750 MHz, 8 GB RAM           | Solaris 5.8 / C++ 5.2   | 0.52  | 0.45
24-CPU Sun UltraSPARC-III+, 900 MHz, 96 GB RAM  | RH Linux 6.2 / gcc 2.95 | 0.43  | 0.39
Compaq α-ev68 dual 866 MHz, 2 GB RAM            | RH Linux 7.1 / gcc 2.95 | 0.22  | 0.23

1) V. Choutko, A. Klimentov, AMS note 2001-11-01
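Lower is faster here: each entry is the job's wall-clock time divided by that of the PII 450 MHz reference. A minimal sketch of that normalization, with made-up placeholder times rather than the measured ones:

```python
# Normalize wall-clock times of the AMS "standard" job to the reference
# machine (the PII 450 MHz row, defined as 1). The times below are
# hypothetical placeholders, not the measurements behind the table.

REFERENCE = "Intel PII dual-CPU 450 MHz"

sim_times = {  # hypothetical wall-clock seconds per job
    "Intel PII dual-CPU 450 MHz": 1000.0,
    "AMD Athlon 1800MP dual-CPU 1.53 GHz": 240.0,
    "Compaq alpha-ev68 dual 866 MHz": 220.0,
}

ref = sim_times[REFERENCE]
for machine, t in sorted(sim_times.items(), key=lambda kv: kv[1]):
    print(f"{machine:40s} {t / ref:.2f}")  # prints 1.00 for the reference
```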
Fabrics and Networks: Commodity Equipment
[Diagram: CERN wide-area connectivity: SWITCH, C-IXP, WHO, TEN-155, KPNQwest, RENATER and other national research networks, mission-oriented links & USLIC, public Internet, IN2P3, Japan (JEG/Genesis project)]
Needed for LHC at CERN in 2006:
Storage:
- Raw recording rate 0.1-1 GB/s
- Accumulating at 5-8 PetaBytes/year (rough consistency check below)
- 10 PetaBytes of disk
Processing:
- 200,000 of today's (2001) fastest PCs
Networks:
- 5-10 Gbps between main Grid nodes
- Distributed computing effort to avoid congestion: 1/3 at CERN, 2/3 elsewhere
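The recording rate and the yearly volume are consistent if one assumes a typical accelerator year of roughly 10^7 live seconds; that figure is my assumption, not a number from the talk:

```python
# Back-of-envelope check of the accumulation figure, assuming ~10^7
# live seconds per accelerator year (an assumption, not from the talk).
rate_gb_per_s = 0.6          # inside the quoted 0.1-1 GB/s range
live_seconds = 1e7
petabytes_per_year = rate_gb_per_s * live_seconds / 1e6   # 1 PB = 10^6 GB
print(f"~{petabytes_per_year:.0f} PB/year")   # ~6, matching 5-8 PB/year
```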
PC Cluster 5 (Belle)
- 1U servers, Pentium III 1.2 GHz
- 256 CPUs (128 nodes)
PC Cluster 6
- Blade server (3U): LP Pentium III 700 MHz
- 40 CPUs (40 nodes)
Disk Storage
IDE Performance
Basic Questions
- Compute farms contain several thousands of computing elements
- Storage farms contain thousands of disk drives

- How to build scalable systems?
- How to build reliable systems?
- How to operate and maintain large fabrics?
- How to recover from errors?

- EDG deals with the issue (P. Kunszt)
- IBM deals with the issue (N. Zheleznykh); Project Eliza: self-healing clusters
- Several ideas and tools are already on the market
Storage Scalability
- Difficult to scale up to systems of thousands of components while keeping a single system image: NFS automounter, symbolic links, etc. (M. Neubauer, CAF: ROOTD does not need this and allows direct worldwide access to distributed files without mounts; see the sketch below)
- Scalability in size and throughput by means of storage virtualisation; allows non-TCP/IP based systems to handle multi-GB/s
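To make the ROOTD point concrete: a client opens a remote file through a root:// URL and reads it over the network, with no mount on the client side. A minimal sketch using PyROOT; the server name and path are hypothetical:

```python
# Open a remote file served by rootd via a root:// URL; no NFS mount
# or symbolic-link scheme is needed on the client.
import ROOT

f = ROOT.TFile.Open("root://dataserver.example.org//data/run42/events.root")
if f and not f.IsZombie():
    f.ls()      # browse the remote file's contents over the network
    f.Close()
```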
Virtualisation of Storage
[Diagram: data servers mount virtual storage as SCSI devices over a Storage Area Network (FC-AL, InfiniBand, ...); an input load-balancing switch connects Internet/intranet clients; shared data access (Oracle, PROOF); scalable, 200 MB/s sustained]
Storage Elements (M. Gasthuber)
- PNFS = Perfectly Normal FileSystem: stores metadata with the data; 8 hierarchies of file tags (see the sketch below)
- Migration of data (hierarchical storage systems): dCache, a development of DESY and Fermilab
- ACLs, Kerberos, ROOT-aware
- Web monitoring
- Cached as well as direct tape access
- Fail-safe
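PNFS exposes that per-directory metadata through "magic" file names inside the mounted namespace. A minimal sketch; the mount point and tag name are illustrative, not from the talk:

```python
# Read PNFS directory tags through the "magic" file names of the
# mounted namespace. Mount point and tag name are hypothetical.
import os

directory = "/pnfs/example.org/data/hera-b"

# ".(tags)()" lists the tags defined for a directory
with open(os.path.join(directory, ".(tags)()")) as f:
    print(f.read())

# ".(tag)(<name>)" reads a single tag, here a storage group of the
# kind used to steer migration to tape
with open(os.path.join(directory, ".(tag)(sGroup)")) as f:
    print(f.read().strip())
```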
Necessary Admin Tools (A. Manabe)
- System (SW) installation/update: Dolly++ (image cloning)
- Configuration: Arusha (http://ark.sourceforge.net), LCFGng (http://www.lcfg.org)
- Status monitoring / system health check (CPU/memory/disk/network utilization): Ganglia (http://ganglia.sourceforge.net), Palantir (http://www.netsonde.com)
- (Sub-)system service sanity check: PIKT (http://pikt.org), Pica (http://pica.sourceforge.net/wtf.html), cfengine
- Command execution: WANI, a web-based remote command executor (see the sketch below)
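WANI itself is built on Webmin (next slide); the sketch below only illustrates the underlying pattern of fanning one command out to many nodes and collecting each host's output. Host names are hypothetical and password-less ssh is assumed:

```python
# Fan a command out to many nodes over ssh and collect per-host
# results, the core of what a remote command executor like WANI does.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:03d}.example.org" for i in range(200)]  # hypothetical

def run(host, command="uptime"):
    try:
        proc = subprocess.run(["ssh", host, command],
                              capture_output=True, text=True, timeout=30)
        return host, proc.returncode, proc.stdout.strip(), proc.stderr.strip()
    except subprocess.TimeoutExpired:
        return host, -1, "", "timeout"

with ThreadPoolExecutor(max_workers=32) as pool:
    for host, rc, out, err in pool.map(run, NODES):
        print(f"{host}: {'OK' if rc == 0 else 'FAIL'} {out or err}")
```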
WANI is implemented on the `Webmin' GUI.
[Screenshot: command input field, node selection, start button]
[Screenshot: command execution result page, showing host names and the results from 200 nodes on one page; clicking a host name opens its stdout or stderr output]
CPU Scalability
- The current tools scale up to ~1000 CPUs (in the previous example, 10,000 CPUs would require checking 50 pages of 200 nodes each)
- Autonomous operation required: intelligent self-healing clusters
Resource Scheduling
- Problem: how to access local resources from the Grid?
- Local batch queues vs. global batch queues
- Extension of Dynamite (University of Amsterdam) to work with Globus: Dynamite-G (I. Shoshmina); see the sketch below
- Open question: how do we deal with interactive applications on the Grid?
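For context on the local-vs-global queue bridging, Globus Toolkit 2 shipped the globus-job-run client, which hands a job to a gatekeeper whose jobmanager feeds the local batch system. A minimal sketch; the contact string is hypothetical, and this shows the generic Globus path rather than Dynamite-G itself:

```python
# Submit a trivial job through a Globus gatekeeper into a local batch
# queue (jobmanager-pbs). Contact string is hypothetical; this is not
# Dynamite-G, only the generic Globus mechanism it builds on.
import subprocess

contact = "gatekeeper.example.org/jobmanager-pbs"
result = subprocess.run(["globus-job-run", contact, "/bin/hostname"],
                        capture_output=True, text=True)
print(result.stdout.strip())
```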
Conclusions
- A lot of tools exist
- A lot of work still needs to be done in the fabric area to get reliable, scalable systems